Elasticsearch - Plain-text search

Hello,

I'm working on an index in which I want to do plain-text search across several strings.
For example, with 2 documents:

  • Doc 1 : strings "This is an example" and "another example"
  • Doc 2 : strings "This is an test" and "another test"

  • if I search with the string "an", I find both documents
  • if I search with the string "amp", I find Doc 1
  • if I search with the string "tes", I find Doc 2
  • if I search with the string "anoo", no document is found

To do that I use ngram_tokenizer, and a single document can contain a large set of strings to search in. Is this a good solution? Is there a better one?
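To make the intended behaviour concrete, here is a minimal Python sketch that imitates the idea outside Elasticsearch: every string is split into n-grams at index time, and the search term is looked up as one lowercased token. The function names and the min_gram/max_gram values (2 and 10) are assumptions chosen for illustration; they are not the settings of the real index.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of text with a length between min_gram and max_gram."""
    text = text.lower()
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.add(text[start:start + size])
    return grams


def build_index(docs, min_gram=2, max_gram=10):
    """Map each gram to the set of document ids whose strings contain it."""
    index = {}
    for doc_id, strings in docs.items():
        for string in strings:
            for gram in ngrams(string, min_gram, max_gram):
                index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, term):
    """Look the lowercased term up as a single token (no n-grams at search time)."""
    return sorted(index.get(term.lower(), set()))


docs = {
    "Doc 1": ["This is an example", "another example"],
    "Doc 2": ["This is an test", "another test"],
}
index = build_index(docs)
for term in ("an", "amp", "tes", "anoo"):
    print(term, search(index, term))
```

In this toy version a search term shorter than min_gram or longer than max_gram can never match, because no gram of that length is stored.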

When I index documents, several errors occur:

  • In Logstash:
    [2020-05-18T00:16:59,550][INFO ][logstash.outputs.elasticsearch][main] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [27859947][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[test_nouv_structure_es_on_demand_index-2020.05.15][0]] containing [34] requests, target allocation id: HuXyeaExRJOhKUGOVOE5Fg, primary term: 1 on EsThreadPoolExecutor[name = l4g-centoselk02/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@158523e4[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 3246756]]"})

  • In Elasticsearch:
    [2020-03-17T18:18:21,949][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [l4g-centoselk02] fatal error in thread [elasticsearch[l4g-centoselk02][write][T#1]], exiting
    java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.BytesRefHash.rehash(BytesRefHash.java:398) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:309) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
    ...

Do you have any information about these errors?

Thank you for your help.

Hi @Julien1

Yes, maybe ngram_tokenizer is not the best solution.

That may explain this:

exiting
java.lang.OutOfMemoryError: Java heap space

Do you really want to do this?

if I search with the string "an", I find both documents

As "an", "is" and "this" are stop words and will add noise to your search, it is better not to return them as results.

Maybe it's better to take a more realistic example.
I think that with the default configuration and mapping you can already get good results.

Can you share the query you use to search, along with a near-real example? That would be more helpful. Some metrics about your data would help too: the solution may differ depending on whether you search billions of documents or a few thousand.

One last question: what do you use Logstash for? Could you consider using ingest nodes instead?

Hi @gabriel_tessier,

Here is an example of a document inserted in ES:

{
	"_index" : "test_nouv_structure_es_on_demand_index-2020.05.19",
	"_type" : "_doc",
	"_id" : "a396ead4-1fef-4bb4-b4ae-d14b296d03e3",
	"_score" : 4.7024536,
	"_source" : {
	  "method" : "WS1_OpenSession-1.2",
	  "id" : "a396ead4-1fef-4bb4-b4ae-d14b296d03e3",
	  "dateIn" : "2020-05-19T02:54:00.2103641+02:00",
	  "notManagedException" : false,
	  "streamIn" : """
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:exp="http://www.expedito.fr/">
<soap:Header/>
<soap:Body>
   <exp:WS1_OpenSession>
	   <exp:login>A_login</exp:login>
	   <exp:password>A_password_blablabla</exp:password>
   </exp:WS1_OpenSession>
</soap:Body>
</soap:Envelope>
""",
	  "@timestamp" : "2020-05-19T00:57:04.040Z",
	  "properties" : {
		"indexName" : "test_nouv_structure_es_on_demand_index"
	  },
	  "serverName" : "L4G-W12IISXX",
	  "login" : "A_login",
	  "indexName" : "test_nouv_structure_es_on_demand_index",
	  "iPAddress" : "109.239.113.1",
	  "companyId" : 207,
	  "streamInForFTSearch" : [
		"A_login",
		"A_password_blablabla"
	  ],
	  "streamOut" : """<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><soap:Body><WS1_OpenSessionResponse xmlns="http://www.expedito.fr/"><WS1_OpenSessionResult><ErrorCode>0</ErrorCode><Value>948e5391-d7e3-4fdb-9952-020e0eb1d5f2</Value></WS1_OpenSessionResult></WS1_OpenSessionResponse></soap:Body></soap:Envelope>""",
	  "dateOut" : "2020-05-19T02:54:00.2572482+02:00",
	  "sessionId" : "948e5391-d7e3-4fdb-9952-020e0eb1d5f2",
	  "streamOutForFTSearch" : [
		"0",
		"948e5391-d7e3-4fdb-9952-020e0eb1d5f2"
	  ],
	  "duration" : 46.0
	}
  }

I want to do plain-text search on the "streamInForFTSearch" and "streamOutForFTSearch" string arrays.
For example, searching for "ord_blablabl" in "streamInForFTSearch" must find the document. Searching for "948e5391-d7e3-4fdb-99XX" in "streamOutForFTSearch" must not find it.

At the moment, I use Logstash to insert documents into Elasticsearch (Filebeat -> Logstash -> Elasticsearch), but in the future I want to remove Logstash and use Elasticsearch ingest nodes to insert documents.

Thank you for your help.

The metrics: 12 250 000 documents, with different kinds of streams in the "streamIn" and "streamOut" fields. The one I showed above is a simple one.

Thanks for your help.

Ok I got it,

Searching in an array has some side effects. In your example, streamOutForFTSearch has the value 0 mapped as a string, so you may lose some features if you want to do math on it.

From the documentation:

In Elasticsearch, there is no dedicated array datatype. Any field can contain zero or more values by default, however, all values in the array must be of the same datatype.

https://www.elastic.co/guide/en/elasticsearch/reference/current/array.html

Also, if you want to exclude, for example, the 0 but include the 948xxx value, you may not be able to get the result you expect.

This is just for information. Maybe you already know about it, you deliberately structured your documents this way, and you know it's the right way for you; if so, you can ignore it.

But just in case, did you look at ECS? It may be useful for structuring your documents beforehand.
https://www.elastic.co/guide/en/ecs/current/ecs-reference.html

Sorry to be verbose. :sweat_smile:

What about your mapping? If you have the default mapping for string content, you may have a multi-field with text and keyword, so you can use something like:

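{
  "bool": {
    "should": [
      {
        "query_string": {
          "query": "keyword_low*",
          "fields": [
            "field_1^5",
            "field_1.*^3",
            "field_2"
          ],
          "default_operator": "AND",
          "analyzer": "Your_Analyzer",
          "fuzziness": "auto"
        }
      },
      {"term": {"field_1.keyword": "keyword"}},
      {"term": {"field_2": "keyword"}}
    ]
  }
}

The query_string clause searches for strings that start with the term, and the term clauses search for the exact term. Here "keyword_low" stands for the lowercased search term and "keyword" for the search term itself; add more fields and term clauses as needed.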

A few comments about the code above:
I wrap my query in a bool should to combine a lowercase "starts with" search and a term search.
One important thing about your data: if you search in any European language other than English, it's better to index your data in lower case (using multi-fields, which is why the mapping is important). Also store your data in ASCII to prevent problems with [éèëù....].
You can normalize your keyword the same way and search in your normalized field to get a better match.
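As an illustration of normalizing the keyword on the client side, here is a small Python sketch that lowercases a term and strips accents. It only approximates what the lowercase and asciifolding token filters do in Elasticsearch (for example, it does not rewrite characters such as "ß" or "ø"), and the function name normalize_term is made up for this example.

```python
import unicodedata


def normalize_term(term):
    """Lowercase a term and drop combining accents (rough stand-in for lowercase + asciifolding)."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", term)
    # Keep only the characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(normalize_term("Éèëù"))
print(normalize_term("Résumé"))
```

If the same normalization is applied at index time and at search time, a user typing "resume" or "Résumé" ends up searching for the same token.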

About the fields list: here you can list the fields you search, each with a weight (the ^5 notation). Depending on your needs, you may not need weights. You can search on all or part of your multi-fields.
All details and options are listed here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-multi-field

At the end of the should you have a term query. It is useful when you search for an exact match: say you have the value of login and search on that field. Here too you can apply a boost.
More details in the doc here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

All the full-text queries are listed here; you may find one better suited to your needs.
https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html

I already wrote a lot, but again, if you can provide your mapping and index settings so I know which analyzer you use, it would help.

My example may be a little complicated, but so far this request has worked fine (possibly with small changes) from version 0.9 to 6.x, and maybe also 7 (I need to upgrade my server :grin:).

If you plan to remove Logstash, it means that you don't have that much data to ingest, since it's usually better to let a Logstash server ingest and transform the data to take load off your Elasticsearch server. But if your Elasticsearch cluster can handle both, it's OK.

Sorry, it took me so long to reply that I missed your post with the metrics:

The metrics : 12 250 000 documents

According to the line below, you store your data in daily indices.

"_index" : "test_nouv_structure_es_on_demand_index-2020.05.19",

Is it 12M documents per index?

It can be important if you need to reindex your data to change your mapping.

The 12M documents will be split across several indices created by ILM, but the search has to be performed on all 12M documents.

My mapping:

PUT _template/test_nouv_structure_es_on_demand_index_template
{
	"index_patterns" : ["test_nouv_structure_es_on_demand_index*"],
	"settings": {
		"index.mapping.total_fields.limit": 1000,		
        "max_ngram_diff" : 47,
		"analysis": {
		  "analyzer": {
			"wordPartAnalyzer": {
			  "tokenizer": "ngram_tokenizer",
			  "max_token_count" : 100000,
			  "filter" : ["lowercase","asciifolding"]
			},
			"customKeywordAnalyzer": {
			  "tokenizer": "keyword",
			  "filter" : ["lowercase","asciifolding"]
			},
			"serverName_wordPartAnalyzer": {
			  "tokenizer": "serverName_ngram_tokenizer",
			  "max_token_count" : 100000,
			  "filter" : ["lowercase","asciifolding"]
			}
		  },
		  "tokenizer": {
			"ngram_tokenizer": {
			  "type": "ngram",
			  "min_gram": 3,
			  "max_gram": 50
			},
			"serverName_ngram_tokenizer": {
			  "type": "ngram",
			  "min_gram": 4,
			  "max_gram": 25
			}
		  }
		}
	},
	"mappings" : {
		"properties": {
			"StreamInAttributesForFTSearch" : {
				"type" : "text",
				"analyzer": "wordPartAnalyzer",
				"search_analyzer": "customKeywordAnalyzer"
			},
			"StreamOutAttributesForFTSearch" : {
				"type" : "text",
				"analyzer": "wordPartAnalyzer",
				"search_analyzer": "customKeywordAnalyzer"
			}
		},
		"dynamic_templates" : [
		{
			"indexNameProperties" : {
				"match_pattern": "regex",
				"match":   "^(indexName)$",
				"mapping" : {
					"type" : "text"
				}
			}
		},
		{
			"idProperty" : {
				"match_pattern": "regex",
				"match":   "^(id)$",
				"mapping" : {
					"type" : "text"
				}
			}
		},
		{
			"dateProperties" : {
				"match_mapping_type": "date",
				"match":   "*",
				"mapping" : {
					"type" : "date"
				}
			}
		},		
		{
			"streamInOutStringProperties" : {
				"match_pattern": "regex",
				"match":   "^(streamIn|streamOut)$",
				"mapping" : {
					"type" : "text"
				}
			}
		},
		{
			"sessionIdProperty" : {
				"match_pattern": "regex",
				"match":   "^(sessionId)$",
				"mapping" : {
					"type" : "text",
					"ignore_above" : 36					
				}
			}
		},
		{
			"durationProperty" : {
				"match_pattern": "regex",
				"match" : "^(duration)$",
				"mapping" : {
					"type" : "double"
				}
			}
		},
		{
			"externCallDurationProperty" : {
				"match_pattern": "regex",
				"match" : "^(externCallDuration)$",
				"mapping" : {
					"type" : "long"
				}
			}
		},
		{
			"exceptionMessageProperty" : {
				"match_pattern": "regex",
				"match" : "^(exceptionMessage)$",
				"mapping" : {
					"type" : "text",
					"ignore_above" : 4000					
				}
			}
		},
		{
			"loginProperty" : {
				"match_pattern": "regex",
				"match":   "^(login)$",
				"mapping" : {
					"type" : "text",
					"fields" : {
						"keyword" : {
							"type" : "keyword",
							"ignore_above" : 25
						}
					}
				}
			}
		},
		{
			"notManagedExceptionProperty" : {
				"match_pattern": "regex",
				"match":   "^(notManagedException)$",
				"mapping" : {
					"type" : "boolean"
				}
			}
		},
		{
			"returnTreatmentExceptionProperty" : {
				"match_pattern": "regex",
				"match":   "^(returnTreatmentException)$",
				"mapping" : {
					"type" : "long"
				}
			}
		},
		{
			"serverNameProperty" : {
				"match_pattern": "regex",
				"match":   "^(serverName)$",
				"mapping" : {
					"type" : "text",
					"analyzer": "serverName_wordPartAnalyzer",
					"search_analyzer": "customKeywordAnalyzer"
				}
			}
		},
		{
			"methodProperty" : {
				"match_pattern": "regex",
				"match":   "^(method)$",
				"mapping" : {
					"type" : "text",
					"analyzer": "wordPartAnalyzer",
					"search_analyzer": "customKeywordAnalyzer"
				}
			}
		},					
		{
			"tagSenderProperty" : {
				"match_pattern": "regex",
				"match":   "^(tagSender)$",
				"mapping" : {
					"type" : "text",
					"ignore_above" : 25					
				}
			}
		},
		{
			"tagServiceProperty" : {
				"match_pattern": "regex",
				"match":   "^(tagService)$",
				"mapping" : {
					"type" : "text",
					"ignore_above" : 25					
				}
			}
		},
		{
			"treatmentExceptionProperty" : {
				"match_pattern": "regex",
				"match":   "^(treatmentException)$",
				"mapping" : {
					"type" : "text",
					"ignore_above" : 30					
				}
			}
		},
		{
			"companyIdProperty" : {
				"match_pattern": "regex",
				"match":   "^(companyId)$",
				"mapping" : {
					"type" : "long"
				}
			}
		},	
		{
			"timestampProperty" : {
				"match_pattern": "regex",
				"match":   "^(@timestamp)$",
				"mapping" : {
					"type" : "date"
				}
			}
		},		
		{
			"streamInFullTextSearchStringProperty" : {
				"path_match":   "streamInForFTSearch",
				"mapping" : {
					"type" : "text",
					"copy_to" : "StreamInAttributesForFTSearch"
				}
			}
		},		
		{
			"streamOutFullTextSearchStringProperty" : {
				"path_match":   "streamOutForFTSearch",
				"mapping" : {
					"type" : "text",
					"copy_to" : "StreamOutAttributesForFTSearch"
				}
			}
		}
	]
	}
}

Sorry for the late reply.

The mapping could be improved by using multi-fields:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html#_multi_fields_with_multiple_analyzers

This way you can reduce the ngram diff, which can prevent using too much memory, and mix several analyzers. If you just take the keyword analyzer, or the whitespace analyzer that splits on spaces, you already cover one part of your ngrams.

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-analyzer.html
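To get a feeling for why the ngram diff matters for memory, here is a rough Python sketch that counts how many tokens an ngram tokenizer would emit for one string, assuming it emits every substring with a length between min_gram and max_gram. The helper name is made up, and the string lengths are only examples: 36 is the length of a UUID such as the sessionId above, and 1000 is an arbitrary longer value.

```python
def ngram_token_count(length, min_gram, max_gram):
    """Count the substrings of every size from min_gram to max_gram in a string of the given length."""
    total = 0
    for size in range(min_gram, min(max_gram, length) + 1):
        # A string of this length has (length - size + 1) substrings of this size.
        total += length - size + 1
    return total


for length in (36, 1000):
    for min_gram, max_gram in ((3, 50), (3, 5)):
        count = ngram_token_count(length, min_gram, max_gram)
        print(f"length={length} min_gram={min_gram} max_gram={max_gram} tokens={count}")
```

With min_gram 3 and max_gram 50, a 36-character value gives 595 tokens and a 1000-character value gives 46776, against 99 and 2991 with max_gram 5. Keep in mind that a smaller max_gram also changes how you have to search: if the search analyzer keeps the term as one token, a term longer than max_gram no longer matches any stored gram.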

I read your first post again, about what you want to search.

if I search with the string "an", I find both documents
if I search with the string "amp", I find Doc 1
if I search with the string "tes", I find Doc 2
if I search with the string "anoo", no document is found

For the last one, do you mean that you want to find something when the search term is misspelled? If so, you can add a fuzzy query to your bool should:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html
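As a sketch of what adding a fuzzy clause could look like, the Python snippet below builds the query body as a dictionary and appends a fuzzy clause to an existing bool should. The field names (field_1, field_1.keyword), the base query and the helper name are placeholders, not taken from the real mapping. With "fuzziness": "AUTO", the allowed edit distance depends on the term length: terms of 1 or 2 characters must match exactly, terms of 3 to 5 characters allow one edit, and longer terms allow two.

```python
import json


def with_fuzzy_clause(bool_query, field, value, fuzziness="AUTO"):
    """Return a copy of bool_query with one more fuzzy clause in its should list."""
    clause = {"fuzzy": {field: {"value": value, "fuzziness": fuzziness}}}
    bool_body = dict(bool_query.get("bool", {}))
    # Copy the list so the caller's query is left untouched.
    bool_body["should"] = list(bool_body.get("should", [])) + [clause]
    return {"bool": bool_body}


# Hypothetical starting point: a bool query with a single term clause.
base = {"bool": {"should": [{"term": {"field_1.keyword": "keyword"}}]}}
query = with_fuzzy_clause(base, "field_1", "anoo")
print(json.dumps({"query": query}, indent=2))
```

Note that a fuzzy clause compares the search value with the individual terms stored in the index, so on a field analyzed into n-grams it would be compared against each gram rather than against whole words.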