Elasticsearch 7.7 crashing for Term Query if term text is 200-300 chars

pappukhode · July 16, 2020, 6:44am

Hi,
I recently upgraded Elasticsearch from 5.5 to 7.7.
I have only one index, with 30 fields and 6K documents. The dataset is very simple in nature. [1g memory provided in JVM options]
The dataset has a description field containing text of around 250-400 chars.

I am using a search query that combines bool, must, query_string, and term.
When I search for an exact indexed description of 250 chars, the query takes a long time to respond [15 sec] and Elasticsearch crashes.
If I use a short search term of 20-25 chars, ES works well.
All the words in the search term are fuzzy terms; we append ~ to the end of each word.
In ES 5.5 this scenario worked well with no issues, so it looks like something has broken in ES 7.7.

Could you please suggest how I should proceed with this issue?
Is there any limit on the length of the input search term?
How much memory should I set in development and production?

Ignacio_Vera · July 16, 2020, 7:15am

Could you share a bit more information about the crash (e.g. a stack trace)?

pappukhode · July 16, 2020, 7:22am

It's out of memory. I can see a lot of log lines like
[gc][2119] overhead, spent [3.7s] collecting in the last [6s]
Do you think fuzzy is consuming a lot of memory?
The above query takes 500ms without fuzzy; with fuzzy it takes 11 sec, produces GC warnings in the log, and crashes.

Ignacio_Vera · July 16, 2020, 7:31am

I think you are hitting this Lucene issue:

https://issues.apache.org/jira/browse/LUCENE-9286

But it is difficult to tell without a heap dump. Maybe you can try to get the hot threads when running the query?
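
For reference, hot threads come from the nodes hot threads API (GET /_nodes/hot_threads). Below is a minimal Python sketch for capturing them while the slow query is running. It assumes an unsecured node reachable at http://localhost:9200; that address and the helper names hot_threads_url and fetch_hot_threads are illustrative and not taken from this thread.

```python
import urllib.parse
import urllib.request


def hot_threads_url(base_url, threads=3):
    """Build the URL for the nodes hot threads API."""
    query = urllib.parse.urlencode({"threads": threads})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base_url="http://localhost:9200", threads=3):
    """Return the plain-text hot threads report (assumes no auth or TLS)."""
    url = hot_threads_url(base_url, threads)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


# Usage, from a second terminal while the fuzzy query is executing:
#   print(fetch_hot_threads())
```

The report is plain text, so it can be pasted straight into a reply.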

pappukhode · July 16, 2020, 8:33am

@Ignacio_Vera
[2020-07-16T01:44:40,518][WARN ][o.e.m.j.JvmGcMonitorService] [gc][2118] overhead, spent [1.1s] collecting in the last [1.1s]
[2020-07-16T01:44:46,589][WARN ][o.e.m.j.JvmGcMonitorService] [gc][2119] overhead, spent [3.7s] collecting in the last [6s]
[2020-07-16T01:44:46,170][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [HJL013760] fatal error in thread [elasticsearchsearch][T#11]], exiting
java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.ArrayUtil.growExact(ArrayUtil.java:302) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:311) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.util.automaton.Automaton$Builder.addTransition(Automaton.java:770) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.util.automaton.UTF32ToUTF8.all(UTF32ToUTF8.java:251) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.util.automaton.UTF32ToUTF8.end(UTF32ToUTF8.java:231) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.util.automaton.UTF32ToUTF8.build(UTF32ToUTF8.java:194) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.util.automaton.UTF32ToUTF8.convertOneEdge(UTF32ToUTF8.java:137) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.util.automaton.UTF32ToUTF8.convert(UTF32ToUTF8.java:307) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:237) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:140) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.search.FuzzyTermsEnum.buildAutomata(FuzzyTermsEnum.java:154) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.apache.lucene.search.FuzzyQuery.<init>(FuzzyQuery.java:111) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]
at org.elasticsearch.index.mapper.StringFieldType.fuzzyQuery(StringFieldType.java:78) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.index.search.QueryStringQueryParser.getFuzzyQuerySingle(QueryStringQueryParser.java:466) ~[elasticsearch-7.7.0.jar:7.7.0]

Mark_Harwood · July 16, 2020, 8:51am

Is this field tokenized? Can you share the mapping?
Generally, looking for similar texts would be done using tokenized fields and the more_like_this query.

Searching long untokenized fields with fuzzy will be expensive, and it only allows a maximum edit distance of 2 characters between the search string and the matched values.
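
To make the 2-character limit concrete, here is a small Python sketch of edit distance. It is an illustration only: it uses plain Levenshtein distance (insert, delete, substitute), whereas Lucene's fuzzy matching by default also counts a swap of two adjacent characters as a single edit. The names edit_distance, fuzzy_can_match and MAX_EDITS are made up for this example.

```python
MAX_EDITS = 2  # upper bound on the edit distance for a fuzzy match


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_can_match(query_term, indexed_term):
    """True if the two terms are within the fuzzy edit-distance limit."""
    return edit_distance(query_term, indexed_term) <= MAX_EDITS
```

With this, fuzzy_can_match("sychology", "psychology") is true (one missing letter), while a 250-char untokenized value with three or more typos spread across it could never match, however similar it looks overall.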

pappukhode · July 16, 2020, 9:06am

We have not modified or customized the _mapping. ES creates the fields and data types in the mapping by default.
Following its default behaviour, Elasticsearch has mapped the description field as text with a keyword sub-field.

Mark_Harwood · July 16, 2020, 9:07am

Which of these 2 fields are you searching?
Seeing the mapping and query would help.

pappukhode · July 16, 2020, 9:13am

Snippet of my query:
"query": {
  "bool": {
    "must": [
      {
        "query_string": {
          "query": "Carl~ Rogers~ founder~ humanistic~ psychology~ movement~ revolutionized~ psychotherapy~ influence~ has~ become~ mainstream~ psychology~ and so on.",
          "default_operator": "AND"
        }
      },
      {
        "term": {
          "containerName.keyword": "Book Catalog"
        }
      .....
}

_mappings: the description field looks like this:
{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}

If I remove the fuzzy ~ char from the search terms, the query works fine, but with the ~ char I face a performance issue and ES crashes.

Mark_Harwood · July 16, 2020, 9:29am

OK. That's searching the tokenized field but using fuzzy on everything, which is expensive.

If you do a lot of this type of fuzzy matching, it's probably more efficient to use ngrams,
e.g.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "apostrophe",
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          },
          "ngram": {
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }
    }
  }
}

POST my_index/_doc/1
{
  "description":"Carl Rogers founder humanistic psychology movement revolutionized psychotherapy influence has become mainstream psychology "
}

POST my_index/_search
{
  "query": {
    "query_string": {
      "query": "Carl Rogers founder humanistic sychology movement revolutionised psychotherapy influence has become mainstream",
      "default_operator": "OR",
      "default_field": "description.ngram"
    }
  }
}
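
As a rough illustration of why the ngram field tolerates the misspellings in that query ("sychology", "revolutionised") without any ~ operators, here is a Python sketch of trigram overlap. It is a simplification and not what Elasticsearch runs internally: it lowercases and splits on single words, whereas an ngram tokenizer with default settings also emits grams that span spaces. The helper names trigrams and shared_fraction are made up for this example.

```python
def trigrams(word):
    """Return the set of lowercase character 3-grams in a word."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def shared_fraction(query_word, indexed_word):
    """Fraction of the query word's trigrams that the indexed word contains."""
    query_grams = trigrams(query_word)
    if not query_grams:
        return 0.0
    return len(query_grams & trigrams(indexed_word)) / len(query_grams)


pairs = [("sychology", "psychology"), ("revolutionised", "revolutionized")]
for query_word, indexed_word in pairs:
    print(query_word, indexed_word, shared_fraction(query_word, indexed_word))
# prints:
#   sychology psychology 1.0
#   revolutionised revolutionized 0.75
```

Because most trigrams of a misspelled word still appear in the index, an OR query over description.ngram scores the intended document highly using ordinary term lookups, with no fuzzy automaton to build per word.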
