BooleanQuery$TooManyClauses with synonym filter

How can I avoid the TooManyClauses failure when I use the kuromoji tokenizer with a synonym filter whose dictionary contains about 500K entries? How can I calculate an appropriate value for the max clause count? Or is there any way to avoid this failure with this synonym dictionary other than increasing the max_clause_count value?

Situation: Simple Query String fails with TooManyClauses for a certain word. I can avoid the failure for that word by setting max_clause_count to 153600, but it occurs again for another word, which seems to need an even larger value.

Additional information: I'm using the kuromoji tokenizer with a synonym filter whose dictionary contains about 500K entries. The same word does not fail without the synonym filter. It also does not fail with the synonym filter when a smaller synonym dictionary is used.

Environment:
Elasticsearch v5.6.8
/etc/elasticsearch/elasticsearch.yml
indices.query.bool.max_clause_count: 153600

Query body:
{
  "query": {
    "simple_query_string": {
      "query": "atext",
      "fields": ["field1"]
    }
  }
}
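For completeness, here is a hedged sketch of issuing the same query through the Python client that appears elsewhere in this thread; the "es" client and "esIndex" name are assumptions carried over from the other snippets:

```python
# Sketch (assumption): the elasticsearch-py client "es" and the index name
# "esIndex" are configured as in the analyze example elsewhere in this thread.
query_body = {
    "query": {
        "simple_query_string": {
            "query": "atext",
            "fields": ["field1"],
        }
    }
}
# This call would raise an error wrapping the TooManyClauses failure once
# the synonym expansion exceeds indices.query.bool.max_clause_count:
# resp = es.search(index=esIndex, body=query_body)
```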

Error log: /var/log/elasticsearch/elasticsearch.log
org.elasticsearch.index.query.QueryShardException: failed to create query: {
"simple_query_string" : {
"query" : "atext",
"fields" : [
"field1^1.0"
],
"flags" : -1,
"default_operator" : "or",
"lenient" : false,
"analyze_wildcard" : false,
"boost" : 1.0
}
}
Caused by: org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 153600

Do you know the word the query fails for? Is it just one token? Can you check what it analyzes to using the "_analyze" endpoint with the same analyzer you have configured for that target field?
For further ideas, and so others can chime in on your question, it would also help to see the analysis chain and the mapping of your index. It would also be interesting to know whether you observe the same behaviour with a more recent version of ES than 5.6.

Hi, cbuescher. Thank you for your support. Here are the answers.

Elasticsearch analyzes the word into 5097 tokens using the same analyzer configured for the target field. The word is Japanese.

data = es.indices.analyze(
    index=esIndex,
    body={
        "analyzer": my_analyzer,
        "text": "atext",
    },
)
numoftoken = len(data.get("tokens"))

When max_clause_count is 102400:
word.A : 5097 tokens -> TooManyClauses error
word.B : 716 tokens -> TooManyClauses error
word.C : 4691 tokens -> OK

When max_clause_count is 153600:
word.A : 5097 tokens -> TooManyClauses error
word.B : 716 tokens -> OK
word.C : 4691 tokens -> OK
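These numbers suggest the raw token count alone does not determine the clause count (word.B trips the 102400 limit with only 716 tokens, while word.C passes with 4691). One plausible explanation is that multi-word synonyms stack several alternative tokens at the same position, and the number of term combinations can grow multiplicatively with those stacks. Here is a rough, hypothetical illustration based on the "position" field in the _analyze response; this is an assumption for illustration, not Lucene's exact clause accounting:

```python
from collections import Counter
from math import prod


def estimated_combinations(tokens):
    """Multiply the number of alternative tokens at each position.

    Rough illustration only -- not Lucene's actual clause accounting.
    The "position" field comes from the _analyze response; stacked
    synonyms share a position, and a query builder that enumerates
    term combinations can grow multiplicatively with such stacks.
    """
    per_position = Counter(t["position"] for t in tokens)
    return prod(per_position.values())


# Hypothetical _analyze output: 5 tokens, but 3 x 2 = 6 combinations.
tokens = [
    {"token": "a", "position": 0},
    {"token": "b", "position": 0},
    {"token": "c", "position": 0},
    {"token": "d", "position": 1},
    {"token": "e", "position": 1},
]
print(estimated_combinations(tokens))  # 6
```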

Here are the settings and the mapping of the index.
setting = {
    "settings": {
        "index": {"number_of_shards": 1},
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "search-kuromoji",
                    "filter": ["synonym", "greek_lowercase", "katakana_readingform"],
                }
            },
            "tokenizer": {
                "search-kuromoji": {"type": "kuromoji_tokenizer", "mode": "search"}
            },
            "filter": {
                "synonym": {"type": "synonym", "synonyms_path": SYNONYMS_PATH},
                "greek_lowercase": {"type": "lowercase", "language": "greek"},
                "katakana_readingform": {"type": "kuromoji_readingform", "use_romaji": False},
            },
        },
    }
}
mappings = {
    "symptom": {
        "properties": {
            "field1": {"type": "text", "index": "true", "analyzer": "my_analyzer"},
            "code": {"type": "text"},
        }
    }
}

es.indices.create(index=esIndex, body=setting, request_timeout=30)
es.indices.put_mapping(index=esIndex, doc_type=a_doctype, body=mappings)
