Fuzzy search throws "too complex to determinize" exception

Youxu · January 29, 2018, 10:52pm

We are seeing too_complex_to_determinize_exception errors when running fuzzy searches.

Exception details:
too_complex_to_determinize_exception: Determinizing automaton with 41207 states and 74541 transitions would result in more than 10000 states., ES Status: 500

The following example query DSL causes the exception:

"query": {
  "multi_match": {
    "query": "nous savons que vous voulez vraiment commencer dès maintenant, mais vous allez devoir patienter un peu. recherchez dans le windows store la date de lancement.",
    "fuzziness": "AUTO",
    "fields": ["Term"]
  }
}

Basically, the exception occurs when the query text is long.
If I remove "fuzziness" from the query DSL, there is no exception.

These are the analysis settings:

{
  "analysis": {
    "filter": {
      "whitespace_normalization": {
        "pattern": "\\s+",
        "type": "pattern_replace",
        "replacement": ""
      }
    },
    "analyzer": {
      "keyword_ngram_suggest": {
        "filter": [
          "lowercase",
          "whitespace_normalization",
          "ngram_filter"
        ],
        "type": "custom",
        "tokenizer": "keyword"
      },
      "lowercase_norm_keyword": {
        "filter": [
          "lowercase",
          "whitespace_normalization",
          "trim"
        ],
        "type": "custom",
        "tokenizer": "keyword"
      }
    }
  }
}

And the field mapping:

"Term": {
  "type": "text",
  "analyzer": "keyword_ngram_suggest",
  "search_analyzer": "lowercase_norm_keyword"
}

I guess the reason is that, with my settings, the whole input query is treated as a single token, and fuzzy search cannot handle very long tokens well.

My question: is there a setting that lets me raise this limit so that longer tokens are allowed in fuzzy searches?

AlanWoodward · January 30, 2018, 1:14pm

Hi @Youxu,

There isn't a setting in Elasticsearch to lift the cap on determinized states, no (in fact, there isn't even a way of doing it in Lucene at the moment via FuzzyQuery). I wouldn't recommend doing it anyway, as it would be a very inefficient way of searching. Is there a reason why you aren't breaking things up on whitespace and doing something like a minimum-should-match query here?
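
For illustration, a minimal sketch of what such a query could look like. It assumes a hypothetical subfield, here called "Term.words", that is analyzed into one token per word (for example with the standard analyzer); that subfield is not part of the mapping posted above, and the "75%" threshold is only a placeholder.

```json
{
  "query": {
    "match": {
      "Term.words": {
        "query": "nous savons que vous voulez vraiment commencer",
        "fuzziness": "AUTO",
        "minimum_should_match": "75%"
      }
    }
  }
}
```

With one token per word, fuzziness is applied to each short word separately rather than to the whole sentence as a single long token.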

Youxu · February 8, 2018, 5:57pm

I am implementing an autocomplete feature using Elasticsearch. The requirement is that, for any text the user types in the search box, only terms that start with that text (allowing for typos and redundant whitespace) appear in the suggested terms list.

Some examples:

Suppose there are 3 terms in the index:

google account
sign in with google account
how to sign in google account

When the user types "g", "go" or "goo", only "google account" appears in the terms list.
When the user types "si", "sign in", or "sig in", only "sign in with google account" appears.
When the user types "sign with", nothing appears.

That is why I defined the Term field as a keyword with an ngram filter and the whitespace_normalization filter.

Do you have an alternative solution for this kind of starts-with autocomplete?

AlanWoodward · February 15, 2018, 2:29pm

Hi @Youxu

Could you use a 'truncate' filter to limit token length? Putting in an entire sentence of text is presumably an edge case; most of the time you're expecting people to type a few characters, correct? So truncating things to, say, 100 characters should leave the vast majority of users unaffected, while preventing errors for those who copy and paste large amounts of text in.
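
As a rough sketch of that idea, the search analyzer from the settings above could get a truncate token filter appended to its chain. The filter name "truncate_search_input" is made up for this example, and the length of 100 is just the figure mentioned above; it may need to be lower if the exception still shows up.

```json
{
  "analysis": {
    "filter": {
      "whitespace_normalization": {
        "type": "pattern_replace",
        "pattern": "\\s+",
        "replacement": ""
      },
      "truncate_search_input": {
        "type": "truncate",
        "length": 100
      }
    },
    "analyzer": {
      "lowercase_norm_keyword": {
        "type": "custom",
        "tokenizer": "keyword",
        "filter": [
          "lowercase",
          "whitespace_normalization",
          "trim",
          "truncate_search_input"
        ]
      }
    }
  }
}
```

Placing the truncate filter last means the limit applies to the token after lowercasing and whitespace removal, which is the form the fuzzy query actually sees.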

Youxu · February 17, 2018, 10:03pm

Thanks for the suggestion, I think it makes sense!

system · March 17, 2018, 10:03pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
