Fuzzy search throws too_complex_to_determinize_exception

We saw some too_complex_to_determinize_exception errors while doing fuzzy searches.

Exception details:
too_complex_to_determinize_exception: Determinizing automaton with 41207 states and 74541 transitions would result in more than 10000 states. ES Status: 500

The following is an example query DSL that caused the exception:

"query": {
  "multi_match": {
    "query": "nous savons que vous voulez vraiment commencer dès maintenant, mais vous allez devoir patienter un peu. recherchez dans le windows store la date de lancement.",
    "fuzziness": "AUTO",
    "fields": ["Term"]
  }
}

Basically, the exception occurs when the query text is long. If I remove "fuzziness" from the query DSL, there is no exception.

And these are the analysis settings:

{
  "analysis": {
    "filter": {
      "whitespace_normalization": {
        "pattern": "\\s+",
        "type": "pattern_replace",
        "replacement": ""
      }
    },
    "analyzer": {
      "keyword_ngram_suggest": {
        "filter": [
          "lowercase",
          "whitespace_normalization",
          "ngram_filter"
        ],
        "type": "custom",
        "tokenizer": "keyword"
      },
      "lowercase_norm_keyword": {
        "filter": [
          "lowercase",
          "whitespace_normalization",
          "trim"
        ],
        "type": "custom",
        "tokenizer": "keyword"
      }
    }
  }
}

And the field mappings:

"Term": {
  "type": "text",
  "analyzer": "keyword_ngram_suggest",
  "search_analyzer": "lowercase_norm_keyword"
}

I guess the reason is that, with my settings, the whole input query is treated as a single token, and fuzzy search cannot handle very long tokens well.

My question is: is there a soft-limit setting whose threshold I can increase to allow longer tokens when doing fuzzy search?

Hi @Youxu,

There isn't a setting in Elasticsearch to lift the cap on determinized states, no (in fact, there isn't even a way of doing it in Lucene at the moment via FuzzyQuery). I wouldn't recommend it anyway, as it would be a very inefficient way of searching. Is there a reason why you aren't breaking things up on whitespace and doing something like a minimum-should-match query here?
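For reference, here is a minimal sketch of what that could look like, assuming the Term field were indexed with a whitespace-splitting analyzer (e.g. the standard analyzer) rather than the keyword tokenizer; the 75% value is purely illustrative:

"query": {
  "multi_match": {
    "query": "nous savons que vous voulez vraiment commencer dès maintenant, mais vous allez devoir patienter un peu. recherchez dans le windows store la date de lancement.",
    "fields": ["Term"],
    "fuzziness": "AUTO",
    "minimum_should_match": "75%"
  }
}

Each whitespace-separated term then gets its own small fuzzy automaton instead of one huge automaton for the whole sentence.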

I am implementing an autocomplete feature using Elasticsearch. Our requirement is that, for any text the user types into the search box, only the terms that start with that input (allowing typos and redundant whitespace) appear in the suggested terms list.

Some examples:

Suppose there are 3 terms in the index:

google account
sign in with google account
how to sign in google account

When the user inputs "g", "go", or "goo", only "google account" appears in the terms list.
When the user inputs "si", "sign in", or "sig in", only "sign in with google account" appears.
When the user inputs "sign with", nothing appears.

That is why I defined the Term field with a keyword tokenizer plus the ngram and whitespace_normalization filters.

Do you have any alternative solutions for our starts-with autocomplete?

Hi @Youxu

Could you use a 'truncate' filter to limit token length? Putting in an entire sentence of text is presumably an edge case; most of the time you're expecting people to type a few characters, correct? So truncating things to, say, 100 characters should leave the vast majority of users unaffected, while preventing errors for those who copy and paste large amounts of text in.
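As a minimal sketch, assuming a hypothetical filter name truncate_100 added to your existing search analyzer (the length of 100 is just an illustration), it might look like this:

"analysis": {
  "filter": {
    "truncate_100": {
      "type": "truncate",
      "length": 100
    }
  },
  "analyzer": {
    "lowercase_norm_keyword": {
      "type": "custom",
      "tokenizer": "keyword",
      "filter": [
        "lowercase",
        "whitespace_normalization",
        "trim",
        "truncate_100"
      ]
    }
  }
}

Because this is applied only in the search analyzer, the indexed ngrams are unchanged; overly long pasted queries should simply be cut down before the fuzzy automaton is built.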

Thanks for your suggestion, I think it makes sense!

Sent from my iPhone

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.