Fuzzy matching hashtag with Simple Query String Query

I have an indexed field named "searchBucket" with the term "#revit" in it. Here is the analyzer applied to that field:

"default" : {
  "filter" : [
    "lowercase",
    "asciifolding",
    "english_stopwords_filter",
    "minimal_english_stemmer"
  ],
  "char_filter" : "html_strip",
  "type" : "custom",
  "tokenizer" : "uax_url_email"
}

Via the Analyze API I've confirmed that this is the token that is generated:

{
  "tokens" : [
    {
      "token" : "revit",
      "start_offset" : 1,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
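
For anyone who wants to reproduce that check, here is a minimal sketch of the Analyze API request in Python. The host "http://localhost:9200" and the index name "my-index" are placeholders I made up, not values from my cluster. The helper only builds the request, so it can be inspected without a running node.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, field, text):
    """Build (but do not send) a POST request for the index-level _analyze API."""
    # Passing "field" asks Elasticsearch to use the analyzer mapped to that field.
    body = {"field": field, "text": text}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and index name; replace with your own.
req = build_analyze_request(
    "http://localhost:9200", "my-index", "searchBucket", "#revit"
)

# To run it against a live cluster:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["tokens"])
```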

I'm trying to search this field using the Simple Query String Query as follows:

"simple_query_string": {
  "default_operator": "and",
  "fields": ["searchBucket"],
  "flags": "AND|OR|NOT|PREFIX|PHRASE|PRECEDENCE|WHITESPACE|FUZZY",
  "fuzzy_prefix_length": 1,
  "fuzzy_transpositions": true,
  "lenient": true,
  "query": "#reivt~",
  "quote_field_suffix": ".exact"
}

As you can see, I'm trying to run a fuzzy query where the "i" and "v" are transposed. This does not match. However, if I remove the "#" from the beginning of the query and search for "reivt~", it does match. Also, if I remove the tilde from the query and search for "#revit", it matches.

So it seems that using the tilde (fuzzy query) might mean the field's analyzer isn't being applied. Can anyone confirm this? Or can anyone tell me how to see the Lucene query that is generated under the hood for the simple query string query?

You should be able to use the "_validate/query" endpoint for this. Pass the parameter "explain=true", as described in the docs, to see the query that gets generated.
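
A rough sketch of what that call could look like from Python, using the query from the question. The host and the index name "my-index" are assumptions for illustration only. The helper builds the request without sending it.

```python
import json
import urllib.request

# The simple_query_string body from the question, unchanged.
QUERY = {
    "simple_query_string": {
        "default_operator": "and",
        "fields": ["searchBucket"],
        "flags": "AND|OR|NOT|PREFIX|PHRASE|PRECEDENCE|WHITESPACE|FUZZY",
        "fuzzy_prefix_length": 1,
        "fuzzy_transpositions": True,
        "lenient": True,
        "query": "#reivt~",
        "quote_field_suffix": ".exact",
    }
}


def build_validate_request(base_url, index, query):
    """Build (but do not send) a _validate/query request with explain=true."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_validate/query?explain=true",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and index name; replace with your own.
validate_req = build_validate_request("http://localhost:9200", "my-index", QUERY)

# To run it against a live cluster and print the generated query description:
# with urllib.request.urlopen(validate_req) as resp:
#     for item in json.load(resp).get("explanations", []):
#         print(item.get("explanation"))
```

Comparing the explanation for "#reivt~" with the one for "reivt~" should show whether the "#" survives into the generated fuzzy query.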

Thank you Christophe, that was helpful. It appears that "fuzzy_prefix_length": 1 is the problem: if I set that value to 0, the match succeeds. Since the "#" is the first character of the query term, it seems to be left intact and treated as the required prefix, even though the analyzer would normally strip it out. I guess I can reset that value to the default of 0. It's still confusing to me why the analyzer isn't applied consistently, because it doesn't ignore the first character completely. For example, if I type "Revit", the lowercase token filter still gets applied to that first character.
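
Here is a small Python model of what I think is going on. It is only a sketch, not Lucene's actual implementation. It assumes that the fuzzy term is lowercased but not tokenized (so the "#" stays), that the first "fuzzy_prefix_length" characters must match the indexed term exactly, and that a bare "~" allows up to 2 edits. The function names are made up, and the distance is the simpler optimal string alignment variant of Damerau-Levenshtein.

```python
def osa_distance(a, b):
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "iv" <-> "vi".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_matches(query_term, indexed_term, prefix_length, max_edits=2):
    """Hypothetical model of a fuzzy term match with a required exact prefix."""
    # Assumption: the term is lowercased but "#" is NOT stripped, because no
    # tokenizer runs on it.
    query_term = query_term.lower()
    if query_term[:prefix_length] != indexed_term[:prefix_length]:
        return False
    return osa_distance(query_term, indexed_term) <= max_edits


print(fuzzy_matches("#reivt", "revit", prefix_length=1))  # "#" != "r", no match
print(fuzzy_matches("#reivt", "revit", prefix_length=0))  # drop "#" + one swap
print(fuzzy_matches("reivt", "revit", prefix_length=1))   # one swap only
```

Under those assumptions the model reproduces the three results I saw: no match with the "#" and a prefix length of 1, and a match in the other two cases.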
