Fuzzy matching hashtag with Simple Query String Query

I have an indexed field named "searchBucket" with the term "#revit" in it. Here is the analyzer applied to that field:

"default" : {
  "filter" : [ ... ],
  "char_filter" : "html_strip",
  "type" : "custom",
  "tokenizer" : "uax_url_email"
}

Via the Analyze API I've confirmed that this is the token that is generated:

"tokens" : [
  {
    "token" : "revit",
    "start_offset" : 1,
    "end_offset" : 6,
    "type" : "<ALPHANUM>",
    "position" : 0
  }
]

I'm trying to search this field using the Simple Query String Query as follows:

"simple_query_string": {
  "default_operator": "and",
  "fields": ["searchBucket"],
  "fuzzy_prefix_length": 1,
  "fuzzy_transpositions": true,
  "lenient": true,
  "query": "#reivt~",
  "quote_field_suffix": ".exact"
}

As you can see, I'm trying to run a fuzzy query where the "i" and "v" are transposed. This does not match. However, if I remove the "#" from the beginning of the query and search "reivt~" it does match. Also if I remove the tilde from the query and search "#revit" it matches.

So it seems that using the tilde (fuzzy query) means the field's analyzer isn't being applied. Can anyone confirm this? Alternatively, is there a way to see the Lucene query that the Simple Query String Query generates under the hood?

You should be able to use the "_validate/query" endpoint for this. Use the parameter "explain=true" as described in the docs to see what query gets generated from this.
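For anyone who wants to try this, here is a minimal sketch of such a call in Python using only the standard library. The host "http://localhost:9200" and the index name "my-index" are placeholders, not values from this thread, and the helper names are made up for illustration. The request body simply reuses the query from the question.

```python
import json
from urllib import request


def build_validate_request(base_url, index, query_text, prefix_length=1):
    """Build the URL and JSON body for a _validate/query call with explain=true.

    base_url and index are placeholders supplied by the caller.
    """
    url = f"{base_url}/{index}/_validate/query?explain=true"
    body = {
        "query": {
            "simple_query_string": {
                "default_operator": "and",
                "fields": ["searchBucket"],
                "fuzzy_prefix_length": prefix_length,
                "fuzzy_transpositions": True,
                "lenient": True,
                "query": query_text,
                "quote_field_suffix": ".exact",
            }
        }
    }
    return url, body


def explain_query(base_url, index, query_text, prefix_length=1):
    """Send the request to a running cluster and return the parsed response."""
    url, body = build_validate_request(base_url, index, query_text, prefix_length)
    req = request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Against a running cluster, something like explain_query("http://localhost:9200", "my-index", "#reivt~") should return a response whose "explanations" entries describe the generated query. Comparing the output for prefix_length=1 and prefix_length=0 is a quick way to see what changes.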

Thank you Christophe, that was helpful. It appears that "fuzzy_prefix_length": 1 might be the problem. If that value is set to 0, the match succeeds. Since the "#" is the first character, it seems to be left intact in the fuzzy term, even though the analyzer would normally strip it out. I guess I can try resetting that value to the default of 0. It's still confusing to me why the analyzer doesn't get applied consistently: it's not as if it ignores the first character completely. For example, if I type "Revit", the lowercase token filter still gets applied to that first character.
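One reading that fits these observations, offered as an assumption rather than confirmed Elasticsearch internals: the fuzzy term may be normalized (for example lowercased) without being run through the tokenizer, so the "#" that the tokenizer would drop stays in the term. The sketch below models that in plain Python. The function names are hypothetical, the edit distance is a textbook optimal string alignment distance (Levenshtein plus adjacent transpositions), and the limit of 2 edits is an assumed default for a bare "~".

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute,
    or swap two adjacent characters, each costing one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "iv" <-> "vi"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(query_term, indexed_term, max_edits=2, prefix_length=0):
    """Model of a fuzzy term match with a required exact prefix.

    Assumption: the query term is only lowercased, not tokenized,
    so a leading "#" is kept. max_edits=2 is also an assumption.
    """
    term = query_term.lower()
    # The first prefix_length characters must match exactly.
    if term[:prefix_length] != indexed_term[:prefix_length]:
        return False
    return osa_distance(term, indexed_term) <= max_edits


# The indexed token from the Analyze API output is "revit".
print(fuzzy_match("#reivt", "revit", prefix_length=1))
print(fuzzy_match("#reivt", "revit", prefix_length=0))
print(fuzzy_match("Reivt", "revit", prefix_length=1))
```

Under these assumptions, "#reivt" fails with a prefix length of 1 because "#" is not the first character of "revit", but passes with a prefix length of 0 because dropping the "#" and swapping "iv" is two edits. "Reivt" passes either way, since lowercasing happens before the comparison.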
