Fuzzy search on a field with word_delimiter doesn't work as expected

I have an index defined like this:

PUT /fuzzytest
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "analysis": {
        "filter": {
            "my_word_delimiter": {
              "type": "word_delimiter",
              "preserve_original": "true",
              "catenate_numbers": "true",
              "catenate_words": "true",
              "catenate_all": "true"
            }
        },
        "analyzer": {
           "my_identifier_analyzer": {
              "filter": [
                "standard",
                "my_word_delimiter",
                "lowercase"
              ],
              "tokenizer": "keyword"
            }
        }
      }
    }
  },
  "mappings": {
    "tip": {
      "properties": {
        "code": {
          "type": "string",
          "analyzer": "my_identifier_analyzer"
        }
      }
    }
  }
}

I put this document:

PUT fuzzytest/tip/1
{
  "code": "335/25R20"
}

If I search it without fuzzy, I find it:

POST fuzzytest/_search
{
  "query" : {
    "query_string" : {
      "query" : "25r20",
      "fields" : [ "code" ]
    }
  }
}

But if I append ~2 for fuzziness, I get no results.

POST fuzzytest/_search
{
  "query" : {
    "query_string" : {
      "query" : "25r20~2",
      "fields" : [ "code" ]
    }
  }
}

My question is: why does the last search return no results, while the first one does?

In the spirit of teaching a man to fish [1] let's walk through what's going on here.

First we can see what query is being executed once it has been parsed and rewritten for execution using the explain API on the document ID you expect to match:

GET fuzzytest/tip/1/_explain
{
  "query" : {
    "query_string" : {
      "query" : "25r20~",
      "fields" : [ "code" ]
    }
  }
}

This reveals that the query is devoid of clauses (we have an empty query):

 "description": "no match on required clause (MatchNoDocsQuery(\"empty BooleanQuery\"))",

What this tells us is that the fuzzy query produced no terms at all. So let's look at what is actually in the index for the field in question, using the _analyze API:

GET fuzzytest/_analyze
{
  "analyzer": "my_identifier_analyzer",
  "text": "335/25R20"
}

This shows us the terms in the index:

"tokens": [
  { "token": "335/25r20" },
  { "token": "335" },
  { "token": "33525" },
  { "token": "33525r20" },
  { "token": "25" },
  { "token": "r" },
  { "token": "20" }
]

Let's fix that. According to the docs [2], the correct place to put the fuzziness setting is a separate fuzziness parameter on the query, so that should be:

GET fuzzytest/tip/1/_explain
{
  "query": {
    "query_string": {
      "query": "25r20~",
      "fuzziness": 2,
      "fields": [ "code" ]
    }
  }
}

Sadly that still does not match, and increasing the fuzziness setting does not alter this. The reason is that the token being "fuzzied" is 25r20, which is not fed through your analyzer. If you reconsider the tokens we saw in the index, none of them are within 2 edits (the maximum edit distance allowed) of 25r20.
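You can check that claim concretely. Here's a minimal sketch in Python using plain Levenshtein distance (Elasticsearch's fuzzy matching actually uses Damerau-Levenshtein, which also counts transpositions as one edit, but that makes no difference for these tokens):

```python
# Plain Levenshtein edit distance via dynamic programming.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

# The tokens the analyzer put in the index, per the _analyze output above.
tokens = ["335/25r20", "335", "33525", "33525r20", "25", "r", "20"]

for t in tokens:
    print(t, edit_distance("25r20", t))
```

Every indexed token is at least 3 edits away from 25r20, so even the maximum fuzziness of 2 can never produce a match.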

The irony is that the non-fuzzy version of your query is better at fuzzy matching, because it runs the query text through your analyzer's tokenization policy and splits 25r20 into multiple tokens. The query_string parser with fuzzy, by contrast, assumes that all non-whitespace text before the ~ character (25r20) is a single term to be edit-distance matched, rather than just the last token (20).
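To illustrate why the non-fuzzy query matches, here's a rough Python imitation of the word_delimiter splitting (a simplification for illustration, not the real Lucene filter): each sub-token of the analyzed query matches an indexed token exactly, no fuzziness required.

```python
import re

# Simplified stand-in for word_delimiter behaviour: split on
# letter/digit boundaries and non-alphanumeric characters.
def word_delimiter_split(text):
    return re.findall(r"[a-z]+|[0-9]+", text.lower())

# The token set produced at index time, per the _analyze output above.
indexed = {"335/25r20", "335", "33525", "33525r20", "25", "r", "20"}

# The non-fuzzy query "25r20" is analyzed the same way, so every
# sub-token finds an exact match in the index.
query_tokens = word_delimiter_split("25r20")
print(query_tokens)                             # ['25', 'r', '20']
print(all(t in indexed for t in query_tokens))  # True
```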

It's a messy world, huh?

[1] "Give a man a fish, feed him for a day. Teach a man to fish, feed him for a lifetime."
[2] Common options | Elasticsearch Guide [5.0] | Elastic
