Match Query 2.x vs 5.x Results Discrepancy

I ran into an issue when updating an application to support ES 5: the results being returned were not accurate, or at least not the same as what 2.x returned. After some digging, I realized that the match query in 2.x and in 5.x builds different underlying queries. As I understand how the match query is supposed to work (please correct me if I'm wrong), it analyzes the search text using the field's analyzer (the search analyzer, or the index analyzer if no search analyzer is specified) and then builds a boolean query out of term queries, one per token. So something must have changed between 2.x and 5.x in how match queries are processed.
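For example, my rough mental model (ignoring the operator/minimum_should_match settings and scoring) is that, on a hypothetical standard-analyzed field called title, a match query like

{"query": {"match": {"title": "quick fox"}}}

gets analyzed into the terms quick and fox and rewritten into something roughly equivalent to

{"query": {"bool": {"should": [{"term": {"title": "quick"}}, {"term": {"title": "fox"}}]}}}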

The fields I'm searching are custom analyzed, so I'm not sure whether that's causing the problem or whether I'm missing something more fundamental. To narrow it down, I created a sample template/mapping and some data to examine the issue using the Docker images for 2.x (2.4) and 5.x (5.2).

Here's the 2.x template:

{
    "template": "test-*",
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "domain_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "domain", "unique"]
                }
            },
            "filter": {
                "domain": {
                    "type": "pattern_capture",
                    "preserve_original": "true",
                    "patterns": [
                        "(\\w+)",
                        "(\\p{L}+)",
                        "(\\d+)"
                    ]
                }
            }
        }
    },
    "mappings": {
        "_default_": {
            "properties": {
                "domainName": {
                    "type": "string",
                    "analyzer": "domain_analyzer"
                }
            }
        }
    }
}

The 5.x template is the same except that "string" is changed to "text".
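For reference, the mappings section of the 5.x template then looks like this (everything else unchanged):

"mappings": {
    "_default_": {
        "properties": {
            "domainName": {
                "type": "text",
                "analyzer": "domain_analyzer"
            }
        }
    }
}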

The following "documents" are put into both versions:

{
	"domainName": "foo.bar"
}

{
	"domainName": "bar.foo"
}
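For reference, I indexed them with requests along these lines (index test-0, type test, IDs 1 and 2, matching the hits shown below):

PUT /test-0/test/1
{
    "domainName": "foo.bar"
}

PUT /test-0/test/2
{
    "domainName": "bar.foo"
}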

When I run a validate query (_validate/query?explain) on '{"query": {"match": {"domainName": "foo.bar"}}}', 2.x produces the following explanation:

"explanations" : [ {
    "index" : "test-0",
    "valid" : true,
    "explanation" : "domainName:foo.bar domainName:foo domainName:bar"
  } ]

while 5.x produces this explanation:

"explanations" : [
    {
      "index" : "test-0",
      "valid" : true,
      "explanation" : "Synonym(domainName:bar domainName:foo domainName:foo.bar)"
    }
  ]

The actual search results look like this.

For 2.x:

{
  "took" : 50,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.3065004,
    "hits" : [ {
      "_index" : "test-0",
      "_type" : "test",
      "_id" : "1",
      "_score" : 1.3065004,
      "_source" : {
        "domainName" : "foo.bar"
      }
    }, {
      "_index" : "test-0",
      "_type" : "test",
      "_id" : "2",
      "_score" : 0.5410969,
      "_source" : {
        "domainName" : "bar.foo"
      }
    } ]
  }
}

For 5.x:

{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.33425623,
    "hits" : [
      {
        "_index" : "test-0",
        "_type" : "test",
        "_id" : "1",
        "_score" : 0.33425623,
        "_source" : {
          "domainName" : "foo.bar"
        }
      },
      {
        "_index" : "test-0",
        "_type" : "test",
        "_id" : "2",
        "_score" : 0.30854422,
        "_source" : {
          "domainName" : "bar.foo"
        }
      }
    ]
  }
}

I manually wrote a bool query of term queries using the tokens the analyzer would produce:

{"query": {"bool": {"should": [{"term": {"domainName": "foo.bar"}}, {"term": {"domainName": "foo"}}, {"term": {"domainName": "bar"}}]}}}

In 5.x, here's the explanation:

"explanations" : [
    {
      "index" : "test-0",
      "valid" : true,
      "explanation" : "domainName:foo.bar domainName:foo domainName:bar"
    }
  ]

and the search results:

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.4544617,
    "hits" : [
      {
        "_index" : "test-0",
        "_type" : "test",
        "_id" : "1",
        "_score" : 1.4544617,
        "_source" : {
          "domainName" : "foo.bar"
        }
      },
      {
        "_index" : "test-0",
        "_type" : "test",
        "_id" : "2",
        "_score" : 0.5013843,
        "_source" : {
          "domainName" : "bar.foo"
        }
      }
    ]
  }
}

These results look a lot closer to what 2.x was producing. Does anybody know what changed, and whether it's possible to get something similar to the old behavior back? Or, alternatively, is there something wrong with my analyzer, or is there something else I'm missing?

Edit:

The discrepancy definitely seems to be caused by the Synonym function in the explanation. Assuming Synonym refers to the Lucene SynonymFilter (https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymFilter.html), I wonder if the ordering of the tokens passed into it makes a difference.

Since ES 5, there is an optimization in the match query (and related queries): tokens that end up at the same position in the analyzed token stream are wrapped into a single synonym query.

The PatternCaptureGroupTokenFilter always sets the position increment to 0, so all of the pattern capture group output is placed at the same token position:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.java

The position is 0 for all tokens, as you can check with the _analyze endpoint:

POST /test/_analyze?analyzer=domain_analyzer
{
    "text" : "foo.bar"
}

Result

{
   "tokens": [
      {
         "token": "foo.bar",
         "start_offset": 0,
         "end_offset": 7,
         "type": "word",
         "position": 0
      },
      {
         "token": "foo",
         "start_offset": 0,
         "end_offset": 7,
         "type": "word",
         "position": 0
      },
      {
         "token": "bar",
         "start_offset": 0,
         "end_offset": 7,
         "type": "word",
         "position": 0
      }
   ]
}
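As a side note, on 5.x the analyzer can also be passed in the request body rather than as a URL parameter (if I remember correctly, the URL-parameter form is deprecated there), and it yields the same tokens:

POST /test/_analyze
{
    "analyzer": "domain_analyzer",
    "text": "foo.bar"
}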

I think the ES 5 query optimization is correct.

To get an alternative behavior out of PatternCaptureGroupTokenFilter, you would have to modify the implementation to respect a position increment parameter, for example:

"filter": {
 "domain_pattern": {
   "type": "pattern_capture",
   "preserve_original": "true",
   "position_increment": 1,
   "patterns": [
	  "(\\w+)",
	  "(\\p{L}+)",
	  "(\\d+)"
   ]
 }
}

At the moment, the parameter position_increment is accepted, but ignored.

Well, that sheds light on why it occurs. But if I'm reading your response correctly, there's currently no way to fix this?

What do you mean by this? In this case it's clearly causing unintended side effects. I can see that it would help in many searches, but this is a case where it breaks functionality. Do you know if ES 5 has any way to turn off this 'optimization'? If not, this sounds like a 'breaking change'...
