I ran into an issue when updating an application to support ES 5: the results being returned were not accurate, or at least not the same as what was returned in 2.x. After some digging, I realized that the match query creates different underlying queries in 2.x and in 5.x. As far as I understand it (someone please correct me if I'm wrong), the match query analyzes the search text using the field's search analyzer (or the index analyzer if no search analyzer is specified) and then crafts a boolean query from the resulting tokens, presumably using term queries. So I'm guessing something changed between 2.x and 5.x in how match queries are processed.
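To make my mental model concrete, here is a minimal Python sketch of what I assume the match query does after analysis: take the tokens the analyzer emits and wrap each one in a term query under bool/should. The helper name `match_as_bool_should` and the example token list are my own illustrations, not anything from Elasticsearch itself.

```python
import json

def match_as_bool_should(field, tokens):
    """Build a bool/should query with one term query per analyzed token.

    This mirrors my assumption about how a match query is expanded;
    it is not taken from the Elasticsearch source.
    """
    return {
        "query": {
            "bool": {
                "should": [{"term": {field: token}} for token in tokens]
            }
        }
    }

# Example tokens (an assumption for illustration): an analyzer that
# splits "foo.bar" into the original string plus its two parts.
example_tokens = ["foo.bar", "foo", "bar"]
query = match_as_bool_should("domainName", example_tokens)
print(json.dumps(query, indent=2))
```

If this model is right, a match query and the generated bool query should validate to the same explanation for the same field and tokens.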
The fields I'm searching are custom analyzed, so I'm not sure whether that's causing the issue or whether there's something more fundamental that I'm missing. To examine the issue quickly, I created a sample template/mapping and data using the docker images for 2 (2.4) and 5 (5.2).
Here's the 2 template:
{
  "template": "test-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "domain_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "domain", "unique"]
        }
      },
      "filter": {
        "domain": {
          "type": "pattern_capture",
          "preserve_original": "true",
          "patterns": ["(\\w+)", "(\\p{L}+)", "(\\d+)"]
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "properties": {
        "domainName": {
          "type": "string",
          "analyzer": "domain_analyzer"
        }
      }
    }
  }
}
The 5.x template is the same, except that the field type "string" is changed to "text".
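For anyone who wants to see which tokens I expect domain_analyzer to produce, here is a rough pure-Python approximation of the filter chain (keyword tokenizer, lowercase, pattern_capture with preserve_original, unique). It is only a sketch: the function name `domain_tokens` is made up, Python's `re` has no `\p{L}` so `[^\W\d_]+` stands in for it, and token positions, offsets, and the exact emission order in Lucene are ignored.

```python
import re

# Stand-ins for the three pattern_capture patterns in the template.
# Assumption: [^\W\d_]+ approximates \p{L}+ (letters only).
PATTERNS = [r"(\w+)", r"([^\W\d_]+)", r"(\d+)"]

def domain_tokens(text):
    """Approximate the tokens emitted by domain_analyzer for one value."""
    # keyword tokenizer keeps the whole input as one token; then lowercase
    original = text.lower()
    # preserve_original = true: the unsplit token is kept
    tokens = [original]
    # pattern_capture: add every capture-group match of every pattern
    for pattern in PATTERNS:
        tokens.extend(m.group(1) for m in re.finditer(pattern, original))
    # unique filter: drop duplicates, keeping the first occurrence
    return list(dict.fromkeys(tokens))

print(domain_tokens("foo.bar"))  # ['foo.bar', 'foo', 'bar']
print(domain_tokens("bar.foo"))  # ['bar.foo', 'bar', 'foo']
```

The authoritative token list comes from the _analyze API on the index itself; this is only meant to show why I expect three terms per value.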
The following "documents" are put into both versions:
{ "domainName": "foo.bar" }
{ "domainName": "bar.foo" }
When I run a validate query (_validate/query?explain) on '{"query": {"match": {"domainName": "foo.bar"}}}', 2 produces the following explanation:
"explanations" : [ { "index" : "test-0", "valid" : true, "explanation" : "domainName:foo.bar domainName:foo domainName:bar" } ]
while 5 produces this explanation:
"explanations" : [ { "index" : "test-0", "valid" : true, "explanation" : "Synonym(domainName:bar domainName:foo domainName:foo.bar)" } ]
The results of the actual queries look like:
for 2:
{
  "took": 50,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1.3065004,
    "hits": [
      { "_index": "test-0", "_type": "test", "_id": "1", "_score": 1.3065004, "_source": { "domainName": "foo.bar" } },
      { "_index": "test-0", "_type": "test", "_id": "2", "_score": 0.5410969, "_source": { "domainName": "bar.foo" } }
    ]
  }
}
for 5:
{
  "took": 20,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 0.33425623,
    "hits": [
      { "_index": "test-0", "_type": "test", "_id": "1", "_score": 0.33425623, "_source": { "domainName": "foo.bar" } },
      { "_index": "test-0", "_type": "test", "_id": "2", "_score": 0.30854422, "_source": { "domainName": "bar.foo" } }
    ]
  }
}
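In case anyone wants to reproduce the _validate/query?explain output from earlier without typing the requests by hand, here is a small Python sketch that builds the request for the match query. The host `http://localhost:9200` and the helper names are assumptions about my local docker setup, and I use POST because the endpoint accepts a request body that way.

```python
import json
import urllib.request

def build_validate_request(index, field, text, host="http://localhost:9200"):
    """Build (but do not send) a _validate/query?explain request.

    The host default is an assumption about a local test node.
    """
    body = json.dumps({"query": {"match": {field: text}}}).encode("utf-8")
    url = f"{host}/{index}/_validate/query?explain"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_explanations(request):
    """Send the request and return the "explanations" list from the reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["explanations"]

req = build_validate_request("test-0", "domainName", "foo.bar")
print(req.get_method(), req.full_url)
```

Calling `fetch_explanations(req)` against a running 2.4 node and a running 5.2 node should show the two different explanation strings side by side.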
I manually wrote a boolean term query with the tokens that would be produced by the analyzer ('{"query": {"bool": {"should": [{"term": {"domainName": "foo.bar"}}, {"term": {"domainName": "foo"}}, {"term": {"domainName": "bar"}}]}}}'), and in 5 here's the explanation:
"explanations" : [ { "index" : "test-0", "valid" : true, "explanation" : "domainName:foo.bar domainName:foo domainName:bar" } ]
and the search results:
{
  "took": 6,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1.4544617,
    "hits": [
      { "_index": "test-0", "_type": "test", "_id": "1", "_score": 1.4544617, "_source": { "domainName": "foo.bar" } },
      { "_index": "test-0", "_type": "test", "_id": "2", "_score": 0.5013843, "_source": { "domainName": "bar.foo" } }
    ]
  }
}
These results look a lot closer to what 2 was producing, so does anybody know what changed and whether it's possible to get something similar to the old behavior back? Or, alternatively, is there something wrong with my analyzer, or is there something else I'm missing?
The discrepancy seems to be caused by the Synonym function in the explanation. Assuming Synonym is the Lucene SynonymFilter (https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymFilter.html), I wonder if the ordering of the tokens passed into Synonym is making a difference.