To better understand dis_max, I am using the _validate API to review the Lucene query it produces. However, the generated Lucene syntax appears to be incorrect. I've tried this with Elasticsearch 6.4.0 and 6.1.2.
Here is my corpus:
PUT index1/awe/1
{ "field_1": "blue red", "field_2": "elephant pig" }

PUT index1/awe/2
{ "field_1": "blue elephant", "field_2": "pig red" }
Here is the dis_max query:
GET index1/awe/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.1,
      "queries": [
        { "match": { "field_1": "blue elephant" } },
        { "match": { "field_2": "blue elephant" } }
      ]
    }
  }
}
This produces the following results:
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 0.87546873,
    "hits": [
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "2",
        "_score": 0.87546873,
        "_source": { "field_1": "blue elephant", "field_2": "pig red" }
      },
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "1",
        "_score": 0.71137935,
        "_source": { "field_1": "blue red", "field_2": "elephant pig" }
      }
    ]
  }
}
To examine the Lucene syntax for this query, I send the same request body as above to the _validate/query endpoint instead of _search:
GET index1/awe/_validate/query?explain=true
which produces the following results:
{
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "valid": true,
  "explanations": [
    {
      "index": "index1",
      "valid": true,
      "explanation": "+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1 #*:*"
    }
  ]
}
Note the "~0.1" near the end of the explanation field. This appears to correspond to the "tie_breaker" value, but it is NOT valid Lucene query syntax. I have left out the trailing "#*:*" from the explanation in what follows: it doesn't parse either, and I don't know what it represents.
Here is a query_string query that uses the explanation text as-is, minus the "#*:*" (note again the "~0.1"):
GET index1/awe/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1"
    }
  }
}
which produces:
{ "error": { "root_cause": [ { "type": "query_shard_exception", "reason": "Failed to parse query [+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1]", "index_uuid": "hXbH-YvCSnSwnmduTExXNA", "index": "index1" } ], "type": "search_phase_execution_exception", ... "reason": "Failed to parse query [+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1]", ... "type": "parse_exception", "reason": "Cannot parse '+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1': Encountered \" \"~0.1 \"\" at line 1, ...
To get the Lucene query to parse, I changed the "~0.1" to a boost, "^0.1", as follows:
GET index1/awe/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))^0.1"
    }
  }
}
This parses correctly, but the results are not the same as the original dis_max query:
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 0.08754688,
    "hits": [
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "2",
        "_score": 0.08754688,
        "_source": { "field_1": "blue elephant", "field_2": "pig red" }
      },
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "1",
        "_score": 0.08754688,
        "_source": { "field_1": "blue red", "field_2": "elephant pig" }
      }
    ]
  }
}
To match the dis_max results, document 2 should have scored "_score": 0.87546873 and document 1 should have scored "_score": 0.71137935. Instead, both documents score 0.08754688.
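For comparison, here is a minimal Python sketch of the arithmetic as I understand it from the dis_max documentation: the score is the best clause score plus tie_breaker times the sum of the other clause scores, whereas a boost scales the whole score. The per-field scores below are assumptions: I back-derived them from the totals above rather than reading them from an explain output, and the idea that the boosted query_string version scores as a plain sum of all term clauses is also only my guess. The function names are hypothetical.

```python
def dis_max_score(clause_scores, tie_breaker):
    """Best clause score plus tie_breaker times the sum of the other clause scores."""
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


def boosted_sum_score(clause_scores, boost):
    """A boost multiplies the whole summed score (assumed boolean-style scoring)."""
    return boost * sum(clause_scores)


# ASSUMED per-field match scores [field_1, field_2], inferred from the reported totals.
DOC_1 = [0.182322, 0.693147]  # field_1 matches "blue", field_2 matches "elephant"
DOC_2 = [0.875469, 0.0]       # field_1 matches both terms, field_2 matches neither

for name, scores in (("doc 1", DOC_1), ("doc 2", DOC_2)):
    print(name,
          "dis_max:", round(dis_max_score(scores, 0.1), 6),
          "boosted sum:", round(boosted_sum_score(scores, 0.1), 6))
```

With these assumed inputs, the dis_max formula approximately reproduces the two original scores, and the boosted sum gives the same value for both documents. That would be consistent with "^0.1" not being equivalent to the tie_breaker, but it does not tell me what the right Lucene syntax is.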
The boost for the tie_breaker is apparently not positioned correctly and I'm not able to figure out exactly what the correct Lucene syntax should be. This isn't just an academic exercise; I really would like to understand and verify how dis_max is working. What is the Lucene syntax that corresponds to this basic dis_max query?
Thanks for your help.