To better understand dis_max, I am using the _validate API to review the Lucene query it produces. However, the generated Lucene syntax appears to be incorrect. I've tried this with Elasticsearch 6.4.0 and 6.1.2.
Here is my corpus:
PUT index1/awe/1
{
  "field_1": "blue red",
  "field_2": "elephant pig"
}
PUT index1/awe/2
{
  "field_1": "blue elephant",
  "field_2": "pig red"
}
Here is the dis_max query:
GET index1/awe/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.1,
      "queries": [
        {
          "match": {
            "field_1": "blue elephant"
          }
        },
        {
          "match": {
            "field_2": "blue elephant"
          }
        }
      ]
    }
  }
}
This produces the following results:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.87546873,
    "hits": [
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "2",
        "_score": 0.87546873,
        "_source": {
          "field_1": "blue elephant",
          "field_2": "pig red"
        }
      },
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "1",
        "_score": 0.71137935,
        "_source": {
          "field_1": "blue red",
          "field_2": "elephant pig"
        }
      }
    ]
  }
}
To examine the Lucene syntax for this query, I send the same request body as above to the _validate endpoint instead of _search:
GET index1/awe/_validate/query?explain=true
which produces the following results:
{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": "index1",
      "valid": true,
      "explanation": "+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1 #*:*"
    }
  ]
}
Note the "~0.1" at the end of the explanation field. This appears to correspond to the "tie_breaker" value, but it is NOT valid Lucene syntax. I also left out the trailing "#*:*" from the explanation: it doesn't parse either, and I don't know what it represents.
Here is the Lucene query using the bad syntax (note again the "~0.1"):
GET index1/awe/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1"
    }
  }
}
which produces:
{
"error": {
"root_cause": [
{
"type": "query_shard_exception",
"reason": "Failed to parse query [+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1]",
"index_uuid": "hXbH-YvCSnSwnmduTExXNA",
"index": "index1"
}
],
"type": "search_phase_execution_exception",
...
"reason": "Failed to parse query [+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1]",
...
"type": "parse_exception",
"reason": "Cannot parse '+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1': Encountered \" \"~0.1 \"\" at line 1,
...
To get the Lucene query to parse, I changed the "~0.1" to a boost "^0.1" as follows:
GET index1/awe/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))^0.1"
    }
  }
}
This parses correctly, but the results are not the same as the original dis_max query:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.08754688,
    "hits": [
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "2",
        "_score": 0.08754688,
        "_source": {
          "field_1": "blue elephant",
          "field_2": "pig red"
        }
      },
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "1",
        "_score": 0.08754688,
        "_source": {
          "field_1": "blue red",
          "field_2": "elephant pig"
        }
      }
    ]
  }
}
Document 2 should have scored "_score": 0.87546873, and Document 1 should have scored: "_score": 0.71137935.
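For reference, this is how I understand dis_max to combine scores, based on the documentation: the best-scoring sub-query counts in full, and every other matching sub-query is multiplied by tie_breaker before being added. Below is a minimal Python sketch of that arithmetic. The function name dis_max_score is my own, and the per-clause scores are values I back-derived by hand (ln 1.2 for "blue", ln 2 for "elephant"); they are assumptions, not output copied from _explain.

```python
def dis_max_score(sub_scores, tie_breaker):
    """Best sub-query score plus tie_breaker times the remaining ones.

    sub_scores holds one score per dis_max sub-query (0.0 if it did not match).
    This mirrors my reading of the documented behaviour, not actual Lucene code.
    """
    best = max(sub_scores)
    others = sum(sub_scores) - best
    return best + tie_breaker * others


# Assumed per-clause scores (hand-derived, not taken from _explain).
BLUE = 0.1823216      # "blue" matches field_1 in both documents
ELEPHANT = 0.6931472  # "elephant" matches one document per field

# Document 1: field_1 matches "blue", field_2 matches "elephant".
doc1 = dis_max_score([BLUE, ELEPHANT], tie_breaker=0.1)

# Document 2: field_1 matches both terms, field_2 matches neither.
doc2 = dis_max_score([BLUE + ELEPHANT, 0.0], tie_breaker=0.1)

# A boost of 0.1 on a plain sum of the matching clauses, for comparison.
boosted_sum = 0.1 * (BLUE + ELEPHANT)

print(round(doc1, 6))         # 0.711379
print(round(doc2, 6))         # 0.875469
print(round(boosted_sum, 6))  # 0.087547
```

If these assumed clause scores are right, the sketch reproduces both dis_max scores above, while the "^0.1" version matches a plain sum of the clauses scaled by 0.1. That would explain why both documents tie at 0.08754688: the boost scales the whole query and is not a tie breaker at all.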
The boost for the tie_breaker is apparently not positioned correctly and I'm not able to figure out exactly what the correct Lucene syntax should be. This isn't just an academic exercise; I really would like to understand and verify how dis_max is working. What is the Lucene syntax that corresponds to this basic dis_max query?
Thanks for your help.