Difficulty understanding dis_max query using _validate


#1

To better understand dis_max, I am using _validate to review the Lucene syntax produced. However, the generated Lucene syntax appears to be incorrect. I've tried this with 6.4.0 and 6.1.2.

Here is my corpus:

PUT index1/awe/1
{
  "field_1":"blue red",
  "field_2":"elephant pig"
}
PUT index1/awe/2
{
  "field_1":"blue elephant",
  "field_2":"pig red"
}

Here is the dis_max query:

GET index1/awe/_search?search_type=dfs_query_then_fetch
{   
  "query": {
    "dis_max": {
      "tie_breaker": 0.1, 
      "queries": [
      {
        "match": {
          "field_1": "blue elephant"
        }
      },
      {
        "match": {
          "field_2": "blue elephant"
        }
      }
      ]
    }
  }
}

This produces the following results:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.87546873,
    "hits": [
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "2",
        "_score": 0.87546873,
        "_source": {
          "field_1": "blue elephant",
          "field_2": "pig red"
        }
      },
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "1",
        "_score": 0.71137935,
        "_source": {
          "field_1": "blue red",
          "field_2": "elephant pig"
        }
      }
    ]
  }
}

To examine the Lucene syntax for this query, I send the same query body as above to the _validate endpoint:

GET index1/awe/_validate/query?explain=true

which produces the following results:

{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": "index1",
      "valid": true,
      "explanation": "+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1 #*:*"
    }
  ]
}

Note the "~0.1" at the end of the explanation field. This appears to correspond to the "tie_breaker" value, but it is NOT valid Lucene query syntax. I also left the trailing "#*:*" out of the queries below: it doesn't parse either, and I don't know what it represents.

Here is the Lucene query using the bad syntax (note again the "~0.1"):

GET index1/awe/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1"
    }
  }
}

which produces:

{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "Failed to parse query [+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1]",
        "index_uuid": "hXbH-YvCSnSwnmduTExXNA",
        "index": "index1"
      }
    ],
    "type": "search_phase_execution_exception",
    ...
    "reason": "Failed to parse query [+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1]",
    ...
    "type": "parse_exception",
            "reason": "Cannot parse '+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))~0.1': Encountered \"  \"~0.1 \"\" at line 1,
    ...

To fix the Lucene query, I changed the "~0.1" to a boost "^0.1" as follows:

GET index1/awe/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "+((field_1:blue field_1:elephant) | (field_2:blue field_2:elephant))^0.1"
    }
  }
}

This parses correctly, but the results are not the same as the original dis_max query:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.08754688,
    "hits": [
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "2",
        "_score": 0.08754688,
        "_source": {
          "field_1": "blue elephant",
          "field_2": "pig red"
        }
      },
      {
        "_index": "index1",
        "_type": "awe",
        "_id": "1",
        "_score": 0.08754688,
        "_source": {
          "field_1": "blue red",
          "field_2": "elephant pig"
        }
      }
    ]
  }
}

Document 2 should have scored 0.87546873 and Document 1 should have scored 0.71137935, as in the original dis_max results.
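
As a sanity check on those expected scores, here is a minimal Python sketch of how a disjunction-max query is generally described as combining sub-query scores: the best sub-query score plus tie_breaker times the remaining ones. The function name and the two per-term scores are assumptions. The scores are back-solved from the two result sets above and were not returned by Elasticsearch. The last line assumes the "^0.1" query_string version was scored as a plain sum of the matching terms scaled by the boost.

```python
# Sketch of dis_max scoring: best sub-query score plus tie_breaker times the rest.
# The per-term scores are assumptions inferred from the scores reported above.

def dis_max_score(sub_scores, tie_breaker):
    """Combine sub-query scores: max + tie_breaker * (sum of the others)."""
    matching = [s for s in sub_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)

BLUE = 0.18232156      # assumed score of field_1:blue
ELEPHANT = 0.6931472   # assumed score of field_1:elephant and field_2:elephant

# Document 2: both terms match in field_1, nothing matches in field_2.
doc2 = dis_max_score([BLUE + ELEPHANT, 0.0], tie_breaker=0.1)

# Document 1: "blue" matches in field_1, "elephant" matches in field_2.
doc1 = dis_max_score([BLUE, ELEPHANT], tie_breaker=0.1)

# Assumed scoring of the "^0.1" rewrite: a plain sum scaled by the boost,
# which is the same for both documents.
boosted_sum = 0.1 * (BLUE + ELEPHANT)

print(doc2, doc1, boosted_sum)
```

Under these assumptions the sketch lands within rounding of the dis_max scores and of the 0.08754688 returned by the boosted query. That would suggest the "~0.1" acts as a tie-breaker inside the disjunction and not as a boost on the whole clause.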

The boost for the tie_breaker is apparently not positioned correctly and I'm not able to figure out exactly what the correct Lucene syntax should be. This isn't just an academic exercise; I really would like to understand and verify how dis_max is working. What is the Lucene syntax that corresponds to this basic dis_max query?

Thanks for your help.


#2

I see the same behavior when using minimum_should_match:

GET index1/awe/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "bool": {
      "should": [
        { "match": { "field_1": "blue elephant" } },
        { "match": { "field_2": { "query": "blue elephant", "boost": 1.0 } } }
      ],
      "minimum_should_match": 2
    }
  }
}

which produces the following query string:

"explanation": "+(((field_1:blue field_1:elephant) (field_2:blue field_2:elephant))~2) #*:*"

That query string is not accepted:

GET index1/awe/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "+(((field_1:blue field_1:elephant) (field_2:blue field_2:elephant))~2)"
    }
  }
}

It fails with the error:

{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "Failed to parse query [+(((field_1:blue field_1:elephant) (field_2:blue field_2:elephant))~2)]",

Any suggestions as to what the proper syntax should be? Thanks!
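
For comparison, here is a small Python sketch of what the bool query above should match if the "~2" in the explanation is read as the minimum number of should clauses that must match. That reading is an assumption based on how the explanation mirrors minimum_should_match. The helper names are hypothetical, and lowercasing plus whitespace splitting stands in for the real analyzer.

```python
# Sketch of minimum_should_match on a bool query with two should clauses.
# Lowercasing and splitting on whitespace is an assumption standing in for
# the analyzer; it is enough for this two-document corpus.

docs = {
    "1": {"field_1": "blue red", "field_2": "elephant pig"},
    "2": {"field_1": "blue elephant", "field_2": "pig red"},
}

def match_clause(doc, field, text):
    """True if any query term appears in the field (match query, OR operator)."""
    field_terms = set(doc[field].lower().split())
    return any(term in field_terms for term in text.lower().split())

def bool_should(doc, clauses, minimum_should_match):
    """True if at least minimum_should_match of the should clauses match."""
    matched = sum(match_clause(doc, field, text) for field, text in clauses)
    return matched >= minimum_should_match

clauses = [("field_1", "blue elephant"), ("field_2", "blue elephant")]
hits = [doc_id for doc_id, doc in docs.items() if bool_should(doc, clauses, 2)]
print(hits)  # ['1']
```

Only document 1 has a matching term in both fields, so it is the only hit when both clauses are required. Document 2 matches only the field_1 clause.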

