Why doesn't query_string honor the usual precedence rules?

Can anyone please explain why I get different results from the following query_string queries? They are identical except for the parentheses.

  1. A OR B AND C

  2. (A OR B) AND C

  3. A OR (B AND C)

The index has only three documents in total (details below). I expected #1 and #3 to return the same result, but all three results are different.

Actual results:
#1 returns only 1 doc
#2 returns 2 docs
#3 returns all 3 docs

Here is the test case. The index contains the following three documents:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "testquery",
        "_type": "_doc",
        "_id": "00IB737J1OF7D8RB4TUJ42LAES0082HF",
        "_score": 0.5753642,
        "_source": {
          "A": "3100007",
          "B": "Routed"
        }
      },
      {
        "_index": "testquery",
        "_type": "_doc",
        "_id": "00MJOP7K6OF7D1SA51UJ42LAES005PIJ",
        "_score": 0.2876821,
        "_source": {
          "A": "3100000",
          "B": "Terminated"
        }
      },
      {
        "_index": "testquery",
        "_type": "_doc",
        "_id": "00THNBNJVCF7D2PI51MJ42LAES005LV1",
        "_score": 0.2876821,
        "_source": {
          "A": "3100000",
          "B": "Routed"
        }
      }
    ]
  }
}

For A OR B AND C

Query:

POST testquery/_search
{
  "query": {
    "query_string": {
      "query": "A.keyword:3100000 OR A.keyword:3100007 AND B.keyword:Routed"
    }
  }
}

Result:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "testquery",
        "_type": "_doc",
        "_id": "00IB737J1OF7D8RB4TUJ42LAES0082HF",
        "_score": 0.5753642,
        "_source": {
          "A": "3100007",
          "B": "Routed"
        }
      }
    ]
  }
}

For (A OR B) AND C

Query:

POST testquery/_search
{
  "query": {
    "query_string": {
      "query": "(A.keyword:3100000 OR A.keyword:3100007) AND B.keyword:Routed"
    }
  }
}

Result:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "testquery",
        "_type": "_doc",
        "_id": "00IB737J1OF7D8RB4TUJ42LAES0082HF",
        "_score": 0.5753642,
        "_source": {
          "A": "3100007",
          "B": "Routed"
        }
      },
      {
        "_index": "testquery",
        "_type": "_doc",
        "_id": "00THNBNJVCF7D2PI51MJ42LAES005LV1",
        "_score": 0.5753642,
        "_source": {
          "A": "3100000",
          "B": "Routed"
        }
      }
    ]
  }
}

For A OR (B AND C)

Query:

POST testquery/_search
{
  "query": {
    "query_string": {
      "query": "A.keyword:3100000 OR (A.keyword:3100007 AND B.keyword:Routed)"
    }
  }
}

Result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "testquery",
        "_type": "_doc",
        "_id": "00IB737J1OF7D8RB4TUJ42LAES0082HF",
        "_score": 0.5753642,
        "_source": {
          "A": "3100007",
          "B": "Routed"
        }
      },
      {
        "_index": "testquery",
        "_type": "_doc",
        "_id": "00MJOP7K6OF7D1SA51UJ42LAES005PIJ",
        "_score": 0.2876821,
        "_source": {
          "A": "3100000",
          "B": "Terminated"
        }
      },
      {
        "_index": "testquery",
        "_type": "_doc",
        "_id": "00THNBNJVCF7D2PI51MJ42LAES005LV1",
        "_score": 0.2876821,
        "_source": {
          "A": "3100000",
          "B": "Routed"
        }
      }
    ]
  }
}

Does anyone know why this happens, or is it a defect?

I had a look into this using the validate query API.

Here's the command to debug aa OR bb AND cc:

GET githubcommits/_validate/query?q=aa+OR+bb+AND+cc&rewrite=true&df=myfield

The result is:

  "explanation" : "myfield:aa +myfield:bb +myfield:cc"

Lucene's Boolean query distinguishes between mandatory must clauses and should clauses, which are just nice-to-haves once any must clause is present. In the explanation above, the + prefix marks bb and cc as must clauses, so aa is relegated to a wholly optional should clause that only gives extra scoring points to documents that already contain both bb and cc.
If you want pure OR logic in Lucene, you need a Boolean query with should clauses but no must clauses. Something like this:

bool
    should
         aa
         bool
              must
                  bb
                  cc

Note the use of a nested bool query above to get the required logic.
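For the index in the question, that nested structure could be written in the Query DSL roughly as follows. This is only a sketch: the `term` helper is a hypothetical convenience function, the field names and values are taken from the question, and `minimum_should_match` is spelled out even though 1 is already the default for a bool query that has should clauses and no must or filter clauses.

```python
import json


def term(field, value):
    """Hypothetical helper: build a term clause for an exact keyword match."""
    return {"term": {field: value}}


# A OR (B AND C): the outer bool has only should clauses, so a document
# matches if at least one of them matches. The AND part lives in a nested
# bool query with two must clauses.
query = {
    "query": {
        "bool": {
            "should": [
                term("A.keyword", "3100000"),
                {
                    "bool": {
                        "must": [
                            term("A.keyword", "3100007"),
                            term("B.keyword", "Routed"),
                        ]
                    }
                },
            ],
            "minimum_should_match": 1,
        }
    }
}

# Print the body that would be sent to POST testquery/_search
print(json.dumps(query, indent=2))
```

The printed JSON can be pasted as the body of a `POST testquery/_search` request in place of the query_string version.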
The introduction of brackets in query_string syntax forces the creation of these sub boolean clauses and makes the logic behave in a more predictable way.
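The must/should rules described above can be checked against the three documents from the question with a small simulation. This is a simplified model, not Lucene itself: `term` and `bool_query` are hypothetical helpers, scoring is ignored, and the parsed forms of the two bracketed queries are assumed to be `+(A B) +C` and `A (+B +C)`, by analogy with the validate output for the unbracketed one.

```python
def term(field, value):
    """Hypothetical helper: match documents whose field equals value."""
    return lambda doc: doc.get(field) == value


def bool_query(must=(), should=()):
    """Simplified Boolean matching: if any must clause exists, only the must
    clauses decide whether a document matches (should clauses would only
    affect the score). With no must clauses, at least one should clause
    has to match."""
    def matches(doc):
        if must:
            return all(clause(doc) for clause in must)
        return any(clause(doc) for clause in should)
    return matches


# The three documents from the question (source fields only).
docs = [
    {"A": "3100007", "B": "Routed"},
    {"A": "3100000", "B": "Terminated"},
    {"A": "3100000", "B": "Routed"},
]

a = term("A", "3100000")
b = term("A", "3100007")
c = term("B", "Routed")

# 1. A OR B AND C   -> A +B +C   (A is an optional should clause)
q1 = bool_query(must=[b, c], should=[a])
# 2. (A OR B) AND C -> +(A B) +C (assumed parse)
q2 = bool_query(must=[bool_query(should=[a, b]), c])
# 3. A OR (B AND C) -> A (+B +C) (assumed parse)
q3 = bool_query(should=[a, bool_query(must=[b, c])])

counts = [sum(1 for d in docs if q(d)) for q in (q1, q2, q3)]
print(counts)  # [1, 2, 3]
```

The counts line up with the 1, 2, and 3 hits reported in the question, which is consistent with the unbracketed query being treated as "B AND C, with A as a scoring bonus".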

Weird, but I'd hesitate to call it a bug - more a quirk of Lucene.
For readability's sake alone I would advocate using brackets to make the logic clear.


Thank you very much for the great explanation! It makes more sense now. I will use brackets as you suggested!
