Query string operators seem to not be working correctly

My query is in this format:

{
  "query": {
    "query_string": {
      "default_field": "_all",
      "query": "QUERY",
      "default_operator": "AND"
    }
  }
}

Here are the different values of QUERY and their hit counts:

sofa 2,818
rugs 75,309
red 33,839

red AND rugs 9,441

red AND sofa 149
rugs AND sofa 82

sofa AND rugs AND red 3

(sofa OR rugs) AND red 9,587

(sofa OR rugs) red 9,587

sofa OR (rugs AND red) 12,256

sofa OR (rugs red) 12,256

*sofa OR* rugs AND red 9,441

*sofa OR rugs* red 33,839

The last two look like a bug: the parts between asterisks (bold in the original post) appear to be ignored.

expect sofa OR rugs AND red == sofa OR (rugs AND red) == 12,256

actual: sofa OR rugs AND red == rugs AND red == 9,441

expect sofa OR rugs red == sofa OR (rugs red) == sofa OR (rugs AND red) == 12,256

actual: sofa OR rugs red == red == 33,839
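
The expected values can be sanity-checked with inclusion-exclusion on the hit counts listed above. A minimal Python sketch (the variable names are illustrative; the numbers are the counts reported in this post):

```python
# Hit counts reported above (variable names are illustrative).
sofa = 2818
rugs_and_red = 9441
red_and_sofa = 149
all_three = 3

# |sofa OR (rugs AND red)| = |sofa| + |rugs AND red| - |sofa AND rugs AND red|
sofa_or_rugs_and_red = sofa + rugs_and_red - all_three

# |(sofa OR rugs) AND red| = |sofa AND red| + |rugs AND red| - |all three|
sofa_or_rugs_then_red = red_and_sofa + rugs_and_red - all_three

print(sofa_or_rugs_and_red)   # 12256
print(sofa_or_rugs_then_red)  # 9587
```

Both results agree with the counts returned for the explicitly parenthesized queries, so the parenthesized forms behave like ordinary boolean logic.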

Is this a bug / known issue? I am using ES 0.90.11

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a5039e4a-edf5-4177-9f80-312d6a24a82f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

It's a "feature" of query_string. What's happening is that this query:

sofa OR rugs AND red

actually means that "rugs" and "red" must always be present for a document
to match. If a matching document (one that contains both rugs and red) also
contains sofa, it is boosted ahead of the others.

This query:

sofa OR (rugs AND red)

actually means that a document matches if either "sofa" is present or both "rugs" and "red" are present. This is what you would expect from normal boolean logic.

The easiest way to see and understand what's happening is to use the _validate API like this:

curl -XPOST "http://localhost:9200/f/_validate/query?explain&pretty" -d '
{
  "query": {
    "query_string": {
      "query": "sofa OR rugs AND red",
      "default_operator": "AND"
    }
  }
}'

If you run _validate with explain on the other query as well, you can compare how each one is "interpreted".
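
For comparing several queries, the same request can be scripted. This is only a sketch using the Python standard library: the URL is copied from the curl command above, the function names are made up for illustration, and the Content-Type header is an assumption (newer Elasticsearch versions require it; it should be harmless on older ones).

```python
import json
import urllib.request

# URL copied from the curl command above; "f" is the index name used there.
VALIDATE_URL = "http://localhost:9200/f/_validate/query?explain&pretty"

def build_validate_request(query, default_operator="AND"):
    """Build (but do not send) the POST request for one query_string query."""
    body = {
        "query": {
            "query_string": {
                "query": query,
                "default_operator": default_operator,
            }
        }
    }
    return urllib.request.Request(
        VALIDATE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def explain(query):
    """Send the request and return the parsed JSON response (needs a running node)."""
    with urllib.request.urlopen(build_validate_request(query)) as response:
        return json.load(response)
```

Calling explain("sofa OR rugs AND red") and explain("sofa OR (rugs AND red)") against a running node and printing the two responses side by side shows how each query string was rewritten.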

Thank you Binh. That validate API with explain is quite helpful. The
feature seems a bit confusing because the API documentation for
query_string states that the logical operators follow this precedence order:
AND first, then OR.

Thus, when I see 'sofa OR rugs AND red', my brain translates that as follows:

  1. Apply the highest-precedence operator first: AND -> sofa OR (rugs AND red)

Could you explain why this would be a feature and how it does not conflict
with the API's definition of precedence?

Erich

I believe I understand the problem now.

  1. ES applies each operator only to the operands immediately to its left
    and right
  2. ES does not implicitly parenthesize groups after evaluating a higher
    precedence operator

Thus, with a default operator of AND:

Z B C OR D E F is interpreted as +Z +B C D +E +F

I expected it to be read as Z AND B AND C OR D AND E AND F, which would
evaluate to (+Z +B +C) OR (+D +E +F)

Does ES intend to change this to match the latter expectation? It seems
the only workaround right now is to add explicit parentheses:
(Z B C) OR (D E F)

Erich,

A colleague pointed me to a much more complete explanation than I could
ever give:

http://searchhub.org//2011/12/28/why-not-and-or-and-not/

But the short of it is that it is working as expected; you just need to
"map" it back to Lucene Boolean logic to fully understand why and how it works.

Thanks Binh!

To summarize for everyone else:

  1. Queries are parsed left to right
  2. NOT sets the Occurs flag of the clause to its right to MUST_NOT
  3. AND will change the Occurs flag of the clause to its left to MUST
    unless it has already been set to MUST_NOT
  4. AND sets the Occurs flag of the clause to its right to MUST
  5. If the default operator of the query parser has been set to “And”: OR
    will change the Occurs flag of the clause to its left to SHOULD unless it
    has already been set to MUST_NOT
  6. OR sets the Occurs flag of the clause to its right to SHOULD

Practically speaking this means that NOT takes precedence over AND which
takes precedence over OR — but only if the default operator for the query
parser has not been changed from the default (“Or”). If the default
operator is set to “And” then the behavior is just plain weird.
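
The six rules above can be played through with a small model. This is a simplified sketch, not Lucene's parser: it only handles flat queries made of bare terms plus AND, OR and NOT (no parentheses, phrases, fields or +/- prefixes), and the name occur_flags is hypothetical.

```python
MUST, SHOULD, MUST_NOT = "+", "", "-"

def occur_flags(query, default_operator="OR"):
    """Apply the six rules to a flat query and return it in +/- notation."""
    clauses = []     # list of [flag, term]; flags may be rewritten later
    conj = None      # operator seen since the previous term: "AND", "OR" or None
    negate = False   # True when NOT directly precedes the next term
    for token in query.split():          # rule 1: strictly left to right
        if token in ("AND", "OR"):
            conj = token
            continue
        if token == "NOT":
            negate = True
            continue
        # Rules 3 and 5: the operator rewrites the clause on its left,
        # unless that clause is already MUST_NOT.
        if clauses and clauses[-1][0] != MUST_NOT:
            if conj == "AND":
                clauses[-1][0] = MUST
            elif conj == "OR" and default_operator == "AND":
                clauses[-1][0] = SHOULD
        # Rules 2, 4 and 6 set the flag of the clause on the right; with no
        # explicit operator the default operator decides.
        if negate:
            flag = MUST_NOT
        elif conj == "AND":
            flag = MUST
        elif conj == "OR":
            flag = SHOULD
        else:
            flag = MUST if default_operator == "AND" else SHOULD
        clauses.append([flag, token])
        conj, negate = None, False
    return " ".join(flag + term for flag, term in clauses)

print(occur_flags("sofa OR rugs AND red", "AND"))  # sofa +rugs +red
print(occur_flags("sofa OR rugs red", "AND"))      # sofa rugs +red
print(occur_flags("Z B C OR D E F", "AND"))        # +Z +B C D +E +F
```

The first two lines of output line up with the counts from the original question: "sofa +rugs +red" matches the same documents as rugs AND red (9,441), and "sofa rugs +red" matches the same documents as red alone (33,839).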

Erich
