OR statement braking the "NOT" statement ElasticSearch


(acv2) #1

Hi Everyone,
I'm here to give back to community, i spent hours looking for an answer for my problem without luck, so I decided to share my findings with you all, so we can all take a look on the mystic of elasticsearch.

My problem was very simple, but not easy as it seems:
i needed to search for a field where the fields match a keyword or it were simply blank either null
ex:

SELECT * FROM bucket1 WHERE country="canada" AND (language=en OR language=NULL)

simple right?

for my query i constructed the following:
bucket/document/_search [POST]
{
"size": 10,
"query": {
"query_string": {
"query": "country:ca AND region1:quebec AND (language:en OR -language:*)"
}
}
}

It brought me 0 (zero) results what made me think.. because when i did the split query, it broght me around 35k results
ex:
{
"size": 10,
"query": {
"query_string": {
"query": "country:ca AND region1:quebec AND (language:en)"
}
}
}
32k results
THEN
{
"size": 10,
"query": {
"query_string": {
"query": "country:ca AND region1:quebec AND (-language:*)"
}
}
}
3k results

so i decided go for different approach
{
"size": 10,
"query": {
"query_string": {
"query": "(region1:quebec AND -geo_city:* ) OR (region1:quebec AND geo_city:montreal )"
}
}
}
35K results
conclusion NOT statement is broking everything if set with an OR statement.
after that i finally did the following:
{
"size": 10,
"query": {
"query_string": {
"query": "country:ca AND region1:quebec AND ((language:en) OR (-language:*))"
}
}
}
35k results, and it was working

now im looking for a different approach to make this query more efficient, cuz the NOT statement is taking really long time to excecute, around x20 factor than without the NOT statement.

i hope to have some feedbacks on this =)

thanks in advance


(Doug Turnbull) #2

A OR NOT B is actually not a possible thing to express surprising enough in
the query syntax.

The reason? Lucene's operators, though often appearing like "OR" "AND" etc
are actually not binary operators. Instead of OR that takes two parameters,
you have SHOULD that takes one parameter. Instead of AND that takes two
parameters, you really have MUST that takes one paramater. NOT as you expect operates on a single query, using MUST_NOT.

In other words

A OR B

is ACTUALLY

SHOULD(A) SHOULD(B)

A OR NOT B

you'd imagine would be something like

SHOULD(A) MUST_NOT(SHOULD(B))

but you can only run MUST_NOT(B) or you can run SHOULD(B) or you can run
MUST(B).

MUST_NOT(SHOULD(B)) or SHOULD(MUST_NOT(B)) is kind of nonsensical if you think about it. Its saying Not
having B would be nice and docs with this property should be favored in
scoring. Sadly you can only exclude with NOT you can't bump things up or
down via scoring with NOT. NOT is purely an exclusionary operator.


(Doug Turnbull) #3

If you want to be exacting and precise with boolean queries, I strongly recommend using the bool query directly.


(acv2) #4

this was really enlightening, thank you for sharing i didnt know that


(system) #5