Why is this wildcard and query returning zero results?


(Dan Tuffery) #1

I have a query that returns results. For example, let's say the following
query returns 5 results:

{
  "from" : 0,
  "size" : 20,
  "query" : {
    "query_string" : {
      "query" : "name:elastic OR title:elastic"
    }
  }
}

If I add an AND wildcard clause to the query above, it returns zero results:

{
  "from" : 0,
  "size" : 20,
  "query" : {
    "query_string" : {
      "query" : "name:elastic OR title:elastic AND description:*"
    }
  }
}

My description field is never null. If it doesn't have a value, it will be an
empty string.

In the documentation it says:

"Supported wildcards are *, which matches any character sequence (including
the empty one)"

So why isn't the second query returning the same results as the first?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Brian Yoder) #2

Good question.

If I recreate your search but in my own database (with only 2 documents
remaining because all the other documents have been expired), with 1
document matching the first term, another document matching the second
term, and both documents matching the text:* term, I see the same behavior
as you do:

  1. If I omit the third (last) term, I get 2 search hits, as expected:

"query_string" : { "query" : "cn:celeborn OR cn:galadriel" }

{ "cn" : "Celeborn" , "text" : "Lives forever" }
{ "cn" : "Galadriel" , "text" : "Lives forever" }

  2. If I omit the first two terms, I get 2 search hits, as expected:

"query_string" : { "query" : "text:*" }

{ "cn" : "Celeborn" , "text" : "Lives forever" }
{ "cn" : "Galadriel" , "text" : "Lives forever" }

  3. If I surround the first two terms in parentheses, I get the 2 search
    hits as expected:

"query_string" : { "query" : "(cn:celeborn OR cn:galadriel) AND text:*" }

{ "cn" : "Celeborn" , "text" : "Lives forever" }
{ "cn" : "Galadriel" , "text" : "Lives forever" }

  4. If I surround the last two terms in parentheses, I get the 2 search hits
    as expected:

"query_string" : { "query" : "cn:celeborn OR (cn:galadriel AND text:*)" }

{ "cn" : "Celeborn" , "text" : "Lives forever" }
{ "cn" : "Galadriel" , "text" : "Lives forever" }

But without any parentheses, I only get one search hit:

"query_string" : { "query" : "cn:celeborn OR cn:galadriel AND text:*" }

{ "cn" : "Galadriel" , "text" : "Lives forever" }

It seems that with the AND and OR operators strung along, Lucene's query
parser doesn't know whether to **** or go blind (as the expression goes).
Very sad.

By the way, this is in Lucene 4 inside ES 0.90.3.

It looks like a Lucene query parser bug, not a wildcard bug. For example,
after replacing text:* with text:lives, I see the same one record returned:

"query_string" : { "query" : "cn:celeborn OR cn:galadriel AND text:lives" }

{ "cn" : "Galadriel" , "text" : "Lives forever" }

Brian



(Jörg Prante) #3

This is perfectly valid. It shows how Lucene operator precedence works
under the hood.

Lucene parses a query string from left to right, and creates intermediary
structures for documents that match the condition so far.

As a side note, using a clause description:* is not equivalent to "match
every term in the field" in a Lucene boolean expression.

If you want a condition "match every term in field" you can either drop
such a clause (if a field always exists, it always evaluates to true) or
you must work with the NULL negation trick: index a JSON "null" or a string
"NULL" to each field in a document where it is supposed to be empty, and
then you can query with negation for the term NULL to filter out the
documents.

Consider this gist for a quick demo:

Like in the example (query 12)

name:abc OR name:def NOT text:NULL

This will return all documents with either name "abc" or "def" and where
the field "text" is not empty.
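
The NULL marker approach could be sketched like this in Python. This is only an illustration under assumptions: the helper names fill_empty and not_empty_query are made up for this example, and it presumes the marker survives analysis the same way at index time and at query time.

```python
NULL_MARKER = "NULL"  # placeholder value indexed in place of an empty field


def fill_empty(doc, fields):
    """Return a copy of doc where empty or missing fields hold the marker."""
    filled = dict(doc)
    for field in fields:
        if filled.get(field) in (None, ""):
            filled[field] = NULL_MARKER
    return filled


def not_empty_query(names, field):
    """Build a query_string body: any of the names, minus docs marked empty."""
    alternatives = " OR ".join("name:" + name for name in names)
    query = "{} NOT {}:{}".format(alternatives, field, NULL_MARKER)
    return {"query": {"query_string": {"query": query}}}


# Prepare a document before indexing, then build the search body.
doc = fill_empty({"name": "abc", "text": ""}, ["text"])
body = not_empty_query(["abc", "def"], "text")
```

Documents would be passed through fill_empty before indexing, so that the negated clause has a concrete term to exclude.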

Jörg



(Brian Yoder) #4

Jörg,

Thanks for the tip about the query for a non-NULL field. Yeah, I didn't
think that description:* was the best way to do that. But leaving that
aside, the question is about Lucene's precedence rules and behavior:

"This is perfectly valid. It shows how Lucene operator precedence works
under the hood.

Lucene parses a query string from left to right, and creates
intermediary structures for documents that match the condition so far."

Left-to-right precedence means that the abstract term expression
a OR b AND c is equivalent to (a OR b) AND c. Left to right. Correct?

But here's the rub. The difference between theory and practice is that in
theory, there is no difference.

This is the 3-term query (note: no non-NULL checking) with explicitly
specified left-to-right precedence:

"query_string" : { "query" : "(cn:celeborn OR cn:galadriel) AND text:lives" }

When I explicitly specify the precedence as left to right, it returns the 2
documents as expected:

{ "cn" : "Celeborn" , "text" : "Lives forever" }
{ "cn" : "Galadriel" , "text" : "Lives forever" }

This is the equivalent 3-term query, but this time we let Lucene handle
the left-to-right precedence:

"query_string" : { "query" : "cn:celeborn OR cn:galadriel AND text:lives" }

But this query returns only the second document:

{ "cn" : "Galadriel" , "text" : "Lives forever" }

Brian



(Jörg Prante) #5

For Lucene, 2-clause queries are not equivalent to 3-clause queries. A
3-clause query only makes sense if the default search mode is OR (for AND,
the results are very weird).

a OR b AND c takes a different execution path than (a OR b) AND c. The
parser does not group operands by precedence. It walks the clauses from
left to right, and an AND marks the clauses on both sides of it as required
("must"), while an OR leaves its clauses optional ("should"). So b and c
become required and a stays optional: the result is the docs matching both
b and c, and a only influences scoring. If there are no docs for b or no
docs for c, the result is always empty.

(a OR b) AND c is executed so that docs for a are searched, and docs for b
are searched. If there are no docs for this operation, or no docs for c,
the result is empty; otherwise, the intersection of docs between the two
"must" searches is returned.

With prefix notation in Lucene's SHOULD/MUST logic, I hope the difference
is visible:

a OR b AND c --> (a +b +c)

(a OR b) AND c --> (+(a b) +c)
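
To make the mapping from a flat AND/OR string to +/- prefixes easier to follow, here is a small Python model of the rule as I understand it for the classic parser with OR as the default operator. It is a simplified sketch, not Lucene code: flat_to_occur is a made-up name, and parentheses, quoting and explicit +/- prefixes in the input are not handled.

```python
def flat_to_occur(query):
    """Rewrite a flat AND/OR/NOT query string into +/- prefix notation."""
    clauses = []            # each entry is [prefix, term]
    conj, negated = None, False
    for token in query.split():
        if token in ("AND", "OR"):
            conj = token    # remember the operator in front of the next term
        elif token == "NOT":
            negated = True
        else:
            # AND also makes the preceding clause required, unless prohibited.
            if conj == "AND" and clauses and clauses[-1][0] != "-":
                clauses[-1][0] = "+"
            if negated:
                prefix = "-"
            elif conj == "AND":
                prefix = "+"
            else:
                prefix = ""  # optional ("should") clause
            clauses.append([prefix, token])
            conj, negated = None, False
    return " ".join(prefix + term for prefix, term in clauses)


print(flat_to_occur("cn:celeborn OR cn:galadriel AND text:lives"))
# prints: cn:celeborn +cn:galadriel +text:lives
```

In this model only the Galadriel clause and the text clause end up required, which agrees with the single hit reported earlier in the thread for the query without parentheses.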

I think a clear picture is given in this document

http://searchhub.org/2011/12/28/why-not-and-or-and-not/

So you are right to recommend always using parentheses in boolean
expressions, since with parentheses, Lucene can be forced to behave like we
all learnt in school for evaluating 2-clause queries. Note that performance
may be affected in the "AND/OR with parentheses" mode; most of the time,
performance is better with SHOULD/MUST queries. This is the reason why many
recommend using SHOULD/MUST operators.
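
The SHOULD/MUST form can also be spelled out without the query string parser, as a bool query. The following Python sketch builds such a request body for the example from this thread. It assumes the fields are analyzed so that the indexed terms are lowercase; the term helper is made up for brevity.

```python
import json


def term(field, value):
    """Small helper for a term clause (illustrative only)."""
    return {"term": {field: value}}


# Equivalent of +(cn:celeborn cn:galadriel) +text:lives
query = {
    "query": {
        "bool": {
            "must": [
                # A bool with only should clauses needs at least one of them to match.
                {"bool": {"should": [term("cn", "celeborn"), term("cn", "galadriel")]}},
                term("text", "lives"),
            ]
        }
    }
}

print(json.dumps(query, indent=2))
```

Nesting the optional clauses inside a required bool keeps the grouping explicit, so the result no longer depends on how a flat string is parsed.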

Jörg



(Brian Yoder) #6

Jörg,

Wow! THANK YOU!!!! That searchhub link really opened my eyes!

Now, I must say in response that the link
http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query/ threw
me off course.

In there it gives the following example:

{
  "query_string" : {
    "default_field" : "content",
    "query" : "this AND that OR thus"
  }
}

And the AND and OR is the first thing I saw and therefore the thing that
stuck in my head the most prominently.

Instead, the ES guide page SHOULD be updated to +this +that thus (or
whatever the equivalent actually might be), or it SHOULD add parentheses.
But it MUST include your link!

I've already updated my own local documentation to reference your link.
Good stuff. I've been playing around with it, and it really does work
nicely. Just not how I naively expected in the beginning.

Thanks again, both for the information and for your patience!

Brian



(Brian Yoder) #7

Dan,

Based on the information and excellent link from Jörg, here is the
query string that expresses what yours intended:

"query_string" : {
  "query" : "name:elastic title:elastic +description:*"
}

Of course, as also noted there is a much better way to do the "description
is not null" clause. But the filter above is logically correct and works as
intended. I tried it in my own examples and they all worked just as I had
expected.
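
One thing worth checking against your own data: in Lucene's boolean logic, optional clauses next to a required clause only influence scoring. The toy Python model below (not Lucene itself; the documents and the helper names has and boolean are invented, and description:* is modelled as "field is non-empty") compares the flat form with a grouped form.

```python
docs = [
    {"name": "elastic", "title": "other", "description": "search engine"},
    {"name": "other", "title": "elastic", "description": "search engine"},
    {"name": "other", "title": "other", "description": "unrelated"},
    {"name": "elastic", "title": "elastic", "description": ""},
]


def has(field, value=None):
    """Clause matching a non-empty field (value None) or an exact value."""
    def clause(doc):
        current = doc.get(field, "")
        return bool(current) if value is None else current == value
    return clause


def boolean(must=(), should=()):
    """Every must clause has to match; should clauses are only required
    when there is no must clause at all."""
    def clause(doc):
        if not all(m(doc) for m in must):
            return False
        if must:
            return True
        return any(s(doc) for s in should)
    return clause


name_or_title = [has("name", "elastic"), has("title", "elastic")]

# name:elastic title:elastic +description:*
flat = boolean(must=[has("description")], should=name_or_title)
# +(name:elastic title:elastic) +description:*
grouped = boolean(must=[boolean(should=name_or_title), has("description")])

flat_hits = [i for i, doc in enumerate(docs) if flat(doc)]
grouped_hits = [i for i, doc in enumerate(docs) if grouped(doc)]
print(flat_hits, grouped_hits)  # prints: [0, 1, 2] [0, 1]
```

In this model the flat form also returns the third document, which matches neither name nor title. If that document should be excluded, the grouped form +(name:elastic title:elastic) +description:* may be closer to the original intent.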

Brian



(Dan Tuffery) #8

Thanks both. It is working now :)


