Stopwords and minimum_should_match in multi-field query_string


(Matthew A. Brown) #1

I've run into an interesting conundrum. I don't think this is a bug,
but I'm also not sure how to get the behavior I want, so I was hoping
someone might have a brilliant idea.

Let's say I've got a type "books", with "title" and "author" fields.
Let's further say that I'm indexing "title" with a stopwords filter,
but "author" has no stopwords filter.

The following document is the only one in the index:

{"title":"The Great Gatsby", "author":"F. Scott Fitzgerald"}

Now let's say I want to perform the following search:

{"query": {"query_string": {"query": "the great gatsby", "fields":
["title", "author"], "minimum_should_match": 3}}}

This search won't return any results. I believe this is because the
minimum_should_match is 3, and the stopwords filter is dropping "the".
So, only two tokens match, but since the original input string had 3
tokens, it's still looking for a 3-token match.

However, the behavior is different if I only search the "title" field:

{"query": {"query_string": {"query": "the great gatsby", "fields":
["title"], "minimum_should_match": 3}}}

In this case, I do get a result back, presumably because the sole
analyzer in play generates only two search tokens, so that becomes
the ceiling for the purposes of the minimum_should_match.
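
To make the token counts concrete, here is a minimal Python sketch of the two analyzers described above. It is only a toy: the whitespace tokenization, the lowercasing, the one-word stopword list, and the function names are all assumptions made for illustration, not the actual analyzers or any Elasticsearch API.

```python
# Toy stand-ins for the two field analyzers in the example.
# Assumptions: whitespace tokenization, lowercasing, and a stopword
# list that contains only "the".
STOPWORDS = {"the"}


def analyze_title(text):
    """Hypothetical "title" analyzer: lowercase, split, drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def analyze_author(text):
    """Hypothetical "author" analyzer: lowercase and split only."""
    return text.lower().split()


query = "the great gatsby"
title_tokens = analyze_title(query)
author_tokens = analyze_author(query)

# The same input yields a different number of search tokens per field.
print("title:", title_tokens)
print("author:", author_tokens)
```

With these assumptions the "title" analyzer produces two tokens and the "author" analyzer produces three, which is the mismatch the question is about.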

I'm not sure how that ceiling is calculated by the DisMax parser -- is
it the max # of terms generated by any component of the disjunction? I
don't suppose it would be possible for the minimum_should_match to be
"local" to each component? This could easily be getting pretty deep
into Lucene internals.

Anyway, just wondering if anyone has any fantastic insight : )

Thanks!

Mat


(Shay Banon) #2

Yeah, minimum_should_match is a bit meaningless with multi-field querying,
because the generated boolean query is made up of several dis max queries,
each broken down into queries on each "field" parsed by the query parser.
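
One way to picture that structure is the toy Python model below. It is a sketch under stated assumptions, not Elasticsearch's or Lucene's implementation: the analyzers are hypothetical (whitespace split, lowercase, "the" as the only stopword), each query word becomes one dis-max-style group of per-field clauses, and the requirement is capped at the number of groups. The cap is inferred only from the single-field behaviour reported in this thread.

```python
STOPWORDS = {"the"}

# Hypothetical per-field analyzers (assumptions, not the real ones).
ANALYZERS = {
    "title": lambda text: [t for t in text.lower().split() if t not in STOPWORDS],
    "author": lambda text: text.lower().split(),
}

DOC = {"title": "The Great Gatsby", "author": "F. Scott Fitzgerald"}


def build_groups(query, fields):
    """Return one dis-max-style group of (field, token) clauses per query word."""
    groups = []
    for word in query.split():
        clauses = [(f, tok) for f in fields for tok in ANALYZERS[f](word)]
        if clauses:  # a word dropped by every field's analyzer yields no group
            groups.append(clauses)
    return groups


def matches(doc, query, fields, minimum_should_match):
    """Toy check: do enough groups have at least one matching clause?"""
    indexed = {f: set(ANALYZERS[f](doc[f])) for f in fields}
    groups = build_groups(query, fields)
    hits = sum(any(tok in indexed[f] for f, tok in group) for group in groups)
    # Assumption: the requirement is capped at the number of groups.
    required = min(minimum_should_match, len(groups))
    return bool(groups) and hits >= required


print(matches(DOC, "the great gatsby", ["title", "author"], 3))
print(matches(DOC, "the great gatsby", ["title"], 3))
```

In this model, searching both fields keeps a group for "the" (from "author") that the document cannot satisfy, so three groups are required and only two match. Searching "title" alone leaves two groups, both of which match.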

In reply to Matthew A. Brown (mat.a.brown@gmail.com), Mon, Apr 2, 2012 at 11:07 PM.



(Matthew A. Brown) #3

Thanks, Shay! Any suggestions on how to get the behavior I'm looking for?

In reply to Shay Banon (kimchy@gmail.com), Tue, Apr 3, 2012 at 10:30.



(Shay Banon) #4

The only thing that I can think of is using dismax explicitly, with
several query_string queries in it. It won't behave the same as specifying
fields in the query_string (the dismax will be on the whole parsed query),
but it might make sense.
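
A request body along those lines might look like the sketch below, built in Python so it can be printed as JSON. The helper name `per_field_dis_max` is hypothetical, and giving each inner query_string a single-entry "fields" list with its own minimum_should_match is one reading of the suggestion, not a confirmed recipe.

```python
import json


def per_field_dis_max(query_text, fields, minimum_should_match):
    """Build a dis_max body holding one single-field query_string per field.

    Hypothetical helper: each field gets its own query_string, so each
    is parsed and analyzed separately from the others.
    """
    return {
        "query": {
            "dis_max": {
                "queries": [
                    {
                        "query_string": {
                            "query": query_text,
                            "fields": [field],
                            "minimum_should_match": minimum_should_match,
                        }
                    }
                    for field in fields
                ]
            }
        }
    }


body = per_field_dis_max("the great gatsby", ["title", "author"], 3)
print(json.dumps(body, indent=2))
```

As noted above, this is not equivalent to listing both fields in one query_string: here the dis max is taken over whole per-field queries rather than per term, so a document has to satisfy the query within a single field.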

In reply to Matthew A. Brown (mat.a.brown@gmail.com), Tue, Apr 3, 2012 at 5:34 PM.



(Matthew A. Brown) #5

Thanks, Shay. I did try that but as you mentioned, the behavior isn't
quite equivalent. I'll have to do some more thinking on it...

In reply to Shay Banon (kimchy@gmail.com), Wed, Apr 4, 2012 at 09:49.


