Confused about query_string and the use of wildcards

Hi Enrique

All my confusion is about the expected behaviour of the wildcard
queries. So let's say, if my user wants to search for "iPhone" and I
run a query with wildcards "*iPhone*" then:

The wildcards should work as you expect them to work - from your
example, you expect them to find any word containing "iPhone", and that
is correct.

  1. As it is working now, it will not analyze the search term, but just
    use iPhone as the token itself, therefore not finding "iPhone" which
    has a token of "iphon".

If it weren't analyzed, then the term resulting from "iPhone" would be
"iPhone" not "iphone" so even case would make a difference.

  2. As I expected it to be, i.e. the "iPhone" is analyzed into
    "iphon" and then the search is executed, and "iPhone" results are
    returned.

This is the correct behaviour.

Try it out without the snowball stemmer:

gist:fa32a01ff47b2c2377d9 · GitHub

You'll see that it works as you expect.

clint

Try:

{
   "query" : {
      "field" : {
         "_all" : "iphone"
      }
   }
}

and you'll see that it doesn't work either...

It works for me:

[Wed Mar 16 16:36:17 2011] Protocol: http, Server: 192.168.5.103:9200

curl -XGET 'http://127.0.0.1:9200/test/_search?pretty=1' -d '
{
   "query" : {
      "field" : {
         "_all" : "iphone"
      }
   }
}
'

[Wed Mar 16 16:36:17 2011] Response:

{
   "hits" : {
      "hits" : [
         {
            "_source" : {
               "text" : "I have in iPhone"
            },
            "_score" : 1,
            "_index" : "test",
            "_id" : "O9tKwG64Td2_lRhfBm_jag",
            "_type" : "doc"
         }
      ],
      "max_score" : 1,
      "total" : 1
   },
   "timed_out" : false,
   "_shards" : {
      "failed" : 0,
      "successful" : 5,
      "total" : 5
   },
   "took" : 2
}

What version of ES are you using? What other settings do you have?

clint

This is ES 0.16.0.SNAPSHOT, and IMHO your example works because you didn't
use a Spanish analyzer but the standard one, which basically generates the
same tokens as the original words (it doesn't stem as the Spanish one does).

The more I think about this the more I get convinced that the issue is with
analyzers using stemmers, and the fact that the search ignores the stemmer
when parsing the query terms...

On Wed, 2011-03-16 at 16:55 +0100, Enrique Medina Montenegro wrote:

The more I think about this the more I get convinced that the issue is
with analyzers using stemmers, and the fact that the search ignores
the stemmer when parsing the query terms...

Exactly. That's why I said there is a bug.

clint

And just to make it clearer, the bug would occur ONLY when using
wildcards...

I'm afraid this is a Lucene bug, though; Shay can comment on it in the
issue...

In the meantime, a workaround is to analyze the search words manually before
composing the query, which is easy to implement but obviously adds some
hassle to the process.

Hi,

Yea, wildcard text does not get analyzed when used in a query string. This is for the simple reason that analysis might produce more than one token and then it can be confusing as to where to apply the wildcard element.

Having said that, I think I saw somewhere a query parser that did analysis on wildcard. I need to find it. Maybe it can be added as an option.

-shay.banon
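
To make the mismatch discussed above concrete, here is a minimal Python sketch. It does not use Elasticsearch or Lucene: `toy_analyze` is a hypothetical stand-in for a stemming analyzer (lowercase, then strip one trailing "e"), and `fnmatchcase` stands in for wildcard term matching. It only illustrates why a wildcard term that is not analyzed can miss the stemmed tokens stored in the index.

```python
from fnmatch import fnmatchcase


def toy_analyze(text):
    """Hypothetical stand-in for a stemming analyzer: lowercase each word
    and strip one trailing 'e' (so "iPhone" becomes "iphon")."""
    tokens = []
    for word in text.split():
        token = word.lower()
        if token.endswith("e") and len(token) > 3:
            token = token[:-1]
        tokens.append(token)
    return tokens


def wildcard_search(index_tokens, pattern):
    """Match the pattern against the indexed tokens exactly as given,
    i.e. without analyzing the pattern first (case-sensitive)."""
    return [t for t in index_tokens if fnmatchcase(t, pattern)]


# Index time: the document text goes through the analyzer.
index_tokens = toy_analyze("I have an iPhone")

# Query time, wildcard term NOT analyzed: no indexed token contains "iPhone".
raw_hits = wildcard_search(index_tokens, "*iPhone*")

# Query time, term analyzed by hand first: the stemmed form matches.
stem = toy_analyze("iPhone")[0]
analyzed_hits = wildcard_search(index_tokens, "*" + stem + "*")

print(index_tokens)
print(raw_hits)
print(analyzed_hits)
```

With this toy analyzer the raw pattern finds nothing, while the hand-analyzed pattern finds the stemmed token. A real snowball stemmer will produce different stems, but the shape of the problem is the same.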

Yea, wildcard text does not get analyzed when used in a query
string. This is for the simple reason that analysis might produce more
than one token and then it can be confusing as to where to apply the
wildcard element.

Gaah - so after all that, I was wrong!

Apologies Enrique

clint

No apologies, that's what users' lists are made for... discussion, right?

In any case, I think that the workaround explained earlier in this thread
could work fine if you need stemming in a wildcard search ;-)

Shay,

What would be an example of a word producing more than one token?

Thanks.

It really depends on the analyzer you use, and because it's unknown at the query parser level, I guess it is not supported. There are analyzers that can result in multiple tokens, like ngram ones.
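
As a rough illustration of the multi-token case, the sketch below builds character n-grams in plain Python. `char_ngrams` is a hypothetical helper, not an Elasticsearch or Lucene API, and the gram sizes are arbitrary. The point is only that a single input word can yield many tokens, so there is no single obvious token to attach the wildcard to.

```python
def char_ngrams(word, min_gram=2, max_gram=3):
    """Return all character n-grams of the lowercased word, shortest first."""
    word = word.lower()
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(word) - size + 1):
            grams.append(word[start:start + size])
    return grams


# One word in, several tokens out.
tokens = char_ngrams("iPhone")
print(len(tokens), tokens)
```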

As mentioned earlier in this thread, I eventually went for the workaround of
analyzing the search terms before enclosing them between wildcards and
executing the 'query_string'. As my analyzer is a simple Spanish snowball, I
guess I shouldn't get more than 1 token per word, but in any case, my code
would only pick up the first one.
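
The workaround could look roughly like the following Python sketch. The names `build_wildcard_query` and `toy_spanish_stem` are hypothetical, the stemmer is a toy stand-in for the real snowball analyzer, and the request body assumes a `query_string` query with `default_field` and `query` keys. In practice the tokens would come from whatever analyzer the index actually uses (for example via the Analyze API), and special query_string characters in the user input would need escaping, which this sketch does not do.

```python
import json


def toy_spanish_stem(word):
    """Toy stand-in for a snowball stemmer: lowercase and drop a final vowel."""
    token = word.lower()
    if len(token) > 3 and token[-1] in "aeiou":
        token = token[:-1]
    return [token]


def build_wildcard_query(user_input, analyze, field="_all"):
    """Analyze each search word, keep only its first token, wrap it in
    wildcards, and build a query_string request body."""
    parts = []
    for word in user_input.split():
        tokens = analyze(word)
        if tokens:  # skip words the analyzer drops entirely (e.g. stopwords)
            parts.append("*" + tokens[0] + "*")
    return {
        "query": {
            "query_string": {
                "default_field": field,
                "query": " ".join(parts),
            }
        }
    }


body = build_wildcard_query("iPhone", toy_spanish_stem)
print(json.dumps(body, indent=2))
```

Taking only the first token mirrors the behaviour described above; an analyzer that emits several tokens per word would need a different policy.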

So my question is: would this be a good feature to include in ES? Basically
there could be a property/parameter in the 'query_string' (and even in the
'wildcard' query) that would tell ES to analyze the search terms before
querying when they are enclosed between wildcards (or just have a leading or
trailing wildcard). By default it would be set to false, so the regular
behaviour applies.

Shay,

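I have filed an enhancement for your consideration to allow pre-analysis of
search terms in a wildcard 'query_string' query:

Query: Provide an option to analyze wildcard/prefix in query_string / field queries · Issue #787 · elastic/elasticsearch · GitHub

I'm not sure how to label it as an enhancement, so I have left it as an issue for now.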

Looking forward to hearing your feedback on it.

Regards.

Got it. Certainly possible.