Confused about query_string and the use of wildcards

Hi,

I'm struggling on why a simple query like this one:

"query": {
"query_string": {
"default_operator": "AND",
"query": "*phone"
}
}

does not return any results, whereas this one (the same query without the
leading '*'):

"query": {
"query_string": {
"default_operator": "AND",
"query": "phone"
}
}

does return some results. But cannot understand why, to be honest...

Thanks.

Maybe because you have terms that don't end with phone?
On Tuesday, March 15, 2011 at 9:52 PM, Enrique Medina Montenegro wrote:


Shay,

But what about iPhone? Shouldn't it be included as part of the results for
"*phone"?

Or maybe it's just that the '*' doesn't work here as a real wildcard, like
'%' does in SQL?

Thanks.

On Tue, Mar 15, 2011 at 11:56 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Maybe because you have terms that don't end with phone?


Hi Enrique

But what about iPhone? Shouldn't it be included as part of the results
for "*phone"?

Or maybe it's just that the '*' doesn't work here as a real wildcard,
as in SQL a '%'?

It is the same as % in SQL, and your example works for me.

I suggest you gist a complete curl recreation, from index creation, data
indexing, and searching to demonstrate the problem.

clint

Clinton,

I found the issue and it was on my default Spanish analyzer. For some
reason, iPhone gets analyzed in Spanish like this:

http://localhost:9200/mytest/_analyze?text=iPhone+4

{"tokens":[{"token":"iphon","start_offset":0,"end_offset":6,"type":"","position":1},{"token":"4","start_offset":7,"end_offset":8,"type":"","position":2}]}

whereas in the case of the default analyzer it gets like this:

http://localhost:9200/mytest/_analyze?text=iPhone+4&analyzer=standard

{"tokens":[{"token":"iphone","start_offset":0,"end_offset":6,"type":"","position":1},{"token":"4","start_offset":7,"end_offset":8,"type":"","position":2}]}

Hence, the token "iphon" in Spanish was not matching the "phone", but
matches "phon".

What do you recommend in these particular cases? Adding iPhone as a stop
word?

Thanks.

On Wed, Mar 16, 2011 at 11:02 AM, Clinton Gormley
clinton@iannounce.co.uk wrote:


Hi Enrique

I found the issue and it was on my default Spanish analyzer. For some
reason, iPhone gets analyzed in Spanish like this:

http://localhost:9200/mytest/_analyze?text=iPhone+4
{"tokens":[{"token":"iphon","start_offset":0,"end_offset":6,"type":"","position":1},{"token":"4","start_offset":7,"end_offset":8,"type":"","position":2}]}

Presumably you're using the snowball stemmer? It analyzes 'iphone' as
'iphon' to be able to recognise eg "cansada" and "cansado" as the same
stem.

All you need to do is to be sure that you're using the same analyzer at
index time as at search time.

You have a few options here:

  1. you're searching on a field (eg product_name) and you set the
    'analyzer' for that field to be the spanish stemmer, when
    you put the mapping

    Elasticsearch Platform — Find real-time answers at scale | Elastic

  2. you're searching on the '_all' field (which is the default)
    and you can set the analyzer for the '_all' field to
    be the spanish stemmer when you put the mapping

    Elasticsearch Platform — Find real-time answers at scale | Elastic

    (this is probably not what you want, as the _all field will contain
    some fields which shouldn't have the stemmer applied)

  3. you can't determine at mapping time which language you're
    going to be using at search time, and you specify the
    analyzer in the query_string query itself:

    Elasticsearch Platform — Find real-time answers at scale | Elastic

  4. you could do something wizzy per document with the _analyzer field

    Elasticsearch Platform — Find real-time answers at scale | Elastic
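For option 1, the mapping might look something like this when creating the index (a sketch in the 0.x API syntax of the time; the product type, name field, and the analyzer name spanish are all assumptions, standing in for whatever Spanish analyzer is registered):

```json
{
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "spanish"
        }
      }
    }
  }
}
```

Passed as the body of a PUT to the index (e.g. `curl -XPUT 'http://localhost:9200/mytest' -d @mapping.json`), the name field would then use the same analyzer at both index and search time.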

clint

Clinton,

Based on some other discussion with Shay, I defined this in the
elasticsearch.yml config file:

index:
  analysis:
    analyzer:
      default:
        type: es.cuestamenos.lucene.analizadores.SpanishAnalyzerProvider

And the analyzer is this one:

Custom analyzer for Spanish · GitHub

Shouldn't that be enough both for index and search?

Thanks.


Hi Enrique

Based on some other discussion with Shay, I defined this in the
elasticsearch.yml config file:

index:
  analysis:
    analyzer:
      default:
        type: es.cuestamenos.lucene.analizadores.SpanishAnalyzerProvider

Shouldn't that be enough both for index and search?

I would have thought so. But it doesn't appear to be applied at search
time. Are you searching against a specific field, or against _all?

If it works against a specific field, but not against _all, then perhaps
there is a bug.

A complete curl recreation would be useful

clint

Clinton,

I'm searching against "_all", which is the default.

I get consistent results (even the lack of results) when adding a specific
field or specific analyzer:

{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "*phon",
      "default_field": "name",
      "analyzer": "default"
    }
  }
}

So I guess it's not a bug but, as explained in my previous email, a
consequence of the Spanish analyzer creating the token "iphon" for iPhone,
so that no matter how I search, it will never match "*phone", right?

Regards.


Hi Enrique

I've just remembered your original question, which was:

"*phone"

vs
"phone"

As I understand it, the way this wildcard search works is that Lucene
looks up all matching terms, and searches against each of these.

So for some reason, "*phone" doesn't find the the right term, but
"phone" does.

I get consistent results (even the lack of results) when adding a
specific field or specific analyzer:

You mean, you see the same thing?

So I guess it's not a bug, but as explained in my previous email, the
fact that the Spanish analyzer created a token = "iphon" for iPhone so
no matter how I search, it will never match "*phone", right?

No. This should work. For instance, using the default analyzer, if you
index "The Quick BROWN fox" you end up with the terms
"quick","brown","fox"

If you then search for "The Quick BROWN fox", it performs the same
analysis, resulting in the same terms, and searches for those.

So to me (and I'm ignorant of the Lucene internals) it sounds like a
potential bug in the lucene query parser syntax.

A complete recreation would be very useful for debugging.

clint
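The behaviour Clinton describes (a wildcard query expands against the raw term dictionary, without the pattern itself being analyzed) can be modelled with a toy sketch; the term list is hypothetical, and fnmatch stands in for Lucene's wildcard matching:

```python
import fnmatch

# Hypothetical term dictionary for a field indexed with the Spanish
# snowball analyzer: "iPhone 4" was stored as the terms "iphon" and "4".
index_terms = {"iphon", "4", "ipad", "telefon"}

def expand_wildcard(pattern, terms):
    """Mimic a Lucene wildcard query: collect every indexed term that
    matches the pattern; the pattern itself is never analyzed."""
    return sorted(t for t in terms if fnmatch.fnmatchcase(t, pattern))

print(expand_wildcard("*phone", index_terms))  # [] -- no term ends in "phone"
print(expand_wildcard("*phon", index_terms))   # ['iphon']
```

So "*phone" finds no term to expand to, while a plain "phone" query is analyzed at search time into "phon" by the same stemmer and matches.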

I don't know why your pattern search is not working, but in this
analyzer you're ASCII-folding the terms before you remove the stop
words, and your stop words contain non-ASCII letters. You should
either ASCII-fold your stop words, or remove them before the
ASCII-folding step.

Cheers,
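Joaquín's point can be illustrated with a minimal sketch (unicodedata stands in for Lucene's ASCII-folding filter, and the two stop words are hypothetical):

```python
import unicodedata

def ascii_fold(token):
    """Stand-in for Lucene's ASCII-folding filter."""
    return (unicodedata.normalize("NFKD", token)
            .encode("ascii", "ignore").decode("ascii"))

# Hypothetical Spanish stop list containing non-ASCII letters:
stopwords = {"también", "él"}

folded = ascii_fold("también")           # 'tambien'
print(folded in stopwords)               # False: folding first breaks the match
print(folded in {ascii_fold(w) for w in stopwords})  # True once the list is folded too
```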

On Wed, Mar 16, 2011 at 12:03 PM, Enrique Medina Montenegro
e.medina.m@gmail.com wrote:


--
Joaquin Cuenca Abela -- presspeople.com: Fuentes de prensa y comunicados

Nice catch, Joaquín.

I'll fix it and try to recreate the issue for Clinton.

On Wed, Mar 16, 2011 at 12:52 PM, Joaquin Cuenca Abela <
joaquin@cuencaabela.com> wrote:


I think I found the issue without having to do a full recreation...

If I search using this:

{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "iphone"
    }
  }
}

then it works as expected: I do get results containing the word
"iPhone".

However, if I use:

{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "*phone"
    }
  }
}

then I don't get them. It seems that when the query contains a wildcard,
it is not analyzed the way it should be:

http://localhost:9200/mytest/_analyze?text=*phone

{"tokens":[{"token":"phon","start_offset":0,"end_offset":5,"type":"","position":1}]}

Therefore the wildcard is lost when tokenizing it and the search
doesn't return any results, as "iPhone" doesn't match the token
"phon".

Does this make sense now?


Which makes me think: is the '*' actually acting as a wildcard in the
query, or is it interpreted by Lucene as just another character to be
analyzed, as explained in my previous email, therefore losing all the
wildcard information for the search?


Hi Enrique

On Wed, 2011-03-16 at 14:38 +0100, Enrique Medina Montenegro wrote:

I think I found the issue without having to do a full recreation...

The reason I keep asking for a complete recreation is so that Shay has
got a test case to figure out where the bug is. The easier you make
things for him, the more likely your bug will get attended to.

then I don't get them. It seems that when you specify a wildcard in
the query, it's not being properly analyzed like it should:

Yes, I agree.

http://localhost:9200/mytest/_analyze?text=*phone

{"tokens":[{"token":"phon","start_offset":0,"end_offset":5,"type":"","position":1}]}

Therefore the wildcard is lost when tokenizing it and the search
doesn't return any results, as "iPhone" doesn't match the token
"phon".

Not quite - the analyze API is just one part of this. What you're not
seeing is the lucene query parser in action. That's where I think the
bug is.

I suggest that you gist a complete recreation and post an issue to
Issues · elastic/elasticsearch · GitHub
ta

clint

Yes, I will post the recreation right after this email.

I did some more testing with the wildcard, and it seems that wildcards do
not match blank spaces, so if you specify "iphone" it will not match a
name of "iPhone 4", but something like "iPad/iPhone 4/iPod".

Is this the expected behaviour or am I missing something?


On Wed, 2011-03-16 at 14:58 +0100, Enrique Medina Montenegro wrote:

Yes, I will post the recreation right after this email.

thanks :slight_smile:

I did some more testing with the wildcard, and it seems that wildcards
do not match blank spaces, so if you specify "iphone" it will not
match a name of "iPhone 4", but something like "iPad/iPhone 4/iPod".

Is this the expected behaviour or am I missing something?

This is correct - so I was wrong in saying that * is equivalent to % in
SQL. It works only on a per-word basis.

Also searching for '"ipho*"' (ie in double quotes) would not work, as
the * would be interpreted literally, rather than as a wildcard.

clint
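The per-word behaviour Clinton describes, versus SQL's whole-string %, can be sketched with fnmatch standing in for wildcard matching (the field value and its terms are hypothetical):

```python
import fnmatch

# Field value and its standard-analyzed terms:
name = "iPhone 4"
tokens = ["iphone", "4"]

# SQL LIKE '%iphone%' matches against the whole string, spaces and all:
print(fnmatch.fnmatchcase(name.lower(), "*iphone*"))              # True

# A Lucene wildcard is matched against individual terms, so no pattern
# spanning both words can ever match:
print(any(fnmatch.fnmatchcase(t, "*iphone*4*") for t in tokens))  # False
print(any(fnmatch.fnmatchcase(t, "*phone*") for t in tokens))     # True
```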

Then it's definitely clear that it's not a bug, but a side effect of my
Spanish analyzer tokenizing "iPhone" as "iphon", therefore matching
neither "phone" (the token is different) nor "*phone" (the wildcard takes
the word as a term, not its token).

I wonder if there's any sort of query in Lucene that acts like the SQL
'%' wildcard, so that when I specify "*phone", instead of matching terms
that end in the literal "phone", it would first tokenize the literal into
"phon" and then perform the search, which would definitely match my
"iPhone"...

Maybe the solution is to tokenize the words entered by the user before
applying the wildcard, and then pass the tokenized version to the query.
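That suggestion can be sketched client-side like this (toy_spanish_analyze is a hypothetical stand-in; a real client would call the _analyze API so that the query shares the index's analysis chain):

```python
def toy_spanish_analyze(text):
    """Stand-in for the Spanish snowball analyzer (real stemming is far
    more involved); here we only lowercase and strip a trailing 'e'."""
    token = text.lower()
    return token[:-1] if token.endswith("e") else token  # 'iphone' -> 'iphon'

def build_wildcard_query(user_term):
    # Analyze the user's term first, then wrap it in wildcards, so the
    # pattern is expressed in the indexed tokens' vocabulary.
    token = toy_spanish_analyze(user_term)
    return {"query": {"query_string": {"query": "*%s*" % token}}}

print(build_wildcard_query("iPhone")["query"]["query_string"]["query"])  # *iphon*
```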


OK, I'm tired of trying to convince you that this is a bug. So I've
opened the issue for you, with a recreation:

WIldcard not working with snowball stemmer · Issue #784 · elastic/elasticsearch · GitHub

clint

I was already working on the recreation, but since you already did it,
it's done.

All my confusion is about the expected behaviour of wildcard queries.
Say my user wants to search for "iPhone" and I run a wildcard query for
it:

  1. As it works now, the search term is not analyzed; "iPhone" is used
    as the term itself, so documents indexed under the token "iphon" are
    never found.

  2. As I expected it to work, "iPhone" is first analyzed into "iphon",
    the search is executed against that token, and the "iPhone" results
    are returned.

So if the current behaviour is 1), it's not a bug, just a
misunderstanding on my side. If it should be 2), then there's definitely
a bug.

Looking forward to Shay's feedback on it.
clint