Query string with wildcards not working as (I) expect


(Hakan Lindestaf) #1

Hi,

I have some documents that have a field like this:
trackingid: Api23-82199996

I would like to query on this, but only on the Api23 part. If possible
I want to ignore case (to pick up both api23 and Api23). I tried to
query this using trackingid:api23* and trackingid:Api23* but no results
are returned. If I try trackingid:Api23-82199996 I get results, but
only for a full match of course. I realize there is something I'm
missing, but if anyone can help me understand or come up with a
workaround I'd appreciate it.

Here's a link to a ticket I opened for the UI, figured I'd start
there: https://logstash.jira.com/browse/LOGSTASH-235

Thanks,
/Hakan


(Shay Banon) #2

trackingId is probably analyzed, so it gets broken down into several terms.
You can check using this:

Create a sample index:

curl -XPUT localhost:9200/test

See how the text for trackingId gets analyzed using the default (standard)
analyzer:

curl -XGET localhost:9200/test/_analyze -d 'Api23-82199996'

You can see that the text Api23-82199996 gets broken down into two terms,
Api23, and 82199996 that get indexed. If you want to treat it as a single
term, you need to define in a mapping that trackingId is not analyzed.
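
A rough Python sketch of the difference described above, for illustration only. It models just the split at the hyphen; the real standard analyzer applies more rules than this, and both function names are made up for the example.

```python
import re


def split_into_terms(text):
    """Toy stand-in for an analyzed field: split on runs of non-alphanumeric characters."""
    return [term for term in re.split(r"[^0-9A-Za-z]+", text) if term]


def keep_as_single_term(text):
    """Toy stand-in for a field that is not analyzed: the whole value is one term."""
    return [text]


analyzed_terms = split_into_terms("Api23-82199996")
single_term = keep_as_single_term("Api23-82199996")

print(analyzed_terms)
print(single_term)
```

With the field kept as a single term, a prefix such as Api23 can only match against the whole stored value, hyphen and digits included.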


(Hakan Lindestaf) #3

Shay,

Thanks a lot, you were right, it was analyzed. However, I changed it (and killed my indices), checked my metadata and it's not analyzed, but if the content in the field is Api23 (vs api23) then the wildcard query doesn't work. What am I missing? I tried both upper and lower case search queries, but it seems to depend on the content in the document, which is weird to me.

Thanks,
/Hakan


(Jamshid) #4

So you're using the "keyword" analyzer now? You probably have to set
lowercase_expanded_terms=false.

http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query.html
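
As a sketch of what such a request body might look like, assuming the query_string parameter named above and the field and pattern from the original question. The parameter belongs to the Elasticsearch versions of this thread's era and newer releases may not accept it; the body is built as a plain Python dict only to show its shape.

```python
import json

# Hypothetical search body: the field name and pattern come from the question,
# and lowercase_expanded_terms is the parameter mentioned above.
query_body = {
    "query": {
        "query_string": {
            "query": "trackingid:Api23*",
            "lowercase_expanded_terms": False,
        }
    }
}

serialized = json.dumps(query_body, indent=2)
print(serialized)
```

This only helps when the caller controls the request; if a UI builds the query, the flag may not be reachable.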

--Jamshid


(Hakan Lindestaf) #5

The problem is that I'm using Logstash as the UI (and I've been told it's using the Java API client to access ES). So I can't see what the real search parameter is unfortunately.
However when I do searches I can guess what it does.
If my data looks like this:
api12-xxxxyyyy

then any of these searches bring back the same result:
api12*
Api12*
API12*

However if the data looks like this:
Api12-xxxxyyyy

then none of the combinations above bring back any results (only the full exact match works).

I also verified this with other (not_analyzed) fields. If the content has upper case characters then the wildcard search doesn't seem to work.

/Hakan


(David Pilato) #6

You should use a keyword analyzer with lowercase filter.
Define your own analyzer (keylowercase) and apply it to your field.

Then, when the user enters a search term, lowercase it.

That's the way I do it.
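
A minimal sketch of that setup, assuming a custom analyzer built from the keyword tokenizer and the lowercase token filter. The name keylowercase is the one suggested above; the helper normalize_user_term is hypothetical. The field mapping that points trackingid at this analyzer is left out because its syntax depends on the Elasticsearch version.

```python
import json

# Index settings defining the custom analyzer: the keyword tokenizer keeps the
# whole value as one term, and the lowercase filter then lowercases that term.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "keylowercase": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}


def normalize_user_term(term):
    """Lowercase what the user typed so it lines up with the lowercased index terms."""
    return term.lower()


print(json.dumps(index_settings, indent=2))
print(normalize_user_term("Api23*"))
```

The settings would be sent when the index is created, so existing documents need to be reindexed to pick up the new analysis.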

HTH
David ;-)


(Hakan Lindestaf) #7

Ahhh, now I understand. That makes sense and it solved my problem. I think the search query was automatically (by the Logstash UI) made lower case, so with the default analyzer it didn't pick up the lower case search terms. With this change it all works! Thanks a lot!

/Hakan


(Shay Banon) #8

I think logstash uses the query_string query to query elasticsearch.
Wildcard / prefix queries are automatically lowercased (since they are
not analyzed, Lucene tries its "best" to do some sort of common analysis,
which is lowercasing the term). I think you solved your problem by mapping
the field as keyword plus lowercase, which is the best way to solve it.
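
The behaviour reported earlier in the thread can be reproduced with a small toy model in Python. This is not Lucene code: both functions are hypothetical and only mimic the two facts stated above, namely that the prefix text is lowercased and that a single-term field stores its value either as-is or lowercased.

```python
def index_term(value, lowercase_at_index_time):
    """Toy model of a single-term field: one term, optionally lowercased when indexed."""
    return value.lower() if lowercase_at_index_time else value


def prefix_query_matches(indexed_term, user_query):
    """Toy model of a prefix query whose text is lowercased before matching."""
    prefix = user_query.rstrip("*").lower()
    return indexed_term.startswith(prefix)


queries = ["api12*", "Api12*", "API12*"]

# Stored unchanged (field not analyzed) versus stored through keyword + lowercase.
raw_term = index_term("Api12-xxxxyyyy", lowercase_at_index_time=False)
lowered_term = index_term("Api12-xxxxyyyy", lowercase_at_index_time=True)

results_raw = [prefix_query_matches(raw_term, q) for q in queries]
results_lowered = [prefix_query_matches(lowered_term, q) for q in queries]

print(results_raw)
print(results_lowered)
```

In this model every query fails against the term stored with its original casing and succeeds once the term is lowercased at index time, which is the pattern observed with Api12-xxxxyyyy.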


(Anil AR) #9

Hi Kimchy,
I also have a similar issue. For instance, we have values like "City of God" and "God". If I search for "g*", I should get "God" only. Please advise.

