Wildcard and slashes


(chimingc) #1

I've indexed this: "24/account".
I understand that it's been tokenized into "24" and "account", which is not a problem for me.

However, when I query "24/a*", it finds no match.

Then I tried the following cases:

  1. This works
    "query_string": {
        "analyze_wildcard": true,
        "query": "24/a*"
    }

  2. This works
    "query_string": {
        "query": "24/account"
    }

  3. This works
    "query_string": {
        "query": "24 / a*"
    }

  4. This works
    "query_string": {
        "query": "a*"
    }

  5. This doesn't work
    "query_string": {
        "query": "24/a*"
    }

I can't explain why 5 doesn't work. Perhaps without setting analyze_wildcard to true, elasticsearch simply removes the slash and searches for "24a*"?

What exactly does analyze_wildcard do when set to true?
As you can see 3 and 4 work without setting analyze_wildcard to true. So when do we need to set it to true?

Thanks,
jimmy


(Igor Motov) #2

Assuming that you are using the standard analyzer, this is what these 5 queries
are translated into at the Lucene level:

1: _all:24* - prefix query for terms that start with "24"
2: _all:24 _all:account - query for the term "24" or the term "account"
3: _all:24 _all:a* - query for the term "24" or prefix query for terms
that start with "a"
4: _all:a* - prefix query for terms that start with "a"
5: _all:24/a* - prefix query for terms that start with "24/a"

Cases 2-4 are obvious, but cases 1 and 5 probably require some
explanation. By default, wildcard terms are not analyzed. This is why the 5th
case is translated into a prefix query with the prefix "24/a". As you
correctly noticed, "24/account" is indexed as two tokens, "24" and
"account". So there are no tokens in the index that start with "24/a", and
therefore the 5th case doesn't return any results. In case 1, wildcard
terms are analyzed, and "24/a" is translated into two tokens, "24"
and "a". The token "a" is a stopword, so it is dropped, and the
query becomes a prefix query for terms that start with "24".
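
To make the explanation above concrete, here is a minimal Python sketch. It is not Elasticsearch code: `analyze` is only a rough stand-in for the standard analyzer (split on non-alphanumeric characters, lowercase, drop a few English stopwords), and the stopword list and function names are assumptions made for illustration.

```python
import re

# Tiny illustrative stopword subset (assumption, not the real analyzer's list).
STOPWORDS = {"a", "an", "and", "the", "of", "to"}


def analyze(text):
    """Rough stand-in for the standard analyzer."""
    tokens = [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    return [t for t in tokens if t not in STOPWORDS]


def prefix_matches(prefix, indexed_terms):
    """Return the indexed terms that start with the given prefix."""
    return [t for t in indexed_terms if t.startswith(prefix)]


# Index time: "24/account" is analyzed into two separate terms.
indexed = analyze("24/account")

# Case 5: the wildcard term is NOT analyzed, so the raw prefix "24/a" is used.
case5 = prefix_matches("24/a", indexed)

# Case 1: the wildcard term IS analyzed; "a" is dropped as a stopword and
# only "24" survives to be used as the prefix.
analyzed_prefixes = analyze("24/a")
case1 = prefix_matches(analyzed_prefixes[0], indexed)

print(indexed, case5, case1)
```

In this sketch case 5 finds nothing because no indexed term contains a slash, while case 1 matches the term "24" only, which is why it "works" without really searching for "account".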



(Gregory Rice) #3

Igor and friends,

I've got a similar situation involving special characters in a string.
I've instructed ElasticSearch to index a field containing the
string "frag-mpm" with the following analyzer:

index :
  analysis :
    analyzer :
      string_lowercase:
        tokenizer: keyword
        filter: lowercase

using the following mapping:

{
  "clientlog" : {
    "_analyzer" : {
      "path" : "analyzer"
    },
    "_source" : {
      "enabled" : true,
      "compress" : true
    },
    "properties" : {
      "analyzer" : {
        "type" : "string",
        "index" : "no"
      },
      "@fields" : {
        "dynamic" : "true",
        "type" : "object"
      },
      "@timestamp" : {
        "format" : "dateOptionalTime",
        "type" : "date"
      },
      "@message" : {
        "type" : "string",
        "analyzer" : "string_lowercase"
      },
      "@source" : {
        "type" : "string"
      },
      "@type" : {
        "type" : "string"
      },
      "@tags" : {
        "type" : "string"
      },
      "@source_host" : {
        "type" : "string"
      },
      "@source_path" : {
        "type" : "string"
      }
    }
  }
}

I'm still not seeing any search results for the whole string, "frag-mpm".
I see results that contain the string "frag", but is Lucene
itself splitting it, even though the analyzer is indexing it with a
keyword tokenizer?

What am I configuring wrong?

Thanks,
Greg Rice
MobiTV



(David Pilato) #4

Hi there,

It could depend on how you query.
If you want to find "frag-mpm", you have to query the specific field
containing this value.
By default, you search in _all, which has its own default analyzer, so the
value is broken into tokens there.
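
As a sketch of what querying the specific field could look like, the Python snippet below builds a search body that points query_string at "@message" instead of _all. The helper name `field_query` is made up for illustration, and quoting the value is just one way to keep the query parser from treating it as several pieces.

```python
import json


def field_query(field, value):
    """Build a search body that targets one field instead of _all."""
    return {
        "query": {
            "query_string": {
                # default_field replaces the implicit _all field.
                "default_field": field,
                # Quote the value so it is handled as a single unit.
                "query": '"%s"' % value,
            }
        }
    }


body = field_query("@message", "frag-mpm")
print(json.dumps(body, indent=2))
```

Keep in mind that with a keyword tokenizer the entire field value is indexed as one term, so this would only match documents whose whole "@message" is "frag-mpm" (ignoring case).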

HTH
David



(chimingc) #5

Igor,

Thanks for the response. Really helpful.
I have a few more questions though.

Neither query 3 nor 5 is being analyzed, so why does 3 get broken down into 2 tokens but 5 doesn't?

Also, how did you get the translated queries? Is there any way to query Elasticsearch to get them? I think it would be very helpful to know what the query strings eventually become.

Thanks again,
jimmy



(Igor Motov) #6

Query 3 gets broken into separate clauses by the query parser, which splits
the query string on whitespace before any analysis takes place.

I am not aware of any simple way to get the translated queries. When I need
to figure out what's actually going on with my queries, I just start
Elasticsearch under a debugger, place a breakpoint
here https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/query/QueryPhase.java#L176
and execute my search. The query variable there points to the actual
Lucene query.
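
The difference between queries 3 and 5 can be sketched in a few lines of Python. This is a toy model, not the real Lucene query parser: it assumes clauses are separated by whitespace, that a chunk ending in "*" becomes an unanalyzed prefix clause, and that everything else goes through a simplified analyzer. All names are hypothetical.

```python
import re


def analyze(text):
    """Simplified analyzer: split on non-alphanumerics and lowercase."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def parse(query_string):
    """Toy query parser: split on whitespace first, then handle each chunk."""
    clauses = []
    for chunk in query_string.split():
        if chunk.endswith("*"):
            # Wildcard chunks are kept as-is (not analyzed by default).
            clauses.append(("prefix", chunk[:-1]))
        else:
            # Plain chunks are analyzed; "/" alone yields no tokens at all.
            for token in analyze(chunk):
                clauses.append(("term", token))
    return clauses


query3 = parse("24 / a*")
query5 = parse("24/a*")
print(query3)
print(query5)
```

With spaces around the slash, the parser sees three chunks and ends up with a term clause for "24" plus a prefix clause for "a". Without spaces there is a single wildcard chunk, so the prefix stays "24/a".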



(Gregory Rice) #7

Igor and David,

Thanks a ton for the info. One question:

Is there any way to specify which special characters are used for
tokenization? Like, is there an easy way to say "Break on slashes, but
not dashes", or do I need to make my own tokenizer to do that?

Thanks,
Greg Rice



(David Pilato) #8

Perhaps this one : http://www.elasticsearch.org/guide/reference/index-modules/analysis/pattern-tokenizer.html
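
For "break on slashes, but not dashes", a pattern tokenizer with the pattern "/" might be enough. The Python sketch below first emulates that splitting with `re.split`, then shows index settings that would define such a tokenizer. The names `slash_tokenizer` and `slash_lowercase` are made up for illustration, and the settings are an untested assumption about how you would wire it up.

```python
import json
import re


def pattern_tokenize(text, pattern="/"):
    """Emulate a pattern tokenizer: the pattern marks token separators."""
    return [t for t in re.split(pattern, text) if t]


tokens = pattern_tokenize("24/frag-mpm")

# Hypothetical index settings using the pattern tokenizer.
settings = {
    "analysis": {
        "tokenizer": {
            "slash_tokenizer": {"type": "pattern", "pattern": "/"}
        },
        "analyzer": {
            "slash_lowercase": {
                "type": "custom",
                "tokenizer": "slash_tokenizer",
                "filter": ["lowercase"],
            }
        },
    }
}

print(tokens)
print(json.dumps(settings, indent=2))
```

Here "frag-mpm" stays in one piece because the dash is not part of the separator pattern.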

HTH
David :wink:
Twitter : @dadoonet / @elasticsearchfr



(Igor Motov) #9

A better way to get translated queries is coming in 0.19.2 and 0.20.0. See https://github.com/elasticsearch/elasticsearch/pull/1811
for details.
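
Later Elasticsearch releases expose an `explain` flag on the validate query API, which returns the rewritten Lucene query as text. Whether that is exactly what the pull request above added is an assumption on my part. The Python sketch below only builds the request URL and body without sending anything; the host, the index name `myindex`, and the helper name are placeholders, and the expected body shape has differed between versions.

```python
import json
from urllib.parse import urlencode


def validate_request(index, query_text, host="http://localhost:9200"):
    """Build (url, body) for a validate-query call with explain enabled."""
    params = urlencode({"explain": "true"})
    url = "%s/%s/_validate/query?%s" % (host, index, params)
    body = json.dumps({"query": {"query_string": {"query": query_text}}})
    return url, body


# Placeholder index name; nothing is sent over the network here.
url, body = validate_request("myindex", "24/a*")
print(url)
print(body)
```

Sending that body to the URL with any HTTP client should return an explanation string for the query, which can be compared against the translations listed earlier in the thread.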



(chimingc) #10

Thanks, good to know.

From: "Igor Motov-3 [via ElasticSearch Users]" <ml-node+s115913n3853948h92@n3.nabble.com>
Date: Sat, 24 Mar 2012 10:24:42 -0500
To: Jimmy Chen <jchen@sugarcrm.com>
Subject: Re: wildcard and slashes

A better way to get translated queries is coming in 0.19.2 and 0.20.0. See https://github.com/elasticsearch/elasticsearch/pull/1811 for details.
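
The pull request itself is not reproduced here. As an assumption, the sketch below uses the validate query endpoint with its `explain` parameter as found in later Elasticsearch releases; the request body shape may differ in the versions mentioned above. The host and index name are hypothetical, and the function only builds the request without sending it.

```python
import json
from urllib.parse import urlencode


def build_validate_request(host, index, query_string):
    """Build (url, body) for a validate-query call with explain enabled.

    Nothing is sent; host and index are placeholders supplied by the caller.
    """
    url = f"{host}/{index}/_validate/query?" + urlencode({"explain": "true"})
    body = json.dumps({"query": {"query_string": {"query": query_string}}})
    return url, body


# Hypothetical host and index name, for illustration only.
url, body = build_validate_request("http://localhost:9200", "myindex", "24/a*")
```

Sending `body` to `url` with any HTTP client should return an explanation of the query that Elasticsearch actually built, assuming the endpoint is available in the version in use.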

On Thursday, March 22, 2012 3:52:32 PM UTC-4, Igor Motov wrote:
Query 3 gets broken into separate queries by the query parser, which splits on the whitespace before any analysis happens.
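
A rough sketch of why queries 3 and 5 differ, under the assumption that the query parser splits the query string on whitespace before any term is analyzed. `toy_parse` is a hypothetical helper written for this illustration, not the real parser.

```python
import re


def toy_parse(query):
    """Toy model: split on whitespace, then treat each piece as its own clause.

    Pieces ending in "*" stay unanalyzed prefix clauses; other pieces are
    split on non-alphanumeric characters and may vanish entirely.
    """
    clauses = []
    for piece in query.split():
        if piece.endswith("*"):
            clauses.append(("prefix", piece[:-1]))
        else:
            for token in re.split(r"[^0-9A-Za-z]+", piece):
                if token:
                    clauses.append(("term", token.lower()))
    return clauses


query3 = toy_parse("24 / a*")  # three whitespace-separated pieces
query5 = toy_parse("24/a*")    # a single piece, kept as one prefix
```

In this model query 3 yields a term clause for "24" and a prefix clause for "a" (the lone slash analyzes to nothing), while query 5 stays a single prefix clause for "24/a".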

I am not aware of any simple way to get the translated queries. When I need to figure out what's actually going on with my queries, I start Elasticsearch under a debugger, place a breakpoint at https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/query/QueryPhase.java#L176 and execute my search. The query variable there points to the actual Lucene query.

On Thursday, March 22, 2012 2:02:31 PM UTC-4, chimingc wrote:
Igor,

Thanks for the response. Really helpful.
I have a few more questions though.

Neither query 3 nor 5 is being analyzed, so why does 3 get broken down into 2 tokens but 5 doesn't?

Also, how did you get the translated queries? Is there any way to query Elasticsearch to get them? I think it's very helpful to know what the query strings eventually become.

Thanks again,
jimmy


(system) #11