Strange behaviour of QueryString in JavaAPI


(Sang Dang) #1

Hi,
I am using QueryString in the Java API and it behaves differently from the
same query string sent over REST.

Here are the steps to reproduce it.

First, add asciifolding to the filters of the default analyzer:

analyzer:
  default:
    tokenizer: standard
    filter: [asciifolding, lowercase]

Create your index and index a document containing accented words, e.g.: không có gì

Search in the head plugin with: không -> you get your document "không có gì"
Search in the Java API with: không -> you get nothing
Search in the Java API with: khong -> you get your document

At first I thought my index was not using the asciifolding and lowercase
filters, so I tested it like this:
http://127.0.0.1:9200/myindex/_analyze?text=không%20có%20gì
Result:

{"tokens":[{"token":"khong","start_offset":0,"end_offset":5,"type":"","position":1},{"token":"co","start_offset":6,"end_offset":8,"type":"","position":2},{"token":"gi","start_offset":9,"end_offset":11,"type":"","position":3}]}

So there shouldn't be a problem with the filters.

For now I work around it by applying ASCII folding and lowercasing myself with Lucene's ASCIIFoldingFilter, but I would really like to know what is happening.
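
For readers who want to try the client-side workaround without pulling in Lucene, here is a minimal sketch using only the JDK. It is an approximation, not Lucene's ASCIIFoldingFilter: it strips combining marks after Unicode decomposition and special-cases the Vietnamese letter đ, which has no decomposition. The class and method names (FoldQuery, fold) are made up for this illustration.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: approximates an asciifolding + lowercase chain for
// Latin text with diacritics. It is NOT Lucene's ASCIIFoldingFilter.
class FoldQuery {
    static String fold(String text) {
        // NFD splits an accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop the combining marks, then map the letters NFD cannot decompose.
        String stripped = decomposed.replaceAll("\\p{M}+", "")
                .replace('\u0111', 'd')   // small d with stroke
                .replace('\u0110', 'D');  // capital D with stroke
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Escapes spell out the example text from this post with a capital K.
        System.out.println(fold("Kh\u00f4ng c\u00f3 g\u00ec")); // prints "khong co gi"
    }
}
```

The folded string can then be passed to the query builder in place of the raw user input.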

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/905311c4-dbf8-4e84-a1dc-b09ee3aec0bd%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

Could you gist your java code?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 16 Dec 2013 at 04:15, kidkid zkidkid@gmail.com wrote:



(Sang Dang) #3

Hi David,

I have figured out the problem.

Assume the setup I described above.
Now I try:
String query = "khô*";
QueryStringQueryBuilder queryString =
    QueryBuilders.queryString(query).defaultField("myfield");
I would expect to get "không có gì", but it actually returns nothing.

Once I set analyzeWildcard(true) on the builder, it works fine.

My question is about the case where I don't set analyzeWildcard(true):
If I search kho*, it returns the "không có gì" document.
But if I search khô*, it returns nothing.

Is that reasonable?

On Sunday, December 15, 2013 8:54:53 PM UTC-8, David Pilato wrote:

Could you gist your java code?



(David Pilato) #4

I guess it is: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

analyze_wildcard

By default, wildcards terms in a query string are not analyzed. By setting this value to true, a best effort will be made to analyze those as well.

So when searching for "khô*", the unanalyzed prefix "khô" is compared with the inverted index term "khong", which starts with "kho", not "khô". It does not match.
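
The mismatch can be reproduced outside Elasticsearch with a toy model. The sketch below is only an illustration under simplifying assumptions: the "index" is a plain list holding the tokens reported by _analyze in the first post, analyze() is a JDK-only stand-in for the asciifolding + lowercase chain, and the analyzeWildcard flag merely mimics the idea of the real option. All names are hypothetical and other query_string settings are ignored.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model of a prefix (wildcard) lookup against already-folded index terms.
class WildcardDemo {
    // Terms as stored in the inverted index for the example document.
    static final List<String> INDEX_TERMS = Arrays.asList("khong", "co", "gi");

    // Simplified stand-in for the asciifolding + lowercase filters.
    static String analyze(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")
                .toLowerCase(Locale.ROOT);
    }

    // Returns the index terms that start with the prefix of a "prefix*" query.
    static List<String> prefixSearch(String prefix, boolean analyzeWildcard) {
        String effective = analyzeWildcard ? analyze(prefix) : prefix;
        List<String> hits = new ArrayList<>();
        for (String term : INDEX_TERMS) {
            if (term.startsWith(effective)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String accented = "kh\u00f4"; // the accented prefix from the question
        System.out.println(prefixSearch("kho", false));    // prints [khong]
        System.out.println(prefixSearch(accented, false)); // prints []
        System.out.println(prefixSearch(accented, true));  // prints [khong]
    }
}
```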

BTW I think you should consider using MatchQuery instead of QueryStringQuery: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

On 17 December 2013 at 09:24:46, kidkid (zkidkid@gmail.com) wrote:



(Sang Dang) #5

Hi David,
Thanks for your suggestion, but Match Query doesn't support wildcards.
If I search khong it matches "khong co gi", but if I search "kho"
it does not match.
So I went back to using query string.
I only use a wildcard when the query has a single word, so the
performance is fine.
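
A rough sketch of that single-word rule, using only the JDK. The class and method names are hypothetical, and the sketch does not escape query_string special characters, which real user input would need.

```java
// Hypothetical helper: append a trailing wildcard only to single-word input.
class QueryText {
    static String toQueryString(String userInput) {
        String trimmed = userInput.trim();
        // Exactly one whitespace-separated token counts as a single word.
        boolean singleWord = !trimmed.isEmpty() && trimmed.split("\\s+").length == 1;
        return singleWord ? trimmed + "*" : trimmed;
    }

    public static void main(String[] args) {
        System.out.println(toQueryString("kho"));      // prints kho*
        System.out.println(toQueryString("khong co")); // prints khong co
    }
}
```

The result would then be passed to QueryBuilders.queryString(...) together with analyzeWildcard(true), as discussed earlier in the thread.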

On Tuesday, December 17, 2013 12:30:29 AM UTC-8, David Pilato wrote:


