Analyzers comparison

Hi,

I am trying to compare different analyzers to see which ones we should use. But my simple comparison shows me no difference between standard and keyword analyzers.
I wonder what I am missing.

This is the simple test I have.

I have two fields using different analyzers (keyword vs standard):
curl -XPOST localhost:9200/test -d '{
  "mappings" : {
    "accounts" : {
      "properties" : {
        "keyword_field" : { "type" : "string", "analyzer" : "keyword" },
        "standard_field" : { "type" : "string", "analyzer" : "standard" }
      }
    }
  }
}'

Then I have two records:
curl -XPUT 'http://localhost:9200/test/accounts/1' -d '{
  "keyword_field" : "john doe"
}'
curl -XPUT 'http://localhost:9200/test/accounts/2' -d '{
  "standard_field" : "john doe"
}'

Finally, I use this query to get my results:
curl "localhost:9200/test/accounts/_search?pretty=true" -d '{
  "query" : {
    "query_string" : {
      "analyzer" : "keyword", // also tried "standard"
      "query" : "john doe" // also tried "john" or "doe"
    }
  }
}'

In the query_string above, I have tried both keyword and standard.
However, I always get both documents back with the same score, no matter which analyzer I set in the query_string.
I also tried replacing the query "john doe" with only "john" or only "doe", and it still returns both documents regardless of the analyzer in my query_string.

My understanding is that the keyword analyzer does not tokenize while the standard analyzer does. So I'd expect different search results.

If this is not a correct test to see the difference between the two, what test can I use to see some differences so I know which one to use in my app?

Thanks.

Elasticsearch has an analyze API that you can use to see what the
various analyzers do: see http://www.elasticsearch.org/guide/reference/api/admin-indices-analyze.html.
For your example:

curl -XGET 'localhost:9200/test/_analyze?pretty=true&text=john+doe&analyzer=standard'

{
  "tokens" : [ {
    "token" : "john",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "doe",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

curl -XGET 'localhost:9200/test/_analyze?pretty=true&text=john+doe&analyzer=keyword'

{
  "tokens" : [ {
    "token" : "john doe",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  } ]
}

Hi,

See
http://www.elasticsearch.org/guide/reference/api/admin-indices-analyze.html
You can use it to test how an analyzer will tokenize your data.

It will help to understand what's going on.

HTH
David.


--
David Pilato
http://dev.david.pilato.fr/
Twitter : @dadoonet

Too late!
Dan was first! ;)

On 1 February 2012 at 23:36, "david@pilato.fr" david@pilato.fr wrote:

Hi,

See
http://www.elasticsearch.org/guide/reference/api/admin-indices-analyze.html
You can use it to test how an analyzer will tokenize your data.

It will help to understand what's going on.

HTH
David.


Thanks guys.
I am aware of the analyzers API and I do see that they tokenize strings differently.

What puzzles me is the search results: I couldn't tell the two analyzers apart from them.
Perhaps my query_string was not correct, but I couldn't find any other way to pass an analyzer in the query_string.

I just tried some query that is not supposed to work:
curl "localhost:9200/test/accounts/_search?pretty=true" -d '{
  "query" : {
    "query_string" : {
      "analyzer" : "xyz",
      "query" : "john and his friend jane"
    }
  }
}'

And that still returns me both records.
As you can see "xyz" is not even a valid analyzer.

Is it possible that the analyzer in query_string is totally ignored?

From http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query.html it seems analyzer is a valid parameter, though.


Please note that I am using elasticsearch-0.18.6, which should already include this bug fix:

Hi,

Here is what happens. First, you use a different analyzer on different fields in the json, but then you use a query_string which by default searches on the _all field (which has its own analyzer setting).

The query_string syntax splits the provided text on whitespace, regardless of the analyzer provided, and then applies the analyzer to each resulting piece. This is done in order to parse the operators and query language it supports. You can wrap the text you want to search for in \" and it will then be treated as a phrase, which is analyzed as a whole using the provided analyzer.
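
A minimal sketch of the phrase form described above, assuming the index, type and field names from the original post and a node at localhost:9200. Pointing default_field at a single field (instead of the default _all) is an assumption on my side about how to make the per-field difference visible; I would expect the quoted phrase to match only document 1, but this has not been verified on 0.18.6.

```shell
# Assumption: index "test" and type "accounts" exist as in the original post.
# The inner quotes are escaped so that query_string sees "john doe" as one phrase
# instead of splitting it on whitespace first.
body='{
  "query" : {
    "query_string" : {
      "default_field" : "keyword_field",
      "analyzer" : "keyword",
      "query" : "\"john doe\""
    }
  }
}'

# Print the request body so it can be inspected before sending.
echo "$body"

# Uncomment to send it to a local node (assumed at localhost:9200):
# curl "localhost:9200/test/accounts/_search?pretty=true" -d "$body"
```

Dropping the escaped quotes from the query value should bring back the whitespace splitting, which makes it an easy before/after comparison.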

The text query can be simpler in those cases, as it takes the whole text and analyzes it to generate (different) types of queries.
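
For comparison, a sketch of a text query against a single field, again assuming the index and field names from the original post. Since the standard analyzer indexed "john" and "doe" as separate tokens, I would expect this to match document 2, while the same query on keyword_field should match nothing, because the only token indexed there is "john doe". Treat this as an untested illustration rather than a confirmed result.

```shell
# Assumption: same index/type as in the original post.
# The field name used as the key decides which field (and so which analyzer) is searched.
body='{
  "query" : {
    "text" : {
      "standard_field" : "john"
    }
  }
}'

# Print the request body so it can be inspected before sending.
echo "$body"

# Uncomment to send it to a local node (assumed at localhost:9200):
# curl "localhost:9200/test/accounts/_search?pretty=true" -d "$body"
```

Swapping "standard_field" for "keyword_field" in the body gives the other half of the comparison.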

-shay.banon

On Thursday, February 2, 2012 at 12:02 AM, chimingc wrote:

curl "localhost:9200/test/accounts/_search?pretty=true" -d '{
"query" : {
"query_string" : {
"analyzer" : "keyword", //also tried "standard"
"query" : "john doe" // also tried "john" or "doe"
}
}
}'

So is there any way to search on "all fields" instead of "_all field"?
Or do I have to explicitly list all my fields (can be many) in the query_string?

My goal is to use a different analyzer on each field, and when I search, I want to search all fields, each with its own analyzer.
Thanks.

From: "kimchy [via ElasticSearch Users]" <ml-node+s115913n3717037h51@n3.nabble.com>
Date: Sun, 5 Feb 2012 05:16:59 -0600
To: Jimmy Chen <jchen@sugarcrm.com>
Subject: Re: Analyzers comparison

Here is what happens. First, you use a different analyzer on different fields in the json, but then you use a query_string which by default searches on the _all field (which has its own analyzer setting).

Yes, you will need to list the fields explicitly if you want to search across specific ones. The more fields you list, though, the slower the search will be.
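
Listing the fields could look like the sketch below, which uses the fields parameter of query_string with the two field names from the original post. No analyzer is set in the query, on the assumption that each field then falls back to the analyzer from its mapping; if that holds, "john" should match document 2 only.

```shell
# Assumption: same index/type as in the original post.
# No "analyzer" is given here, so each listed field is expected to use its own.
body='{
  "query" : {
    "query_string" : {
      "fields" : ["keyword_field", "standard_field"],
      "query" : "john"
    }
  }
}'

# Print the request body so it can be inspected before sending.
echo "$body"

# Uncomment to send it to a local node (assumed at localhost:9200):
# curl "localhost:9200/test/accounts/_search?pretty=true" -d "$body"
```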

On Monday, February 6, 2012 at 1:08 AM, chimingc wrote:

So is there any way to search on "all fields" instead of "_all field"?
Or do I have to explicitly list all my fields (can be many) in the query_string?

My goal is to use a different analyzer on each field, and when I search, I want to search all fields, each with its own analyzer.
Thanks.


Hi Kimchy,

I was facing the same problem, but I am now able to search using the
solution you gave.
However, I have one more problem: if I use "*" in a query that
contains a hyphen ("-"), the query is split at the hyphen and returns
results that start with the part after it.
It should treat the term as a whole and return the results matching it
with the wildcard character "*".

Here is my query,

{
  "query": {
    "query_string": {
      "default_field": "_all",
      "default_operator": "AND",
      "analyze_wildcard": true,
      "query": "QNAM-45*"
    }
  }
}

As I explained, this query returns results that start with 45. But I need
the results that start with the complete term ("QNAM-45").

Please suggest a solution for this.

Your help would be highly appreciated.

Thanks & Regards,
Pravin
