Searching an index with 2 types using a keyword tokenizer

Hi,

I'm getting strange results trying to search on an index with 2 types using
a keyword tokenizer.

Using ElasticSearch 0.90.2, this query:
curl -XGET 'localhost:9200/myindex/_search?pretty=1' -d '{"query":{"match":{"name":"carex f"}}}'
returns a result containing "Carex" alone (unexpected behaviour).

This one:
curl -XGET 'localhost:9200/myindex/taxon/_search?pretty=1' -d '{"query":{"match":{"name":"carex f"}}}'
returns the expected result, "Carex feta", and not "Carex" alone.

If I do the same thing using ElasticSearch 0.90.1, the 2 queries above will
return the expected results. This could be related to different
configuration but I am using the default configurations on both versions.

So I would like to know: what is the expected ElasticSearch behavior for
the first query?
Could it be related to ES falling back to a default tokenizer (Standard) when
a search spans multiple types?

Here are the current settings:

curl -XPOST "localhost:9200/myindex" -d '
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "name_nGram": {
            "max_gram": 100,
            "min_gram": 2,
            "type": "edge_ngram"
          }
        },
        "analyzer": {
          "name_index": {
            "filter": ["lowercase", "asciifolding", "name_nGram"],
            "tokenizer": "standard"
          },
          "full_name_index": {
            "filter": ["lowercase", "asciifolding"],
            "tokenizer": "keyword"
          },
          "scientificname_index": {
            "filter": ["lowercase", "asciifolding", "name_nGram"],
            "tokenizer": "keyword"
          },
          "name_search": {
            "filter": ["lowercase", "asciifolding"],
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "taxon": {
      "properties": {
        "name": {
          "type": "multi_field",
          "fields": {
            "name": {
              "type": "string",
              "index_analyzer": "full_name_index",
              "search_analyzer": "name_search"
            },
            "ngrams": {
              "type": "string",
              "index_analyzer": "scientificname_index",
              "search_analyzer": "name_search"
            }
          }
        },
        "status": { "index": "not_analyzed", "type": "string" },
        "namehtml": { "index": "not_analyzed", "type": "string" },
        "namehtmlauthor": { "index": "not_analyzed", "type": "string" },
        "rankname": { "index": "not_analyzed", "type": "string" },
        "parentid": { "index": "not_analyzed", "type": "integer" },
        "parentnamehtml": { "index": "not_analyzed", "type": "string" }
      }
    },
    "vernacular": {
      "properties": {
        "name": {
          "type": "multi_field",
          "fields": {
            "name": {
              "type": "string",
              "index": "not_analyzed"
            },
            "ngrams": {
              "type": "string",
              "search_analyzer": "name_search",
              "index_analyzer": "name_index"
            }
          }
        },
        "taxonid": { "index": "not_analyzed", "type": "integer" },
        "status": { "index": "not_analyzed", "type": "string" },
        "language": { "index": "not_analyzed", "type": "string" },
        "taxonnamehtml": { "index": "not_analyzed", "type": "string" }
      }
    }
  }
}'

Add some data:

curl -XPUT 'http://localhost:9200/myindex/taxon/1' -d '{
"name" : "carex"
}'
curl -XPUT 'http://localhost:9200/myindex/taxon/2' -d '{
"name" : "carex feta"
}'

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Christian,
I just tested it with elasticsearch 0.90.2 and got the expected results:
one result when querying for 'carex' and one when querying for 'carex feta'.

Your weird results seem to be caused by indexing ngrams, since you get back
partial results. On the other hand, the recreation that you posted works
fine. I would check the field you're querying on and your mapping again;
maybe you're doing something slightly different from your curl example?
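
To make the point about ngrams concrete, here is a minimal Python sketch of why an edge n-gram sub-field can return partial matches. It is only a rough model of the scientificname_index and name_search analyzers from the mapping in the original post: the asciifolding filter is left out, and the function names are made up for illustration. It is not how Elasticsearch is implemented.

```python
def keyword_lowercase(text):
    """Keyword tokenizer + lowercase filter: the whole input becomes one token."""
    return [text.lower()]


def edge_ngrams(token, min_gram=2, max_gram=100):
    """Prefixes of the token, with lengths from min_gram up to max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_tokens_ngrams(text):
    """Rough model of scientificname_index: keyword + lowercase + edge n-grams."""
    tokens = []
    for token in keyword_lowercase(text):
        tokens.extend(edge_ngrams(token))
    return tokens


# The search side (name_search) has no n-gram filter, so the query stays whole.
query_token = keyword_lowercase("carex f")[0]

print(query_token in index_tokens_ngrams("carex feta"))  # True
print(query_token in index_tokens_ngrams("carex"))       # False
```

Under this model, a partial query such as "carex f" can only hit a field that was indexed with edge n-grams (name.ngrams in the mapping), which is why the field being queried is worth double-checking.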

On Tuesday, July 16, 2013 6:21:59 PM UTC+2, Christian Gendreau wrote:


Hi Luca,

Sorry, there is an error in my post. This query:
curl -XGET 'localhost:9200/myindex/taxon/_search?pretty=1' -d '{"query":{"match":{"name":"carex f"}}}'
returns the expected result: nothing.

Since I'm matching on the field "name", which uses the "keyword" tokenizer, I would
expect to get no match for "carex f".

Am I wrong?

On Wednesday, July 17, 2013 8:20:28 AM UTC-4, Luca Cavanna wrote:


Hi Christian,
Exactly, you can't find partial matches when indexing with the keyword
tokenizer. Only querying for 'carex' or 'carex feta' would find a match,
since that's what you indexed, with no tokenization.
I didn't get whether you're still seeing unexpected behaviour or not,
though. Could you please clarify that?
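
As an illustration of the keyword tokenizer behaviour described above, here is a small Python sketch with a toy inverted index. It assumes a simplified analyzer (keyword tokenizer plus lowercase only, asciifolding omitted), and the helper names are hypothetical.

```python
def analyze_keyword(text):
    """Keyword tokenizer + lowercase: one token for the whole input, no splitting."""
    return [text.lower()]


# Toy inverted index: token -> set of document ids.
index = {}
for doc_id, name in {1: "carex", 2: "carex feta"}.items():
    for token in analyze_keyword(name):
        index.setdefault(token, set()).add(doc_id)


def search(query):
    """Return the ids of documents whose single indexed token equals the query token."""
    hits = set()
    for token in analyze_keyword(query):
        hits |= index.get(token, set())
    return sorted(hits)


print(search("carex"))       # [1]
print(search("carex feta"))  # [2]
print(search("carex f"))     # []
```

Because each name is stored as a single token, only a query that analyzes to exactly that token finds the document.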

Cheers
Luca

On Wed, Jul 17, 2013 at 5:36 PM, Christian Gendreau <
christiangendreau@gmail.com> wrote:


Indeed, I didn't make it clear.
I rebuilt the whole index from scratch and it is now working properly. This is
a little bit scary, since the response of "..._mapping" was identical on
both installations, but I guess I made a mistake somewhere.
Anyway, thank you for confirming that it should work.

The remaining issue I have (which I'm working around here by keeping the
untouched field) is already covered in another thread
(http://elasticsearch-users.115913.n3.nabble.com/boosting-exact-matches-in-edgengram-search-td4035244.html).

Thanks again

Christian

On Wednesday, July 17, 2013 11:41:36 AM UTC-4, Luca Cavanna wrote:


I still have this issue and I can now reproduce it:

Settings and mapping
curl -XPOST "localhost:9200/myindex" -d '
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "name_nGram": {
            "max_gram": 100,
            "min_gram": 2,
            "type": "edge_ngram"
          },
          "strip_hybrid_sign_filter": {
            "pattern": "\u00D7",
            "replacement": "",
            "type": "pattern_replace"
          }
        },
        "analyzer": {
          "name_index": {
            "filter": ["lowercase", "asciifolding", "name_nGram"],
            "tokenizer": "standard"
          },
          "full_name_index": {
            "filter": ["lowercase", "asciifolding"],
            "tokenizer": "keyword"
          },
          "scientificname_index": {
            "filter": ["lowercase", "asciifolding", "strip_hybrid_sign_filter", "name_nGram"],
            "tokenizer": "keyword"
          },
          "name_search": {
            "filter": ["lowercase", "asciifolding"],
            "tokenizer": "keyword"
          },
          "scientificname_search": {
            "filter": ["lowercase", "asciifolding", "strip_hybrid_sign_filter"],
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "taxon": {
      "properties": {
        "name": {
          "type": "multi_field",
          "fields": {
            "name": {
              "type": "string",
              "index_analyzer": "full_name_index",
              "search_analyzer": "name_search"
            },
            "ngrams": {
              "type": "string",
              "index_analyzer": "scientificname_index",
              "search_analyzer": "scientificname_search"
            }
          }
        },
        "status": { "index": "not_analyzed", "type": "string" },
        "namehtml": { "index": "not_analyzed", "type": "string" },
        "namehtmlauthor": { "index": "not_analyzed", "type": "string" },
        "rankname": { "index": "not_analyzed", "type": "string" },
        "parentid": { "index": "not_analyzed", "type": "integer" },
        "parentnamehtml": { "index": "not_analyzed", "type": "string" }
      }
    },
    "vernacular": {
      "properties": {
        "name": {
          "type": "multi_field",
          "fields": {
            "name": {
              "type": "string",
              "index": "not_analyzed"
            },
            "ngrams": {
              "type": "string",
              "search_analyzer": "name_search",
              "index_analyzer": "name_index"
            }
          }
        },
        "taxonid": { "index": "not_analyzed", "type": "integer" },
        "status": { "index": "not_analyzed", "type": "string" },
        "language": { "index": "not_analyzed", "type": "string" },
        "taxonnamehtml": { "index": "not_analyzed", "type": "string" }
      }
    }
  }
}'

Add data:
curl -XPUT 'http://localhost:9200/myindex/taxon/1' -d '{
"name" : "×Achnella"
}'
curl -XPUT 'http://localhost:9200/myindex/taxon/2' -d '{
"name" : "carex feta"
}'

I get no result for this query:
curl -XGET 'localhost:9200/myindex/_search?pretty=1' -d '{"query":{"match":{"name":"×Achnella"}}}'

But I do get the row if I use this query:
curl -XGET 'localhost:9200/myindex/taxon/_search?pretty=1' -d '{"query":{"match":{"name":"×Achnella"}}}'

Maybe it's related to my "strip_hybrid_sign_filter"?
But the result of this command looks good:
curl -XGET 'localhost:9200/myindex/_analyze?field=name&pretty=1' -d "×Achnella"

Any idea?

Thanks


Hi Christian,
I was able to reproduce your issue.

The mapping for the field name is different in the two types that you have.
In one you used the keyword tokenizer plus the lowercase filter etc., while in
the other the field is not analyzed.

Elasticsearch has to pick one analyzer for the query here, and unfortunately it
picks the wrong one in your case. It ends up not analyzing the query and
searching for exactly the term you typed, while the index holds the lowercased
version, so there's no match.

When you specify the type, the problem doesn't exist, since there's only one
field called name under that type and only one analyzer, so there is no
choice to be made.

I think it was just an error in your mapping, but if you do want to have
fields with the same name and different analyzers under the same index, well,
that's not a good idea. You'd be better off with two separate indices, since the
query would then be analyzed separately for each index.
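
The mismatch can be sketched in a few lines of Python. This is only a model of the explanation above, not of Elasticsearch internals: the taxon analyzer is reduced to keyword tokenizer plus lowercase (asciifolding is left out as a simplifying assumption), and the function names are invented for the example.

```python
def full_name_index(text):
    """Rough model of the taxon analyzer: keyword tokenizer + lowercase."""
    return text.lower()


def not_analyzed(text):
    """The vernacular mapping for name: the value is used verbatim."""
    return text


# What the index holds for taxon/1 after analysis at index time.
indexed_term = full_name_index("×Achnella")
print(indexed_term)  # ×achnella

# Index-wide search: per the explanation, the query ends up not analyzed,
# so the capital "A" survives and the terms differ.
print(not_analyzed("×Achnella") == indexed_term)     # False, no hit

# Search scoped to the taxon type: the query is lowercased like the index.
print(full_name_index("×Achnella") == indexed_term)  # True, hit
```

The only difference between the two searches is which analysis is applied to the query string, which is enough to turn a hit into a miss.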

Hope this clarifies things for you

Cheers
Luca


Hi Luca,
Thanks for the explanations, that helps a lot to understand what is going
on.

Yes, the mapping is intentionally different for the 2 types.
The goal was to get merged results from the 2 types, with one query, using
the same field name. Since my 2 types are different "kind" of data, I have
2 different analyzers.

Then, I though ES would take the analyzer mapped to the field depending on
the type if I do not specify a specific type in my query.
I guess this is where I was wrong.

What would be my best option?
Considering I want to merge the results of my 2 types when I send a query,
should I

  1. rename my fields and use a multi_match query
  2. use 2 separate indices and an indices query

Regards,

Christian

On Monday, July 22, 2013 10:52:39 AM UTC-4, Luca Cavanna wrote:

Hi Christian,
I was able to reproduce your issue.

The mapping for the field name is different in the two types that you
have. In one you used the keyword tokenizer + lowercase filter etc., while in
the other one the field is not analyzed.

Elasticsearch has to pick an analyzer here for the query, and it picks the
wrong one in your case unfortunately. In fact it ends up not analyzing the
query and querying for exactly the same term you use in the query, while in
the index you have the lowercased version, thus there's no match.

When you specify the type the problem doesn't exist since there's only one
field called name under that type and there's only one analyzer, thus no
choice to be made.
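
As a rough illustration of the mismatch described above, here is a toy model in Python. It is not Elasticsearch code: the helper names are made up for this sketch, and the accent stripping is only an approximation of the asciifolding filter. It shows that a term indexed through a keyword tokenizer plus lowercase filter is found by a query analyzed the same way, but not by a query that is left unanalyzed.

```python
import unicodedata


def keyword_lowercase_ascii(text):
    """Toy stand-in for keyword tokenizer + lowercase + asciifolding."""
    # Keyword tokenizer: the whole input stays a single token.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Rough approximation of asciifolding: drop combining accents.
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return [folded]


def not_analyzed(text):
    """A not_analyzed field keeps the input verbatim as one term."""
    return [text]


def matches(index_terms, query_terms):
    """A query matches if any of its terms is present in the index."""
    return any(term in index_terms for term in query_terms)


# Terms stored for a document indexed through the analyzed mapping.
indexed = set(keyword_lowercase_ascii("Carex Feta"))

# Query analyzed like the taxon mapping: finds the lowercased term.
print(matches(indexed, keyword_lowercase_ascii("Carex Feta")))  # True
# Query left unanalyzed: the exact-case term is not in the index.
print(matches(indexed, not_analyzed("Carex Feta")))  # False
```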

I think it was just an error in your mapping, but if you do want to have
fields with the same name and different analyzers under the same index, well
that's not a good idea. You'd better go for two separate indices since the
query would then be analyzed differently per index.

Hope this clarifies things for you

Cheers
Luca


Hey Christian,
OK, the reason why it doesn't work under the same type is that we execute a
single Lucene query per index (which is composed of one or more shards, that
are effectively the Lucene indices). Thus making it work would mean
executing multiple Lucene queries (analyzed differently) on the same shard,
which is not really what we want.

Both the options to solve the problem look good. I'd say it depends on how
much it's important for you to keep the two datasets on the same index, or
how much it bothers you to use two different names for the two fields.
Don't know enough about your domain to make any choice but both ways are
fine.
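
To make the two options a bit more concrete, here is a minimal sketch in Python that only builds the search path and request body for each one. The field names (scientific_name, vernacular_name) and index names (taxa, vernaculars) are hypothetical, since nothing in this thread fixes them. For the two-index case it uses a plain match query sent to both indices, relying on the query being analyzed per index as described above; an indices query is not shown.

```python
import json

# Hypothetical names: the thread does not say what the renamed fields
# or the separate indices would be called.
TAXON_FIELD = "scientific_name"
VERNACULAR_FIELD = "vernacular_name"
TAXON_INDEX = "taxa"
VERNACULAR_INDEX = "vernaculars"


def multi_match_search(text):
    """Option 1: one index, two differently named fields."""
    path = "/myindex/_search"
    body = {
        "query": {
            "multi_match": {
                "query": text,
                "fields": [TAXON_FIELD, VERNACULAR_FIELD],
            }
        }
    }
    return path, body


def two_index_search(text):
    """Option 2: same field name, one index per dataset."""
    path = "/{},{}/_search".format(TAXON_INDEX, VERNACULAR_INDEX)
    body = {"query": {"match": {"name": text}}}
    return path, body


# Print the path and JSON body for each option.
for path, body in (multi_match_search("carex f"), two_index_search("carex f")):
    print(path, json.dumps(body))
```

Each printed body could be passed with -d to a curl search call against the printed path, in the same way as the queries in the original post.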

Cheers
Luca
