Issue with searching in fields


(Matthias Johnson) #1

Good day to all!

We are struggling with what seems like a potential problem with ES and
searching for data in fields. To simplify, I reduced some documents to only
the salient points. Consider the following two simple documents:

test1:

{
"field1":"LANG000000904",
"field2":"LANG000000904"
}

test2:

{
"field1":"monkey",
"field2":"LANG000000904"
}

I indexed the documents under /test/type (index test, type type) with IDs 1
and 2 respectively.

The mapping for this comes out with string type for both fields.

{
"test": {
"type": {
"properties": {
"field1": {
"type": "string"
},
"field2": {
"type": "string"
}
}
}
}
}

Note that field1 contains LANG000000904 in the first document and monkey
in the second.

Now when I search for LANG000000904 I get 0 hits:

curl -s 'http://localhost:9200/test/type/_search?pretty' -d '{ "query" : { "term" : { "field1":"LANG000000904" } } }'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}

However, searching for monkey I get one result as expected:

curl -s 'http://localhost:9200/test/type/_search?pretty' -d '{ "query" : { "term" : { "field1":"monkey" } } }'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "test",
"_type" : "type",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"field1":"monkey",
"field2":"LANG000000904"
}
} ]
}
}

It seems to me that the first search for LANG000000904 should return 1
hit for document 1, but the alphanumeric string is somehow not found while
the purely alphabetic string is found. Are we missing something that would
make this work correctly?

Additionally, we tested the GET URI requests for searching and those appear
to be working as expected:

curl -s 'http://localhost:9200/test/type/_search?q=field1:LANG000000904&pretty'

{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "test",
"_type" : "type",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"field1":"LANG000000904",
"field2":"LANG000000904"
}
} ]
}
}

It seems that perhaps something is not working correctly with the POST/JSON
query, but perhaps we are not doing it right.

Any comments and ideas would be much appreciated.

Thanks,

@matthias

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Rafał Kuć) #2

Hello!

Note that the term query is not analyzed, while the URI query is analyzed. For inputs like LANG000000904 you probably want your fields to be not analyzed (set index="not_analyzed" on the string field), so that you can match exactly the field value you've indexed. On the other hand, for fields that require full-text searching, you want the default behavior, which is to analyze the content of the field.
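As a rough sketch of what such a mapping could look like, the snippet below builds a create-index body in which field1 is not analyzed. It assumes an Elasticsearch version from the time of this thread, where string fields accept the index option. The variable name create_index_body is made up for illustration, and sending it as a PUT to http://localhost:9200/test is an assumption about your setup. Since the index setting of an existing field generally cannot be changed in place, the index would likely need to be recreated and the documents indexed again.

```python
import json

# Hypothetical create-index body: field1 is kept as a single, unmodified term,
# while field2 keeps the default (analyzed) behavior.
create_index_body = {
    "mappings": {
        "type": {
            "properties": {
                "field1": {"type": "string", "index": "not_analyzed"},
                "field2": {"type": "string"},
            }
        }
    }
}

# Serialize to JSON, for example to use as the request body when creating the index.
print(json.dumps(create_index_body, indent=2))
```

With a mapping along these lines, a term query for the exact value LANG000000904 on field1 should match, while the lowercased form should not.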

Now for the reason for the behavior: it is caused by the default analyzer being used. You can check what is going on by using the Indices Analyze API provided by ElasticSearch (http://www.elasticsearch.org/guide/reference/api/admin-indices-analyze/). With this API you can see how your data is analyzed, for example:

$ curl -XGET 'localhost:9200/test/_analyze?text=LANG000000904&pretty=true'

This will probably result in something like this:

{
"tokens" : [ {
"token" : "lang000000904",
"start_offset" : 0,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}

You can try your term query with the lowercased lang000000904 to see if it works, but I suppose it will.

This behavior is expected btw, because some of the queries provided by ElasticSearch are analyzed and some are not (like the term query).
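To make the mechanism concrete, here is a toy model in Python. It is not Elasticsearch code: analyze, build_index and term_query are hypothetical helpers written for this sketch, and the analyzer is a simplified stand-in that only splits on non-alphanumeric characters and lowercases.

```python
import re

def analyze(text):
    """Toy stand-in for the default analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(docs, field):
    """Map each analyzed token of `field` to the set of document ids containing it."""
    index = {}
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            index.setdefault(token, set()).add(doc_id)
    return index

def term_query(index, term):
    """Look the term up exactly as given; no analysis is applied to it."""
    return sorted(index.get(term, set()))

docs = {
    "1": {"field1": "LANG000000904", "field2": "LANG000000904"},
    "2": {"field1": "monkey", "field2": "LANG000000904"},
}
field1_index = build_index(docs, "field1")

print(analyze("LANG000000904"))                   # ['lang000000904']
print(term_query(field1_index, "LANG000000904"))  # []
print(term_query(field1_index, "lang000000904"))  # ['1']
print(term_query(field1_index, "monkey"))         # ['2']
```

The index only ever contains the lowercased token, so the uppercase term has nothing to match, while monkey is unchanged by lowercasing and is found either way.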

--

Regards,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch

(Matthias Johnson) #3

Fascinating. Thank you Rafał!

Turns out you are correct. Lowercasing the search string does return the
desired results, as you suspected. Using not_analyzed also works. This
seems to imply that when indexing without the not_analyzed option,
everything is lowercased in the index, giving us essentially
case-insensitive searching, while setting not_analyzed preserves the case.
Is that correct?

Since the GET URI syntax applies the analyzer, is there a way to force
that in the POST/JSON request as well? (That seems to lack orthogonality
a little, since I'd expect the queries to behave the same.)

@matthias

(In reply to Rafał Kuć's message of Friday, September 6, 2013, 10:32:50 AM UTC-6.)


(Rafał Kuć) #4

Hello!

How the data is analyzed depends on the analyzer used. You can look at the list of analyzers available in ElasticSearch by default, and find information about analysis in general, here: http://www.elasticsearch.org/guide/reference/index-modules/analysis/ .

As for the queries: yes, this is also possible with JSON queries. Look at the query DSL reference (http://www.elasticsearch.org/guide/reference/query-dsl/) and you'll see that, for example, the match query is analyzed. This query should work in your case:

curl -s 'http://localhost:9200/test/type/_search?pretty' -d '{ "query" : { "match" : { "field1":"LANG000000904" } } }'
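The difference between the two query types can be sketched with a small Python model. This is only an illustration under simplifying assumptions, not how Elasticsearch is implemented: the analyzer here just lowercases and splits on whitespace, and inverted_index, term_query and match_query are names invented for this example.

```python
def analyze(text):
    """Toy analyzer: whitespace tokenization plus lowercasing."""
    return text.lower().split()

# Toy inverted index for field1, holding the analyzed (lowercased) tokens.
inverted_index = {"lang000000904": {"1"}, "monkey": {"2"}}

def term_query(value):
    """Exact lookup: the query value is used as-is."""
    return sorted(inverted_index.get(value, set()))

def match_query(text):
    """Analyze the query text first, then look up each resulting token."""
    hits = set()
    for token in analyze(text):
        hits |= inverted_index.get(token, set())
    return sorted(hits)

print(term_query("LANG000000904"))   # []
print(match_query("LANG000000904"))  # ['1']
```

Because the match-style lookup runs the query text through the same analysis as the indexed data, the uppercase input ends up as the lowercased token that is actually in the index.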

--

Regards,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch



(Matthias Johnson) #5

Awesome. That all makes perfect sense now. Thank you very much!

@matthias

(In reply to Rafał Kuć's message of Friday, September 6, 2013, 12:32:31 PM UTC-6.)

