Keyword analyzer / lowercase filter


(Jan Kriesten) #1

Hi,

I have configured ES (0.17.4) using

index :
  analysis:
    analyzer:
      default:
        type: keyword
        max_token_length: 512
        tokenizer : lowercase
        filter : lowercase

I would now expect ES to lowercase all indexed terms and queries before
execution. Trying the following gives no match, though:

curl -XPUT 'http://localhost:9200/my/test/1' -d '{ "file.title": "German shepherd, doberman, greyhound" }'

curl -XGET 'http://localhost:9200/my/test/_search?pretty=true' -d '{
  "query" : {
    "constant_score" : {
      "filter" : {
        "prefix" : { "file.title" : "germ" }
      }
    }
  }
}'

When searching for the 'prefix' "Germ" it gives a match.

What am I missing?

Best regards, --- Jan.


(Shay Banon) #2

Your configuration for the analyzer is not correct. When you set the type to
keyword, ES uses the built-in keyword analyzer, and the tokenizer/filter
settings do not apply to it. Here is a configuration that combines the
keyword tokenizer with the lowercase filter instead:

index :
  analysis:
    analyzer:
      default:
        tokenizer : keyword
        filter : lowercase
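
To illustrate the difference between the two configurations, here is a rough
shell simulation. It does not talk to Elasticsearch at all; the function
names (keyword_analyzer, keyword_lowercase_analyzer, has_prefix) are
hypothetical helpers that only mimic, under simplifying assumptions, which
single term ends up in the index and how an unanalyzed prefix is compared
against it.

```shell
#!/bin/sh
# Simulation only: no Elasticsearch involved, ASCII input assumed.

# "type: keyword" analyzer: the whole value becomes one unmodified term.
keyword_analyzer() {
  printf '%s' "$1"
}

# keyword tokenizer + lowercase filter: still one term, but lowercased.
keyword_lowercase_analyzer() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Term-level prefix check; the prefix itself is compared as typed.
has_prefix() {
  case "$1" in
    "$2"*) echo "match" ;;
    *) echo "no match" ;;
  esac
}

title='German shepherd, doberman, greyhound'

has_prefix "$(keyword_analyzer "$title")" 'germ'            # prints "no match"
has_prefix "$(keyword_analyzer "$title")" 'Germ'            # prints "match"
has_prefix "$(keyword_lowercase_analyzer "$title")" 'germ'  # prints "match"
```

The first two calls mirror the behaviour reported in the original post; the
third shows what the corrected configuration is expected to change.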


(Jan Kriesten) #3

Hi Shay,

Thanks, it works so far, except that passing an all-uppercase prefix does
not give a match. Shouldn't the lowercase filter be applied to the search
phrase as well?

curl -XGET 'http://localhost:9200/my/test/_search?pretty=true' -d '{
  "query" : {
    "constant_score" : {
      "filter" : {
        "prefix" : { "file.title" : "GERM" }
      }
    }
  }
}'

Best regards, --- Jan.


(Shay Banon) #4

The prefix filter is not analyzed (and thus not lowercased). Try using the
text prefix query instead:
http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html.
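
For illustration, a minimal sketch of what such a request body might look
like. It assumes the text_phrase_prefix form described on the linked page is
available in the version in use; the helper name text_prefix_body is
hypothetical, and the index, type and field names are taken from the earlier
posts.

```shell
#!/bin/sh
# Sketch: build an analyzed prefix query body, so that the field's analyzer
# (and therefore the lowercase filter) is applied to the search text.
# No JSON escaping is done; plain ASCII field names and terms are assumed.
text_prefix_body() {
  # $1 = field name, $2 = prefix exactly as typed by the user
  printf '{ "query": { "text_phrase_prefix": { "%s": "%s" } } }' "$1" "$2"
}

body=$(text_prefix_body 'file.title' 'GERM')
echo "$body"

# To try it against a local node (assumed to listen on localhost:9200):
# curl -XGET 'http://localhost:9200/my/test/_search?pretty=true' -d "$body"
```

Whether this matches depends on the analyzer configured for the field, so
treat it as a starting point rather than a verified fix.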


(Jan Fiedler) #5

You are using a prefix filter (in a constant score query). The prefix filter
is similar to a term filter and does not analyze the term (i.e. it does not
apply the lowercase filter to your uppercase term). In your scenario, I
think it's best to lowercase the term in the query (on the client side) to
match what the analyzer does at indexing time.
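
A minimal sketch of that client-side approach in shell, assuming plain ASCII
input. The helper names lowercase and prefix_filter_body are hypothetical;
the query shape is the constant_score / prefix filter from the earlier posts.

```shell
#!/bin/sh
# Lowercase the user's input on the client, mirroring the lowercase filter
# that runs at indexing time (ASCII only; no JSON escaping is done).
lowercase() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Build the constant_score / prefix filter body with the lowercased term.
prefix_filter_body() {
  # $1 = field name, $2 = raw prefix typed by the user
  term=$(lowercase "$2")
  printf '{ "query": { "constant_score": { "filter": { "prefix": { "%s": "%s" } } } } }' "$1" "$term"
}

body=$(prefix_filter_body 'file.title' 'GERM')
echo "$body"
# prints { "query": { "constant_score": { "filter": { "prefix": { "file.title": "germ" } } } } }

# To send it to a local node (assumed to listen on localhost:9200):
# curl -XGET 'http://localhost:9200/my/test/_search?pretty=true' -d "$body"
```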


(Christopher Burkey-2) #6

I am having a related issue.

We are not getting results back unless we pre-format the search term.

curl -XGET 'http://localhost:9200/system/group/_search' -d '{ "query" : { "text" : { "_all" : "Testing" } } }'
Returns 0 hits

curl -XGET 'http://localhost:9200/system/group/_search' -d '{ "query" : { "text" : { "_all" : "testing" } } }'
Returns 0 hits

curl -XGET 'http://localhost:9200/system/group/_search' -d '{ "query" : { "text" : { "_all" : "test" } } }'
Returns 1 hit!

"_shards": {
  "failed": 0,
  "successful": 5,
  "total": 5
},
"hits": {
  "hits": [
    {
      "_id": "testid",
      "_index": "system",
      "_score": 0.13561106,
      "_source": {
        "id": "testid",
        "name": "Testing"
      },
      "_type": "group"
    }
  ],
  "max_score": 0.13561106,
  "total": 1
},
"timed_out": false,
"took": 33

Here is my setup:

"cluster_name": "entermedia-test",
"master_node": "LfnoRHg1SYicB1p1rFSdrg",
"metadata": {
  "indices": {
    "system": {
      "aliases": [],
      "mappings": {
        "group": {
          "properties": {
            "_all": {
              "analyzer": "lowersnowball",
              "type": "string"
            },
            "id": {
              "include_in_all": true,
              "index": "not_analyzed",
              "store": "yes",
              "type": "string"
            },
            "name": {
              "include_in_all": true,
              "index": "not_analyzed",
              "store": "yes",
              "type": "string"
            }
          }
        }
      },
      "settings": {
        "index.analysis.analyzer.lowersnowball.filter.0": "snowball",
        "index.analysis.analyzer.lowersnowball.filter.1": "standard",
        "index.analysis.analyzer.lowersnowball.filter.2": "lowercase",
        "index.analysis.analyzer.lowersnowball.tokenizer": "standard",
        "index.analysis.analyzer.lowersnowball.type": "custom",
        "index.number_of_replicas": "1",
        "index.number_of_shards": "5"
      },
      "state": "open"
    }
  },


(Christopher Burkey-2) #7

Here is a quick test case that shows the problem:

# Start from a clean index.
curl -XDELETE localhost:9200/twitter | python -mjson.tool

# Create the index with a custom analyzer "a2" (lowercase + snowball).
curl -XPOST localhost:9200/twitter -d '
{"index":
  { "number_of_shards": 1,
    "analysis": {
      "filter": {
        "snowball": {
          "type" : "snowball",
          "language" : "English"
        }
      },
      "analyzer": { "a2" : {
        "type":"custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "snowball"]
        }
      }
    }
  }
}' | python -mjson.tool

sleep 1

# Assign "a2" to both the message field and the _all field.
curl -XPUT localhost:9200/twitter/tweet/_mapping -d '{
  "tweet" : {
    "properties" : {
      "_all" : {"type" : "string", "analyzer":"a2"},
      "message" : {"type" : "string", "analyzer":"a2", "include_in_all":"true"},
      "user": {"type":"string"}
    }
  }
}' | python -mjson.tool

sleep 1

curl -XPUT http://localhost:9200/twitter/tweet/1 -d '{ "user": "kimchy", "message": "Trying out searching teaching, so far so good?" }' | python -mjson.tool

sleep 1

# Search on the message field.
curl -XGET 'localhost:9200/twitter/tweet/_search?q=message:teaching' | python -mjson.tool

sleep 1

# Same search on the _all field.
curl -XGET 'localhost:9200/twitter/tweet/_search?q=_all:teaching' | python -mjson.tool

echo "Should have a hit"


(Jan Fiedler) #8

I repeated your steps (without python, though) and did not get a hit with the
last curl query, so I believe I reproduced the problem. Looking at your
mapping definition, I found the explicit definition of the _all field a
little strange: you already define the analyzer on the message field, and
then you define it again on the _all field. I suspected that this might lead
to the analyzer being applied twice (I cannot prove this, though). In any
case, after removing the mapping for the _all field I got the same hit for
both queries.

However, what still confuses me is that I do not get the same result when
searching for 'teach' instead of 'teaching'. This works on the message field
(which is the point of the snowball stemmer), but it gives me no hits on the
_all field. If I use your original mapping (with the analyzer assigned to
both fields), I get no hits for 'teaching' on _all but I do get a hit for
'teach' on _all.

I may be doing or testing things wrong, but it smells a little buggy around
the _all field.


(Christopher Burkey-2) #9

You should be able to define the _all field just like any other field.
The bottom line is that there is a bug here: the _all field always uses
the default search_analyzer instead of the specified one. This means
you can't do lowercasing or stemming with _all.

Try this minimal test:

# Start from a clean index.
curl -XDELETE localhost:9200/twitter | python -mjson.tool

# Create the index with a custom analyzer "a2" (lowercase + snowball).
curl -XPOST localhost:9200/twitter -d '
{"index":
  { "number_of_shards": 1,
    "analysis": {
      "filter": {
        "snowball": {
          "type" : "snowball",
          "language" : "English"
        }
      },
      "analyzer": { "a2" : {
        "type":"custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "snowball"]
        }
      }
    }
  }
}' | python -mjson.tool

sleep 1

# Assign "a2" to the message field only; _all is left unmapped.
curl -XPUT localhost:9200/twitter/tweet/_mapping -d '{
  "tweet" : {
    "properties" : {
      "message" : {"type" : "string", "analyzer":"a2", "include_in_all":"true"},
      "user": {"type":"string"}
    }
  }
}' | python -mjson.tool

sleep 1

curl -XPUT http://localhost:9200/twitter/tweet/1 -d '{ "user": "kimchy", "message": "Trying out searching teaching, so far so good?" }' | python -mjson.tool

sleep 1

# Search for the stem on the message field.
curl -XGET 'localhost:9200/twitter/tweet/_search?q=message:teach' | python -mjson.tool

sleep 1

# Same search on the _all field.
curl -XGET 'localhost:9200/twitter/tweet/_search?q=_all:teach' | python -mjson.tool

echo "Should have a hit"

