Keyword analyzer / lowercase filter

Jan_Kriesten · August 9, 2011, 5:52am

Hi,

I have configured ES (0.17.4) using

index :
  analysis:
    analyzer:
      default:
        type: keyword
        max_token_length: 512
        tokenizer : lowercase
        filter : lowercase

I would now expect ES to downcase all indices/queries before execution.
Trying the following gives no match, though:

curl -XPUT 'http://localhost:9200/my/test/1' -d '{ "file.title": "German shepherd, doberman, greyhound" }'

curl -XGET 'http://localhost:9200/my/test/_search?pretty=true' -d '{
  "query" : {
    "constant_score" : {
      "filter" : {
        "prefix" : { "file.title" : "germ" }
      }
    }
  }
}'

When searching for the 'prefix' "Germ" it gives a match.

What am I missing?

Best regards, --- Jan.

kimchy · August 9, 2011, 9:27am

Your configuration for the analyzer is not correct. When you set the type to
keyword, it uses the keyword analyzer, and the tokenizer/filter settings
are not relevant for it. Here is another configuration:

index :
  analysis:
    analyzer:
      default:
        tokenizer : keyword
        filter : lowercase
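
To see why the two configurations behave differently, here is a toy Python emulation. It is not Elasticsearch code, and all function names are made up for illustration. It assumes that the keyword analyzer emits the whole field value as one unchanged token, that the keyword tokenizer followed by the lowercase filter emits the same single token in lowercase, and that a prefix filter compares the prefix exactly as given against the indexed tokens.

```python
def keyword_analyzer(text):
    """Toy stand-in for 'type: keyword': the whole input becomes one unchanged token."""
    return [text]


def keyword_lowercase_analyzer(text):
    """Toy stand-in for 'tokenizer: keyword' + 'filter: lowercase'."""
    return [token.lower() for token in keyword_analyzer(text)]


def prefix_matches(indexed_tokens, prefix):
    """Toy stand-in for an unanalyzed prefix filter: the prefix is used as given."""
    return any(token.startswith(prefix) for token in indexed_tokens)


title = "German shepherd, doberman, greyhound"

# Original configuration: the token keeps its case.
original = keyword_analyzer(title)
print(prefix_matches(original, "Germ"), prefix_matches(original, "germ"))

# Suggested configuration: the token is lowercased at index time.
fixed = keyword_lowercase_analyzer(title)
print(prefix_matches(fixed, "germ"), prefix_matches(fixed, "GERM"))
```

Under these assumptions the first line prints True False (only "Germ" matches) and the second prints True False (only the lowercase prefix matches), which is consistent with the behaviour reported in this thread.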

Jan_Kriesten · August 9, 2011, 2:47pm

Hi Shay,

thanks, it works so far, except that giving an all-uppercase prefix
doesn't give a match. Shouldn't the lowercase filter be applied to the
search phrase as well?

curl -XGET 'http://localhost:9200/my/test/_search?pretty=true' -d '{
  "query" : {
    "constant_score" : {
      "filter" : {
        "prefix" : { "file.title" : "GERM" }
      }
    }
  }
}'

Best regards, --- Jan.

kimchy · August 9, 2011, 3:30pm

The prefix filter does not get analyzed (and thus is not lowercased). Try
using the text prefix query instead.
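
A sketch of what the request body for such a query could look like. This assumes that the "text prefix query" means the text query with type phrase_prefix from the query DSL of that era; the helper name is hypothetical. Because the text query analyzes its input, the uppercase prefix would pass through the field's analyzer (and its lowercase filter) before matching.

```python
import json


def text_prefix_query(field, text):
    """Build a text query of type phrase_prefix (assumed form, see the note above)."""
    return {
        "query": {
            "text": {
                field: {"query": text, "type": "phrase_prefix"}
            }
        }
    }


body = text_prefix_query("file.title", "GERM")
# The printed JSON can be passed to curl with -d.
print(json.dumps(body, indent=2))
```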

Jan_Fiedler · August 9, 2011, 3:31pm

You are using a prefix filter (in a constant score query). The prefix filter
is similar to a term filter and does not analyze the term (i.e. it does not
apply the lowercase filter to your uppercase term). In your scenario, I
think it's best to lowercase the term in the query (on the client side) to
match what the analyzer does at indexing time.
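
The client-side lowercasing could be sketched like this in Python. The helper name is made up for illustration; the request body is the constant_score/prefix filter from the original question, with the prefix lowercased before it is sent.

```python
import json


def lowercase_prefix_filter(field, prefix):
    """Build the constant_score/prefix request with the prefix lowercased client-side."""
    return {
        "query": {
            "constant_score": {
                "filter": {"prefix": {field: prefix.lower()}}
            }
        }
    }


body = lowercase_prefix_filter("file.title", "GERM")
# The printed JSON can be passed to curl with -d.
print(json.dumps(body))
```

Note that Python's str.lower() and the analyzer's lowercase filter may not agree for every non-ASCII character, so this is only a sketch for simple input.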

Christopher_Burkey_2 · August 9, 2011, 8:53pm

I am having a related issue.

We are not getting results back unless we pre-format the search term.

curl -XGET 'http://localhost:9200/system/group/_search' -d '{ "query" : { "text" : { "_all" : "Testing" } } }'
Returns 0 hits

curl -XGET 'http://localhost:9200/system/group/_search' -d '{ "query" : { "text" : { "_all" : "testing" } } }'
Returns 0 hits

curl -XGET 'http://localhost:9200/system/group/_search' -d '{ "query" : { "text" : { "_all" : "test" } } }'
Returns 1 hit!

"_shards": {
  "failed": 0,
  "successful": 5,
  "total": 5
},
"hits": {
  "hits": [
    {
      "_id": "testid",
      "_index": "system",
      "_score": 0.13561106,
      "_source": {
        "id": "testid",
        "name": "Testing"
      },
      "_type": "group"
    }
  ],
  "max_score": 0.13561106,
  "total": 1
},
"timed_out": false,
"took": 33

Here is my setup:

"cluster_name": "entermedia-test",
"master_node": "LfnoRHg1SYicB1p1rFSdrg",
"metadata": {
  "indices": {
    "system": {
      "aliases": ,
      "mappings": {
        "group": {
          "properties": {
            "_all": {
              "analyzer": "lowersnowball",
              "type": "string"
            },
            "id": {
              "include_in_all": true,
              "index": "not_analyzed",
              "store": "yes",
              "type": "string"
            },
            "name": {
              "include_in_all": true,
              "index": "not_analyzed",
              "store": "yes",
              "type": "string"
            }
          }
        }
      },
      "settings": {
        "index.analysis.analyzer.lowersnowball.filter.0": "snowball",
        "index.analysis.analyzer.lowersnowball.filter.1": "standard",
        "index.analysis.analyzer.lowersnowball.filter.2": "lowercase",
        "index.analysis.analyzer.lowersnowball.tokenizer": "standard",
        "index.analysis.analyzer.lowersnowball.type": "custom",
        "index.number_of_replicas": "1",
        "index.number_of_shards": "5"
      },
      "state": "open"
    }
  },

Christopher_Burkey_2 · August 9, 2011, 9:44pm

Here is a quick test case that shows the problem:

# Start from a clean index
curl -XDELETE localhost:9200/twitter | python -mjson.tool

# Create the index with a custom analyzer "a2" (lowercase + English snowball)
curl -XPOST localhost:9200/twitter -d '
{"index":
  { "number_of_shards": 1,
    "analysis": {
      "filter": {
        "snowball": {
          "type" : "snowball",
          "language" : "English"
        }
      },
      "analyzer": {
        "a2" : {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "snowball"]
        }
      }
    }
  }
}' | python -mjson.tool

sleep 1

# Use "a2" for both the message field and the _all field
curl -XPUT localhost:9200/twitter/tweet/_mapping -d '{
  "tweet" : {
    "properties" : {
      "_all" : {"type" : "string", "analyzer": "a2"},
      "message" : {"type" : "string", "analyzer": "a2", "include_in_all": "true"},
      "user": {"type": "string"}
    }
  }
}' | python -mjson.tool

sleep 1

curl -XPUT http://localhost:9200/twitter/tweet/1 -d '{ "user": "kimchy", "message": "Trying out searching teaching, so far so good?" }' | python -mjson.tool

sleep 1

# Search on the message field
curl -XGET 'localhost:9200/twitter/tweet/_search?q=message:teaching' | python -mjson.tool

sleep 1

# Search on the _all field
curl -XGET 'localhost:9200/twitter/tweet/_search?q=_all:teaching' | python -mjson.tool

echo "Should have a hit"

Jan_Fiedler · August 10, 2011, 9:41am

I repeated your steps (without python, though) and did not get a hit with the
last curl query, so I guess I was able to reproduce it. I then looked at your
mapping definition and found the explicit definition of the _all field a
little strange. You already define the analyzer correctly on the message
field, but you define it again on the _all field. I suspected that this
might lead to the analyzer being applied twice (I cannot prove this, though).
Anyway, after removing the mapping for the _all field I got the same hit for
both queries.

However, what still confuses me is that I do not get the same result when
searching for 'teach' instead of 'teaching'. This works on the message field
(this is the point of the snowball stemmer) but it does not give me hits on
the _all field. If I use your original mapping (with the analyzer assigned
to both fields) I get no hits for 'teaching' on _all but I do get a hit for
'teach' on _all.

I may be doing / testing things wrong, but it smells a little buggy around the
_all field.

Christopher_Burkey_2 · August 10, 2011, 8:43pm

You should be able to define the _all field just like any other field.
The bottom line is that there is a bug here, since the _all field always uses
the default search_analyzer instead of the specified one. This means you
can't do lowercasing or stemming with _all.
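
A toy Python model of the suspected mismatch, for illustration only. The stemmer here just strips a trailing "ing" and is not the real Snowball algorithm, and all function names are made up. It assumes that the document is indexed with a stemming analyzer while the query against _all is analyzed without stemming.

```python
def tokenize(text):
    """Very rough tokenizer: lowercase, drop a little punctuation, split on whitespace."""
    return text.lower().replace(",", " ").replace("?", " ").split()


def toy_stem(token):
    """Stand-in for a stemmer: strips a trailing 'ing'. Not the real Snowball algorithm."""
    return token[:-3] if token.endswith("ing") and len(token) > 5 else token


def a2_analyzer(text):
    """Toy version of a lowercase + stemming analyzer such as a2."""
    return [toy_stem(token) for token in tokenize(text)]


def default_analyzer(text):
    """Toy version of an analyzer that lowercases but does not stem."""
    return tokenize(text)


def search(index_terms, query, search_analyzer):
    """A hit means at least one analyzed query term is among the indexed terms."""
    return any(term in index_terms for term in search_analyzer(query))


indexed = set(a2_analyzer("Trying out searching teaching, so far so good?"))

print(search(indexed, "teaching", a2_analyzer))       # same analyzer on both sides
print(search(indexed, "teaching", default_analyzer))  # query is not stemmed
print(search(indexed, "teach", default_analyzer))     # query already equals the stem
```

In this model only the second search misses, which matches the pattern reported earlier in the thread for the original mapping: no hit for 'teaching' on _all, but a hit for 'teach'.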

Try this minimal test:

# Start from a clean index
curl -XDELETE localhost:9200/twitter | python -mjson.tool

# Create the index with a custom analyzer "a2" (lowercase + English snowball)
curl -XPOST localhost:9200/twitter -d '
{"index":
  { "number_of_shards": 1,
    "analysis": {
      "filter": {
        "snowball": {
          "type" : "snowball",
          "language" : "English"
        }
      },
      "analyzer": {
        "a2" : {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "snowball"]
        }
      }
    }
  }
}' | python -mjson.tool

sleep 1

# Only the message field gets "a2"; _all is not mapped explicitly
curl -XPUT localhost:9200/twitter/tweet/_mapping -d '{
  "tweet" : {
    "properties" : {
      "message" : {"type" : "string", "analyzer": "a2", "include_in_all": "true"},
      "user": {"type": "string"}
    }
  }
}' | python -mjson.tool

sleep 1

curl -XPUT http://localhost:9200/twitter/tweet/1 -d '{ "user": "kimchy", "message": "Trying out searching teaching, so far so good?" }' | python -mjson.tool

sleep 1

# Search for the stem on the message field
curl -XGET 'localhost:9200/twitter/tweet/_search?q=message:teach' | python -mjson.tool

sleep 1

# Search for the stem on the _all field
curl -XGET 'localhost:9200/twitter/tweet/_search?q=_all:teach' | python -mjson.tool

echo "Should have a hit"
