Questions about analysis with TermQueryBuilder and PrefixQueryBuilder

brian_yoder · March 18, 2013, 4:10pm

Input data:

{ "create" : { "_index" : "fizzbuzz", "_type" : "person", "_id" : "7214560012" } }
{ "telno" : "7214560012", "gn" : "Aurelio", "sn" : "Phzee", "o" : "Philadelphia Fizzies" }
{ "create" : { "_index" : "fizzbuzz", "_type" : "person", "_id" : "7214560013" } }
{ "telno" : "7214560013", "gn" : "Ognit", "sn" : "Ferglaps", "o" : "Philadelphia Fizzies" }
{ "create" : { "_index" : "fizzbuzz", "_type" : "person", "_id" : "7214560014" } }
{ "telno" : "7214560014", "gn" : "Bill", "sn" : "Barf", "o" : "Philadelphia Fizzies" }
{ "create" : { "_index" : "fizzbuzz", "_type" : "person", "_id" : "7214560015" } }
{ "telno" : "7214560015", "gn" : "Chuck", "sn" : "Wagon", "o" : "Dallas Debbies" }

Explicitly added settings and mappings that were specified before any of
the data was loaded:

{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "analysis" : {
        "char_filter" : { },
        "filter" : {
          "english_snowball_filter" : {
            "type" : "snowball",
            "language" : "English"
          }
        },
        "analyzer" : {
          "english_standard_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase" ]
          },
          "english_stemming_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "english_snowball_filter" ]
          }
        }
      }
    }
  },
  "mappings" : {
    "person" : {
      "_all" : {
        "enabled" : false
      },
      "properties" : {
        "telno" : {
          "type" : "string",
          "analyzer" : "english_standard_analyzer"
        },
        "gn" : {
          "type" : "string",
          "analyzer" : "english_standard_analyzer"
        },
        "sn" : {
          "type" : "string",
          "analyzer" : "english_stemming_analyzer"
        },
        "o" : {
          "type" : "string",
          "analyzer" : "english_stemming_analyzer"
        }
      }
    }
  }
}

To verify the mappings: the gn field is lowercased, with no stop word
removal and no stemming:

$ curl -XGET 'http://localhost:9200/fizzbuzz/_analyze?field=gn&pretty=true' -d "Bobby and Debbie"
{
  "tokens" : [ {
    "token" : "bobby",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "",
    "position" : 1
  }, {
    "token" : "and",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "",
    "position" : 2
  }, {
    "token" : "debbie",
    "start_offset" : 10,
    "end_offset" : 16,
    "type" : "",
    "position" : 3
  } ]
}

The "o" field is also lowercased with no stop filter, but it is additionally
stemmed using the English snowball filter:

$ curl -XGET 'http://localhost:9200/fizzbuzz/_analyze?field=o&pretty=true' -d "Bobby and Debbie"
{
  "tokens" : [ {
    "token" : "bobbi",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "",
    "position" : 1
  }, {
    "token" : "and",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "",
    "position" : 2
  }, {
    "token" : "debbi",
    "start_offset" : 10,
    "end_offset" : 16,
    "type" : "",
    "position" : 3
  } ]
}

Next, a simple query on "sn" for the value "PHZEE", built with
MatchQueryBuilder. The value is analyzed at query time the same way that it
is analyzed at index time, and the expected record is found:

$ curl -XGET 'http://localhost:9200/fizzbuzz/person/_search?pretty=true' -d'
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "match" : {
      "sn" : {
        "query" : "PHZEE",
        "type" : "boolean"
      }
    }
  },
  "version" : true,
  "explain" : false
}'

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.6931472,
    "hits" : [ {
      "_index" : "fizzbuzz",
      "_type" : "person",
      "_id" : "7214560012",
      "_version" : 1,
      "_score" : 1.6931472, "_source" : { "telno" : "7214560012", "gn" : "Aurelio", "sn" : "Phzee", "o" : "Philadelphia Fizzies" }
    } ]
  }
}

Now a TermQueryBuilder was used, but it no longer matches PHZEE:

$ curl -XGET 'http://localhost:9200/fizzbuzz/person/_search?pretty=true' -d'
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "term" : {
      "sn" : "PHZEE"
    }
  },
  "version" : true,
  "explain" : false
}'

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

But it matches phzee since I lowercased the term:

$ curl -XGET 'http://localhost:9200/fizzbuzz/person/_search?pretty=true' -d'
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "term" : {
      "sn" : "phzee"
    }
  },
  "version" : true,
  "explain" : false
}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.6931472,
    "hits" : [ {
      "_index" : "fizzbuzz",
      "_type" : "person",
      "_id" : "7214560012",
      "_version" : 1,
      "_score" : 1.6931472, "_source" : { "telno" : "7214560012", "gn" : "Aurelio", "sn" : "Phzee", "o" : "Philadelphia Fizzies" }
    } ]
  }
}
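
The two term results above are what one would see if a term query compared
its text against the indexed tokens verbatim, with no analysis of the query
text. A minimal Java sketch of that lookup model follows. It is only an
illustration: the class TermLookupSketch and its methods are hypothetical
names, not part of Elasticsearch, and index-time analysis is reduced to
lowercasing (tokenizing and stemming are left out).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index, not an Elasticsearch class.
final class TermLookupSketch {
    // token -> ids of the documents that contain it
    private final Map<String, List<String>> postings = new HashMap<>();

    // Index-time analysis, reduced here to lowercasing the whole value.
    void index(String docId, String value) {
        String token = value.toLowerCase(Locale.ROOT);
        postings.computeIfAbsent(token, k -> new ArrayList<>()).add(docId);
    }

    // Term lookup: the query text is used exactly as given, with no analysis.
    List<String> term(String queryText) {
        return postings.getOrDefault(queryText, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        TermLookupSketch idx = new TermLookupSketch();
        idx.index("7214560012", "Phzee");
        System.out.println(idx.term("PHZEE")); // prints []
        System.out.println(idx.term("phzee")); // prints [7214560012]
    }
}
```

In this model the token stored for "Phzee" is "phzee", so only the lowercased
query text finds it.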

Also, the following prefix query does not work. It's as if the prefix PHZ
is not being analyzed:

$ curl -XGET 'http://localhost:9200/fizzbuzz/person/_search?pretty=true' -d'
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "prefix" : {
      "sn" : "PHZ"
    }
  },
  "version" : true,
  "explain" : false
}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

So I analyze it myself using the highly custom analyzer known as the Java
String.toLowerCase method, and now it works fine:

$ curl -XGET 'http://localhost:9200/fizzbuzz/person/_search?pretty=true' -d'
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "prefix" : {
      "sn" : "phz"
    }
  },
  "version" : true,
  "explain" : false
}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "fizzbuzz",
      "_type" : "person",
      "_id" : "7214560012",
      "_version" : 1,
      "_score" : 1.0, "_source" : { "telno" : "7214560012", "gn" : "Aurelio", "sn" : "Phzee", "o" : "Philadelphia Fizzies" }
    } ]
  }
}
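
For reference, the client-side lowercasing described above might look like
the following Java sketch. The class PrefixQuerySketch and its methods are
hypothetical helpers, not part of the Elasticsearch client. The sketch
assumes that String.toLowerCase with a fixed locale is a close enough
stand-in for the "lowercase" token filter on this data, and it does no JSON
escaping, so it is only suitable for simple field names and prefixes.

```java
import java.util.Locale;

// Hypothetical helper, not part of the Elasticsearch client.
final class PrefixQuerySketch {

    // Lowercase with a fixed locale so the result does not depend on the
    // JVM's default locale.
    static String normalize(String input) {
        return input.toLowerCase(Locale.ROOT);
    }

    // Build the JSON body of a prefix query from raw user input.
    // No JSON escaping is performed (assumption: plain alphanumeric input).
    static String prefixQueryJson(String field, String rawPrefix) {
        return "{ \"query\" : { \"prefix\" : { \"" + field + "\" : \""
                + normalize(rawPrefix) + "\" } } }";
    }

    public static void main(String[] args) {
        // prints { "query" : { "prefix" : { "sn" : "phz" } } }
        System.out.println(prefixQueryJson("sn", "PHZ"));
    }
}
```

Lowercasing alone only mirrors the lowercase filter; it does not reproduce
stemming, so it fits the gn and telno fields better than sn and o.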

I would expect that with snowball stemming, since Debbies is indexed as
debbi, the prefix "debbie" is not found, and I've verified this to be true.
And since I always use the lowercase token filter, my use of
String.toLowerCase is working as well as can be expected with stemming.
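
The stemming effect can be illustrated with plain string prefix checks. The
_analyze output earlier shows "Debbie" stemmed to "debbi" on the "o" field;
the Java sketch below treats a prefix query as a startsWith test against
that indexed token. The class and method names are hypothetical, and the
token list holds only the one stem confirmed by that output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not an Elasticsearch class.
final class PrefixVsStemSketch {

    // True if any indexed token starts with the given prefix.
    static boolean prefixMatches(List<String> indexedTokens, String prefix) {
        for (String token : indexedTokens) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "debbi" is the stemmed token reported by _analyze for "Debbie".
        List<String> indexed = Arrays.asList("debbi");
        System.out.println(prefixMatches(indexed, "debbie")); // prints false
        System.out.println(prefixMatches(indexed, "deb"));    // prints true
    }
}
```

A prefix longer than the stem can never match, which is why lowercasing is
the most a client can do for stemmed fields without running the full
analyzer.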

But the gn field is not stemmed, and the following prefix also returns no
hits:

$ curl -XGET 'http://localhost:9200/fizzbuzz/person/_search?pretty=true' -d'
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "prefix" : {
      "gn" : "CHU"
    }
  },
  "version" : true,
  "explain" : false
}'

Did I miss something in the documentation, or are TermQueryBuilder and
PrefixQueryBuilder never analyzed by default, so that the caller must always
apply some form of explicit analysis?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

brian_yoder · March 18, 2013, 6:21pm

OK, I missed something. I stumbled across
http://stackoverflow.com/questions/11657505/defining-analyzer-while-querying-in-elasticsearch
and found the answer there.

