Analyzer change in behaviour in 0.16 - bug? feature?


(Clinton Gormley) #1

Hiya

There has been a change in behaviour in how analyzers are applied
between 0.15.2 and 0.16.0.

For instance, in 0.15.2 an edge-ngram analyzer was applied at
index time, but not at search time.

In 0.16.0, it is also applied at search time.

For instance, I index a field containing "london" (with the edge-ngram
analyzer). I expect a search for 'lon' to match the doc, but not a
search for 'londres'. This was the case in 0.15.2.

However, because the query term is now passed through the same
analyzer, the query actually searches for the terms
"l","lo","lon","lond","londr"...etc, so this search DOES match in 0.16

It can easily be worked around by using different analyzers at index
and search time, but I'm not sure that this is the correct behaviour.
It makes sense for (eg) snowball analyzers, but for ngrams?
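A minimal sketch of that workaround, using the per-field index_analyzer / search_analyzer mapping options (the 'edge_ngram' analyzer is the one from the recreation below; 'standard' at search time is an assumption — any non-ngram analyzer would do):

```shell
# Workaround sketch: edge-ngram the field at index time only,
# and analyze the query with a plain analyzer at search time.
# Assumes the 'foo' index and its analysis settings from the
# recreation below already exist.
curl -XPUT 'http://127.0.0.1:9200/foo/bar/_mapping?pretty=1' -d '
{
  "bar" : {
    "properties" : {
      "tokens" : {
        "type" : "string",
        "index_analyzer" : "edge_ngram",
        "search_analyzer" : "standard"
      }
    }
  }
}
'
```

With this mapping, 'lon' still matches (it is a literal indexed ngram) while 'londres' does not, since the query string is no longer expanded into its edges.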

What do you think?

clint

curl -XPUT 'http://127.0.0.1:9200/foo/?pretty=1' -d '
{
  "mappings" : {
    "bar" : {
      "properties" : {
        "tokens" : {
          "type" : "string",
          "analyzer" : "edge_ngram"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "filter" : {
        "edge_ngram" : {
          "side" : "front",
          "max_gram" : 20,
          "min_gram" : 1,
          "type" : "edgeNGram"
        }
      },
      "analyzer" : {
        "edge_ngram" : {
          "filter" : [ "standard", "lowercase", "edge_ngram" ],
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'

curl -XPOST 'http://127.0.0.1:9200/foo/bar?pretty=1&refresh=true' -d '
{
  "tokens" : "london"
}
'

curl -XGET 'http://127.0.0.1:9200/foo/bar/_search?pretty=1' -d '
{
  "query" : {
    "field" : {
      "tokens" : "londres"
    }
  }
}
'

RESULT IN 0.15.2:

{
  "hits" : {
    "hits" : [],
    "max_score" : null,
    "total" : 0
  },
  "timed_out" : false,
  "_shards" : {
    "failed" : 0,
    "successful" : 5,
    "total" : 5
  },
  "took" : 3
}

RESULT IN 0.16.0:

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "tokens" : "london"
        },
        "_score" : 0.043920923,
        "_index" : "foo",
        "_id" : "M0aZ9of6Q7e1gEOe7syQvA",
        "_type" : "bar"
      }
    ],
    "max_score" : 0.043920923,
    "total" : 1
  },
  "timed_out" : false,
  "_shards" : {
    "failed" : 0,
    "successful" : 5,
    "total" : 5
  },
  "took" : 51
}


(Shay Banon) #2

I need to verify and track down what you are seeing, but when an analyzer is set in the mappings, it is applied at both index and search time (for query_string / field queries). That is the intended behaviour, regardless of the type of analyzer you use.
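For contrast, query types that do not analyze their input are unaffected by this. A minimal sketch using a term query, assuming the 'foo' index from the recreation above:

```shell
# The term query is not analyzed, so it looks up the literal term
# 'londres' in the index. Since only the edge ngrams of 'london'
# were ever indexed, this should match nothing, even in 0.16.
# Assumes the 'foo' index from the recreation above.
curl -XGET 'http://127.0.0.1:9200/foo/bar/_search?pretty=1' -d '
{
  "query" : {
    "term" : {
      "tokens" : "londres"
    }
  }
}
'
```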
On Tuesday, April 26, 2011 at 6:04 PM, Clinton Gormley wrote:



(system) #3