Analyzer change in behaviour in 0.16 - bug? feature?


(Clinton Gormley) #1

Hiya

There has been a change in behaviour in how analyzers are applied
between 0.15.2 and 0.16.0.

For instance, in 0.15.2 an edge-ngram analyzer was applied at
index time, but not at search time.

In 0.16.0, it is also applied at search time.

For instance, I index a field containing "london" (with the edge-ngram
analyzer). I expect a search for 'lon' to match the doc, but not a
search for 'londres'. This was the case in 0.15.2.

However, because the query term is now passed through the same
analyzer, the query actually searches for the terms
"l","lo","lon","lond","londr"...etc, so this search DOES match in 0.16

It can easily be worked around by using different analyzers at index
and search time, but I'm not sure that this is the correct behaviour.
It makes sense for (eg) snowball analyzers, but for ngrams?
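A minimal sketch of that workaround, using the per-field index_analyzer / search_analyzer mapping options (the 'edge_ngram' analyzer is the one from the recreation below; 'standard' at search time is an assumption — any non-ngram analyzer would do):

```shell
# Workaround sketch: edge-ngram the field at index time only,
# and analyze the query with a plain analyzer at search time.
# Assumes the 'foo' index and its analysis settings from the
# recreation below already exist.
curl -XPUT 'http://127.0.0.1:9200/foo/bar/_mapping?pretty=1' -d '
{
  "bar" : {
    "properties" : {
      "tokens" : {
        "type" : "string",
        "index_analyzer" : "edge_ngram",
        "search_analyzer" : "standard"
      }
    }
  }
}
'
```

With this mapping, 'lon' still matches (it is a literal indexed ngram) while 'londres' does not, since the query string is no longer expanded into its edges.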

What do you think?

clint

curl -XPUT 'http://127.0.0.1:9200/foo/?pretty=1' -d '
{
  "mappings" : {
    "bar" : {
      "properties" : {
        "tokens" : {
          "type" : "string",
          "analyzer" : "edge_ngram"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "filter" : {
        "edge_ngram" : {
          "side" : "front",
          "max_gram" : 20,
          "min_gram" : 1,
          "type" : "edgeNGram"
        }
      },
      "analyzer" : {
        "edge_ngram" : {
          "filter" : [ "standard", "lowercase", "edge_ngram" ],
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'

curl -XPOST 'http://127.0.0.1:9200/foo/bar?pretty=1&refresh=true' -d '
{
  "tokens" : "london"
}
'

curl -XGET 'http://127.0.0.1:9200/foo/bar/_search?pretty=1' -d '
{
  "query" : {
    "field" : {
      "tokens" : "londres"
    }
  }
}
'

RESULT IN 0.15.2:

{
  "hits" : {
    "hits" : [],
    "max_score" : null,
    "total" : 0
  },
  "timed_out" : false,
  "_shards" : {
    "failed" : 0,
    "successful" : 5,
    "total" : 5
  },
  "took" : 3
}

RESULT IN 0.16.0:

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "tokens" : "london"
        },
        "_score" : 0.043920923,
        "_index" : "foo",
        "_id" : "M0aZ9of6Q7e1gEOe7syQvA",
        "_type" : "bar"
      }
    ],
    "max_score" : 0.043920923,
    "total" : 1
  },
  "timed_out" : false,
  "_shards" : {
    "failed" : 0,
    "successful" : 5,
    "total" : 5
  },
  "took" : 51
}


(Shay Banon) #2

I need to verify and track down what you are seeing, but when an analyzer is set in the mappings, it is applied at both index and search time (for query_string / field queries). That is the intended behaviour, regardless of the type of analyzer you use.
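For contrast, query types that do not analyze their input are unaffected by this. A minimal sketch using a term query, assuming the 'foo' index from the recreation above:

```shell
# The term query is not analyzed, so it looks up the literal term
# 'londres' in the index. Since only the edge ngrams of 'london'
# were ever indexed, this should match nothing, even in 0.16.
# Assumes the 'foo' index from the recreation above.
curl -XGET 'http://127.0.0.1:9200/foo/bar/_search?pretty=1' -d '
{
  "query" : {
    "term" : {
      "tokens" : "londres"
    }
  }
}
'
```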
On Tuesday, April 26, 2011 at 6:04 PM, Clinton Gormley wrote:



(system) #3