I'm trying to convert a field that currently uses the edge_ngram tokenizer to use the edge_ngram token filter instead, but tokens created by the edge_ngram filter are not scoring the same as identical tokens created by the edge_ngram tokenizer. Here's what I'm seeing. First, an example index with just two fields, one using the edge_ngram tokenizer and one using the edge_ngram filter:
DELETE /test1
PUT /test1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_tokenizer": {
          "tokenizer": "my_edge_ngram_tokenizer"
        },
        "edge_filter": {
          "tokenizer": "standard",
          "filter": ["my_edge_ngram_filter"]
        }
      },
      "filter": {
        "my_edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "tokenizer": {
        "my_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tok": {
        "type": "text",
        "similarity": "boolean",
        "analyzer": "edge_tokenizer"
      },
      "filt": {
        "type": "text",
        "similarity": "boolean",
        "analyzer": "edge_filter"
      }
    }
  }
}
PUT /test1/_doc/1
{
  "tok": "foobar",
  "filt": "foobar"
}
PUT /test1/_doc/2
{
  "tok": "flub",
  "filt": "flub"
}
Now if I run a search against the "tok" field, I see the results I expect: the doc where all four query tokens match ("flub") scores 4.0, and the doc where only the one-character token "f" matches ("foobar") scores 1.0:
GET /test1/_search
{
  "query": { "match": { "tok": "flub" }}
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 4.0,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 4.0,
        "_source" : {
          "tok" : "flub",
          "filt" : "flub"
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "tok" : "foobar",
          "filt" : "foobar"
        }
      }
    ]
  }
}
However, if I run the same search against the "filt" field, both documents score 1.0:
GET /test1/_search
{
  "query": { "match": { "filt": "flub" }}
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "tok" : "foobar",
          "filt" : "foobar"
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "tok" : "flub",
          "filt" : "flub"
        }
      }
    ]
  }
}
which I don't understand at all. Both analyzers generate the same tokens from the same input, just with different types and positions:
GET /test1/_analyze
{
  "analyzer": "edge_tokenizer",
  "text": "flub"
}
Response:
{
  "tokens" : [
    {
      "token" : "f",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "fl",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "flu",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "flub",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    }
  ]
}
GET /test1/_analyze
{
  "analyzer": "edge_filter",
  "text": "flub"
}
Response:
{
  "tokens" : [
    {
      "token" : "f",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "fl",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "flu",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "flub",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
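For completeness, I know _analyze also accepts an "explain": true flag which, if I'm reading the docs right, reports the lower-level token attributes (such as positionLength) at each step of the analysis chain. I'm not sure which of those attributes matter for scoring, but this is the request I'd use to dig deeper:
GET /test1/_analyze
{
  "analyzer": "edge_filter",
  "text": "flub",
  "explain": true
}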
I don't really understand how the type and position affect the scoring. I thought scoring just took each token from the query and looked for a matching token in the index, adding 1 to the score for each match (with boolean similarity). Every Google hit I can find says the edge_ngram tokenizer and filter do the same thing, just at different points in the analysis pipeline.
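If it helps with diagnosing this, I believe the explain endpoint (GET /<index>/_explain/<id>) will show exactly which term queries contribute to a document's score; this is the request I'd run against the "filt" field:
GET /test1/_explain/2
{
  "query": { "match": { "filt": "flub" }}
}
Can someone explain to me what's going on here? Thanks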