Hi,
I've been trying to use the unique token filter in my analyzer, but
duplicate tokens still seem to be counted in scoring.
Here's a simple example of the problem:
curl -XPUT 'http://localhost:9200/people_search2/?pretty=1' -d '
{
  "mappings" : {
    "person" : {
      "properties" : {
        "_id" : {
          "type" : "integer",
          "index" : "no"
        },
        "names" : {
          "type" : "string",
          "store":true,
          "omit_norms":true,
          "analyzer":"partial_name",
          "index_analyzer":"partial_name"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "filter" : {
        "name_ngrams" : {
          "side" : "front",
          "max_gram" : 25,
          "min_gram" : 2,
          "type" : "edgeNGram"
        },
        "unique_token_filter": {
          "type":"unique",
          "only_on_same_position":false
        }
      },
      "analyzer" : {
        "partial_name" : {
          "filter" : [
            "standard",
            "lowercase",
            "asciifolding",
            "name_ngrams",
            "unique_token_filter"
          ],
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'
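To check what the analyzer emits, a request roughly like the one below can be used. This is only a sketch: the ES_HOST and ANALYZE_URL variables, the sample text, and the query-string form of the request (as accepted by older Elasticsearch releases) are assumptions for illustration, not part of the setup above.

```shell
# Assumed host; override ES_HOST to point at a different node.
ES_HOST="${ES_HOST:-http://localhost:9200}"
# Query-string form of the _analyze API; the analyzer name comes from the index settings above.
ANALYZE_URL="$ES_HOST/people_search2/_analyze?analyzer=partial_name&pretty=1"
# Illustrative sample text: it repeats "Tim" and "Cook" inside a single string.
curl -s --max-time 5 -XGET "$ANALYZE_URL" -d 'Tim Cook Tim Cook Consulting' \
  || echo "request failed (is Elasticsearch running at $ES_HOST?)"
```

A request like this shows the token stream for one input string at a time.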
Using the _analyze API, I was able to verify that the partial_name
analyzer only returns unique tokens. However, if I load in this test data:
curl -XPOST 'http://ec2-54-224-89-81.compute-1.amazonaws.com:9200/_bulk?pretty=1' -d '
{"index" : {"_index" : "people_search2", "_type" : "person", "_id" : 1}}
{"_id" : 1, "names":["Tim","Cook","Apple", "Google"]}
{"index" : {"_index" : "people_search2", "_type" : "person", "_id" : 2}}
{"_id" : 2,"names":["Tim","Cook","Tim Cook Consulting","Random Co"]}
{"index" : {"_index" : "people_search2", "_type" : "person", "_id" : 3}}
{"_id" : 3, "names":["John","Smith","Standard","ABC inc"]}
'
And then issue this query:
curl -XPOST 'http://ec2-54-224-89-81.compute-1.amazonaws.com:9200/people_search2/person/_search?search_type=dfs_query_then_fetch' -d '
{
  "explain": true,
  "query": {
    "match": {
      "names": {
        "query": "Tim Cook"
      }
    }
  }
}
'
It ranks the doc with _id=2 higher because "Tim" and "Cook" each appear
twice in that document (explain shows termFreq=2), even though the unique
filter is in the analyzer chain. Why is this happening?
Thanks!