Unique tokenfilter issues?

Hi,

I've been trying to use the unique tokenfilter in my analyzer, but duplicate
tokens still seem to count toward scoring.

Here's a simple example of the problem:
curl -XPUT 'http://localhost:9200/people_search2/?pretty=1' -d '
{
  "mappings" : {
    "person" : {
      "properties" : {
        "_id" : {
          "type" : "integer",
          "index" : "no"
        },
        "names" : {
          "type" : "string",
          "store" : true,
          "omit_norms" : true,
          "analyzer" : "partial_name",
          "index_analyzer" : "partial_name"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "filter" : {
        "name_ngrams" : {
          "side" : "front",
          "max_gram" : 25,
          "min_gram" : 2,
          "type" : "edgeNGram"
        },
        "unique_token_filter" : {
          "type" : "unique",
          "only_on_same_position" : false
        }
      },
      "analyzer" : {
        "partial_name" : {
          "filter" : [
            "standard",
            "lowercase",
            "asciifolding",
            "name_ngrams",
            "unique_token_filter"
          ],
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'

Using the _analyze API, I was able to verify that the partial_name
analyzer only returns unique tokens. However, if I load in this test data:
curl -XPOST 'http://ec2-54-224-89-81.compute-1.amazonaws.com:9200/_bulk?pretty=1' -d '
{"index" : {"_index" : "people_search2", "_type" : "person", "_id" : 1}}
{"_id" : 1, "names":["Tim","Cook","Apple", "Google"]}
{"index" : {"_index" : "people_search2", "_type" : "person", "_id" : 2}}
{"_id" : 2,"names":["Tim","Cook","Tim Cook Consulting","Random Co"]}
{"index" : {"_index" : "people_search2", "_type" : "person", "_id" : 3}}
{"_id" : 3, "names":["John","Smith","Standard","ABC inc"]}
'

And then issue this query:
curl -XPOST 'http://ec2-54-224-89-81.compute-1.amazonaws.com:9200/people_search2/person/_search?search_type=dfs_query_then_fetch' -d '
{
  "explain" : true,
  "query" : {
    "match" : {
      "names" : {
        "query" : "Tim Cook"
      }
    }
  }
}
'

The doc with _id=2 ranks higher because "Tim Cook" appears twice in that
document (explain shows termFreq=2), even though the analyzer is supposed to
emit only unique tokens. Why is this happening?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Why not simply omit term frequencies for that field?
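
As a rough sketch of what that could look like: the snippet below builds a variant of the names mapping with "index_options" set to "docs", which tells Lucene to index document IDs only for the field. This assumes your Elasticsearch version supports index_options on string fields, and the variable names are just for illustration. Note that "docs" also drops positions, so phrase queries on the field would stop working.

```python
import json

# Hypothetical variant of the "names" mapping from the original post.
# "index_options": "docs" indexes document IDs only, so term frequencies
# (and positions) are not recorded for this field.
names_mapping = {
    "type": "string",
    "store": True,
    "omit_norms": True,
    "analyzer": "partial_name",
    "index_options": "docs",
}

mapping_body = {"mappings": {"person": {"properties": {"names": names_mapping}}}}

# Print JSON that could be merged into the index-creation request body.
print(json.dumps(mapping_body, indent=2))
```

With frequencies omitted, every matching document is scored as if the term occurred once, so the repeated "tim" in doc 2 should no longer give it an edge.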

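What I think might be happening is that the analyzer is applied separately to
each value of the names field, so the unique filter only removes duplicates
within a single value. Doc 2 has "tim" in two values ("Tim" and "Tim Cook
Consulting"), so the term still ends up in the field twice.
You are storing the field, so it is easy to verify.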

http://ec2-54-224-89-81.compute-1.amazonaws.com:9200/people_search2/person/_search/2?fields=names
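
To illustrate the per-value behaviour, here is a small Python simulation. It is only a rough stand-in for the partial_name analyzer (a regex tokenizer instead of the standard tokenizer, no asciifolding), and the function names are made up for this example. It just shows how de-duplicating within each value still leaves a term frequency of 2 across the whole field.

```python
import re

def edge_ngrams(token, min_gram=2, max_gram=25):
    """Front edge n-grams, mirroring the name_ngrams filter settings."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def analyze_value(value):
    """Rough stand-in for partial_name: tokenize, lowercase, n-gram, then unique."""
    seen, out = set(), []
    for word in re.findall(r"\w+", value.lower()):
        for gram in edge_ngrams(word):
            if gram not in seen:  # unique filter: dedupe within this one value only
                seen.add(gram)
                out.append(gram)
    return out

def term_freq(values, term):
    """Each value of a multi-valued field is analyzed on its own, then counted together."""
    return sum(analyze_value(v).count(term) for v in values)

doc1 = ["Tim", "Cook", "Apple", "Google"]
doc2 = ["Tim", "Cook", "Tim Cook Consulting", "Random Co"]

print(term_freq(doc1, "tim"), term_freq(doc2, "tim"))  # prints: 1 2
```

Within "Tim Cook Consulting" the duplicate "co" gram is removed, but nothing removes the second "tim" that comes from a different value of the same field.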

--
Ivan

In reply to mtthrok@gmail.com's message of Wed, Jun 26, 2013 at 10:36 AM.