Unique tokenfilter issues?

mtthrok · June 26, 2013, 5:36pm

Hi,

I've been trying to use the unique tokenfilter in my analyzer, but it seems
to continue using duplicate tokens in scoring.

Here's a simple example of the problem:
curl -XPUT 'http://localhost:9200/people_search2/?pretty=1' -d '
{
  "mappings" : {
    "person" : {
      "properties" : {
        "_id" : {
          "type" : "integer",
          "index" : "no"
        },
        "names" : {
          "type" : "string",
          "store" : true,
          "omit_norms" : true,
          "analyzer" : "partial_name",
          "index_analyzer" : "partial_name"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "filter" : {
        "name_ngrams" : {
          "side" : "front",
          "max_gram" : 25,
          "min_gram" : 2,
          "type" : "edgeNGram"
        },
        "unique_token_filter" : {
          "type" : "unique",
          "only_on_same_position" : false
        }
      },
      "analyzer" : {
        "partial_name" : {
          "filter" : [
            "standard",
            "lowercase",
            "asciifolding",
            "name_ngrams",
            "unique_token_filter"
          ],
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'

Using the _analyze API, I was able to verify that the partial_name
analyzer only returns unique tokens. However, if I load in this test data:
curl -XPOST 'http://ec2-54-224-89-81.compute-1.amazonaws.com:9200/_bulk?pretty=1' -d '
{"index" : {"_index" : "people_search2", "_type" : "person", "_id" : 1}}
{"_id" : 1, "names":["Tim","Cook","Apple", "Google"]}
{"index" : {"_index" : "people_search2", "_type" : "person", "_id" : 2}}
{"_id" : 2,"names":["Tim","Cook","Tim Cook Consulting","Random Co"]}
{"index" : {"_index" : "people_search2", "_type" : "person", "_id" : 3}}
{"_id" : 3, "names":["John","Smith","Standard","ABC inc"]}
'

And then issue this query:
curl -XPOST 'http://ec2-54-224-89-81.compute-1.amazonaws.com:9200/people_search2/person/_search?search_type=dfs_query_then_fetch' -d '
{
  "explain": true,
  "query": {
    "match": {
      "names": {
        "query": "Tim Cook"
      }
    }
  }
}
'

It will rank the doc with _id=2 higher because it has Tim Cook twice in the
document (explain shows a termFreq=2). Why is this happening?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ivan · June 26, 2013, 5:57pm

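Why not simply omit term frequencies for that field (for example, by
setting "index_options" to "docs" in the mapping, assuming your version
supports it)?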

What I think might be happening is that the analyzer is applied for each
instance of the names field, so you will have two names fields with "tim".
You are storing the field, so it is easy to verify.

http://ec2-54-224-89-81.compute-1.amazonaws.com:9200/people_search2/person/_search/2?fields=names
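
To make the suspected behaviour concrete, here is a rough Python model of it. This is only a sketch and not Elasticsearch code: the function names are made up for illustration, whitespace splitting stands in for the standard tokenizer, and asciifolding is left out. It assumes, as guessed above, that each element of the names array is analyzed on its own, so the unique filter can only remove duplicates inside a single value.

```python
from collections import Counter


def edge_ngrams(token, min_gram=2, max_gram=25):
    """Front edge n-grams of one token, mimicking the name_ngrams filter."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze_value(value):
    """Hypothetical stand-in for partial_name applied to ONE field value."""
    seen = []
    for word in value.lower().split():
        for gram in edge_ngrams(word):
            # unique filter: drops repeats, but only within this value
            if gram not in seen:
                seen.append(gram)
    return seen


def term_frequencies(values):
    """Analyze each array element separately, then pool the tokens."""
    counts = Counter()
    for value in values:
        counts.update(analyze_value(value))
    return counts


doc1 = term_frequencies(["Tim", "Cook", "Apple", "Google"])
doc2 = term_frequencies(["Tim", "Cook", "Tim Cook Consulting", "Random Co"])
print(doc1["tim"], doc2["tim"])  # prints: 1 2
```

Under that assumption "tim" is counted once for "Tim" and once more for "Tim Cook Consulting", which would match the termFreq=2 shown by explain for _id=2.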

--
Ivan

