Synonym filter results in term facet


(ravi063) #1

Hi All,

I have a requirement to find distinct company names. I was
using the "keyword" tokenizer for that field, and with a terms facet I was able
to get distinct company names. However, the terms facet treated company names
like "ibm suisse", "ibm corporation", and "ibm" as different companies.
The online documentation suggested using a synonym filter to solve this. My
settings are:

curl -XPUT 'http://localhost:9200/dataindex/' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "customAnalyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase", "synonym"]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "tokenizer": "keyword",
            "synonyms_path": "analysis/synonym.txt"
          }
        }
      }
    }
  }
}'

My mapping is:

curl -XPUT 'http://localhost:9200/dataindex/tweet/_mapping' -d '{
  "tweet": {
    "properties": {
      "company": {
        "type": "string",
        "analyzer": "customAnalyzer"
      }
    }
  }
}'

In the synonym.txt file I have:

ibm suisse, ibm corporation, ibm business, ibm => ibm corp ltd

Indexed data:

curl -XPUT 'http://localhost:9200/dataindex/tweet/1' -d '{
  "company": "ibm"
}'
curl -XPUT 'http://localhost:9200/dataindex/tweet/2' -d '{
  "company": "ibm corporation"
}'
curl -XPUT 'http://localhost:9200/dataindex/tweet/3' -d '{
  "company": "ibm suisse"
}'
curl -XPUT 'http://localhost:9200/dataindex/tweet/4' -d '{
  "company": "ibm business"
}'

If I run a terms facet:
{
  "facets": {
    "loc_facet": {
      "terms": {
        "field": "company"
      }
    }
  }
}
I get 3 terms, i.e. {term: ibm corp ltd, count: 3}, {term: suisse, count: 1},
{term: corporation, count: 1}.
I want the facet to return only one term: ibm corp ltd with count=3.
That way I would get distinct company names and also map synonymous names to a
single company name.
Please correct me if I am using the wrong tokenizer or if my approach is not
correct.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1ba32926-7015-4b8a-89ae-bf43a2561b71%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(vineeth mohan-2) #2

Hello Ravi,

Your approach is wrong.
When you use a synonym filter, it indexes all synonyms of that token, hence
any synonym will match against that term.
So when you run a facet, you will get an aggregation of all synonyms rather
than just one.
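
A rough way to see where facet terms such as suisse and corporation come from is to replay the analysis chain from the question by hand. The sketch below is plain Python, not Elasticsearch code; the function name analyze and the two constants are hypothetical. It assumes the whitespace tokenizer runs first, and that the rule entries are each kept as one token because the filter's own tokenizer is "keyword".

```python
# Plain-Python simulation of the analysis chain; this is not Elasticsearch code.
# The rule mirrors the synonym.txt line from the original post.
RULE_LEFT = ["ibm suisse", "ibm corporation", "ibm business", "ibm"]
RULE_RIGHT = "ibm corp ltd"


def analyze(value):
    """Hypothetical stand-in for customAnalyzer: whitespace, lowercase, synonym."""
    tokens = value.split()                # whitespace tokenizer
    tokens = [t.lower() for t in tokens]  # lowercase filter
    # Each rule entry is a single token, so a multi-word entry can never
    # equal one of the whitespace-split tokens; only "ibm" can match.
    return [RULE_RIGHT if t in RULE_LEFT else t for t in tokens]


for name in ["ibm", "ibm corporation", "ibm suisse"]:
    print(name, "->", analyze(name))
```

Under those assumptions, a terms facet on the field counts each emitted token separately, which is why the second word of each name shows up as its own term.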

A better approach would be to store the unique name in a separate field and
run the facet on that field.
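
A minimal sketch of that idea, assuming the canonical name is resolved on the client before indexing. The field name company_canonical, the CANONICAL table, and the helper with_canonical are all hypothetical; the table could just as well be generated from synonym.txt.

```python
import json
from collections import Counter

# Hypothetical alias table mapping each known variant to one canonical name.
CANONICAL = {
    "ibm": "ibm corp ltd",
    "ibm suisse": "ibm corp ltd",
    "ibm corporation": "ibm corp ltd",
    "ibm business": "ibm corp ltd",
}


def with_canonical(doc):
    """Return a copy of doc with an extra company_canonical field (assumed name)."""
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(doc["company"].lower().split())
    enriched = dict(doc)
    # Unknown names fall back to the normalized original value.
    enriched["company_canonical"] = CANONICAL.get(key, key)
    return enriched


names = ["ibm", "ibm corporation", "ibm suisse", "ibm business"]
enriched_docs = [with_canonical({"company": n}) for n in names]

# Each line printed here would be the body of one indexing request.
for d in enriched_docs:
    print(json.dumps(d))

# Local stand-in for a terms facet on the new field.
facet = Counter(d["company_canonical"] for d in enriched_docs)
```

The new field would need to be indexed as a single term (for example with "index": "not_analyzed" in the mapping, or an analyzer built on the keyword tokenizer) so that the facet sees the whole name rather than its individual words.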

Thanks
Vineeth

On Mon, Jul 21, 2014 at 11:21 PM, ravi063@gmail.com wrote:



