Synonym filter behavior for single-word vs. multi-word input


(Bernhardt Scherer) #1

Hello there,

I am currently trying out the synonym filter. Here are my settings:

"settings": {
  "analysis": {
    "filter": {
      "nGram_filter": {
        "type": "nGram",
        "min_gram": 2,
        "max_gram": 15
      },
      "synonym": {
        "type": "synonym",
        "synonyms_path": "analysis/synonym.txt",
        "ignore_case": true,
        "expand": true
      }
    },
    "analyzer": {
      "synonym_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": [
          "synonym",
          "lowercase",
          "asciifolding"
        ]
      }
    }
  }
}

In the synonym.txt file I have the following line:

Inbus, Innensechskant, Imbus
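
For readers unfamiliar with the rule format: a comma-separated line like this declares a group of equivalent terms, and with "expand": true each term in the group is expected to map to every term in the group. Below is a minimal Python sketch of that expansion. The helper name build_synonym_map is hypothetical and this is an illustration of the idea, not Elasticsearch's own implementation.

```python
def build_synonym_map(rules, expand=True, ignore_case=True):
    """Turn comma-separated equivalence rules into a lookup table."""
    table = {}
    for rule in rules:
        terms = [t.strip() for t in rule.split(",") if t.strip()]
        if ignore_case:
            terms = [t.lower() for t in terms]
        # expand=True: every term maps to the whole group.
        # expand=False: every term maps to the first term only.
        targets = terms if expand else terms[:1]
        for term in terms:
            table[term] = list(targets)
    return table


synonym_map = build_synonym_map(["Inbus, Innensechskant, Imbus"])
print(synonym_map["imbus"])  # ['inbus', 'innensechskant', 'imbus']
```

With expand=False the same lookup would return only the first term of the group.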

When I test the analyzer, I get the following results (output is
reformatted for ease of reading). It seems the synonym filter only does
its job when the synonym word is surrounded by other words:

localhost:9200/index_v1/_analyze?analyzer=synonym_analyzer
--input: 'Inbus'
--output: inbus

localhost:9200/index_v1/_analyze?analyzer=synonym_analyzer
--input: 'Der Inbus'
--output: 'der inbus'

localhost:9200/index_v1/_analyze?analyzer=synonym_analyzer
--input: 'Der Inbus ist'
--output: 'der inbus innensechskant imbus ist'

Could anyone please explain why it behaves like this and how to implement
this correctly?

Big thanks in advance!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c843106e-be78-4d4c-9101-2c73720b4062%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(ElasticSearch Users mailing list) #2

It seems to work fine for me (ES 1.2). Can you please post a full,
reproducible sequence of commands that I can execute to try?



(ElasticSearch Users mailing list) #3

What I mean by working:

Input: Inbus
Output:

{
  "tokens" : [ {
    "token" : "inbus",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "innensechskant",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "imbus",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "SYNONYM",
    "position" : 1
  } ]
}

Input: Der Inbus
Output:

{
  "tokens" : [ {
    "token" : "der",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "inbus",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "SYNONYM",
    "position" : 2
  }, {
    "token" : "innensechskant",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "SYNONYM",
    "position" : 2
  }, {
    "token" : "imbus",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "SYNONYM",
    "position" : 2
  } ]
}

Input: Der Inbus ist
Output:

{
  "tokens" : [ {
    "token" : "der",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "inbus",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "SYNONYM",
    "position" : 2
  }, {
    "token" : "innensechskant",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "SYNONYM",
    "position" : 2
  }, {
    "token" : "imbus",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "SYNONYM",
    "position" : 2
  }, {
    "token" : "ist",
    "start_offset" : 10,
    "end_offset" : 13,
    "type" : "word",
    "position" : 3
  } ]
}
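
Tokens that share a position are alternatives for the same word, which is how the synonym expansion shows up in these responses. To read such output more quickly, a small Python helper along these lines could collapse a response into one line per position. The name tokens_by_position is hypothetical, and the sample is the "Der Inbus" response from above.

```python
import json


def tokens_by_position(analyze_response):
    """Group token strings by their position, keeping response order."""
    grouped = {}
    for tok in analyze_response["tokens"]:
        grouped.setdefault(tok["position"], []).append(tok["token"])
    return grouped


response = json.loads("""
{"tokens": [
  {"token": "der", "start_offset": 0, "end_offset": 3, "type": "word", "position": 1},
  {"token": "inbus", "start_offset": 4, "end_offset": 9, "type": "SYNONYM", "position": 2},
  {"token": "innensechskant", "start_offset": 4, "end_offset": 9, "type": "SYNONYM", "position": 2},
  {"token": "imbus", "start_offset": 4, "end_offset": 9, "type": "SYNONYM", "position": 2}
]}
""")

for position, terms in tokens_by_position(response).items():
    print(position, " | ".join(terms))
# 1 der
# 2 inbus | innensechskant | imbus
```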



(Bernhardt Scherer) #5

Hey Binh,

thanks for your reply!

I tried the following:

POST localhost:9200/index_v2/

{
  "settings": {
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms": [
            "Schraubenzieher, Schraubendreher",
            "Inbus, Innensechskant, Imbus, Innen-6-Kant",
            "Innensechskantschlüssel, Inbusschlüssel",
            "Bauhelm, Schutzhelm"
          ],
          "ignore_case": true,
          "expand": true
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "synonym",
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST localhost:9200/index_v2/_analyze?analyzer=synonym_analyzer
'Inbus'

Output:
{
  "tokens": [
    {
      "token": "'inbus'",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 1
    }
  ]
}

POST localhost:9200/index_v2/_analyze?analyzer=synonym_analyzer
'Der Inbus ist'

Output:
{
  "tokens": [
    {
      "token": "'der",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "inbus",
      "start_offset": 5,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "innensechskant",
      "start_offset": 5,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "imbus",
      "start_offset": 5,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "innen-6-kant",
      "start_offset": 5,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "ist'",
      "start_offset": 11,
      "end_offset": 15,
      "type": "word",
      "position": 3
    }
  ]
}

I have changed to 1.0.1, but the behavior was the same on 1.2.
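
One possible reading of the outputs above: the tokens 'inbus', 'der and ist' still carry the single quotes, and an end_offset of 7 for a five-letter word fits that, so the quotes appear to have been sent as part of the text. The whitespace tokenizer splits on whitespace only, so a quote stays attached to its word and that word no longer equals any synonym entry. The Python sketch below imitates this under stated assumptions: the SYNONYMS table holds just one group from the settings above, and the analyze function is a simplified, hypothetical stand-in for the real analyzer chain (asciifolding is left out).

```python
# Assumption: only the "Inbus" group is modelled, already expanded and lowercased.
SYNONYMS = {
    "inbus": ["inbus", "innensechskant", "imbus", "innen-6-kant"],
}


def analyze(text):
    """Whitespace tokenizer -> synonym lookup (ignore_case) -> lowercase."""
    out = []
    for token in text.split():      # split on whitespace only, punctuation is kept
        key = token.lower()         # ignore_case: true
        expanded = SYNONYMS.get(key, [token])
        out.extend(t.lower() for t in expanded)  # lowercase filter
    return out


print(analyze("'Inbus'"))          # ["'inbus'"]
print(analyze("'Der Inbus'"))      # ["'der", "inbus'"]
print(analyze("'Der Inbus ist'"))  # ["'der", 'inbus', 'innensechskant', 'imbus', 'innen-6-kant', "ist'"]
print(analyze("Inbus"))            # ['inbus', 'innensechskant', 'imbus', 'innen-6-kant']
```

If this is what happened, only a word with whitespace on both sides is free of quotes, which would explain why the synonym matched in the three-word input alone. Sending the text without the surrounding quotes should then make the single-word case match as well.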


