Hello, I'm using Elasticsearch 8.13 (close to the latest release) and I'm trying to build an analyzer with a multiplexer that uses a synonym filter. However, I found that the results from simple filter chaining (without a multiplexer) differ from those of a multiplexer with the same token filters. I made an artificial example; here are the two analyzers:
PUT /test-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_analyzer": {
            "tokenizer": "classic",
            "filter": [
              "test_stemmer",
              "test_synonym_filter"
            ]
          },
          "test_analyzer_multiplexer": {
            "tokenizer": "classic",
            "filter": [
              "multiplexer_custom"
            ]
          }
        },
        "filter": {
          "multiplexer_custom": {
            "type": "multiplexer",
            "filters": [
              "test_stemmer, test_synonym_filter"
            ],
            "preserve_original": true
          },
          "test_synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "walking, jumping fox"
            ]
          },
          "test_stemmer": {
            "type": "stemmer",
            "language": "english"
          }
        }
      }
    }
  }
}
They are basically identical and should (I assume) output the same results. But when I test them, I get different tokens. With simple filter chaining everything is OK:
GET test-index/_analyze
{
  "analyzer": "test_analyzer",
  "text": "jumping fox"
}
Result:
{
  "tokens": [
    {
      "token": "walk",
      "start_offset": 0,
      "end_offset": 11,
      "type": "SYNONYM",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "jump",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
I get my stemmed synonym "walk", as expected. But if I test the analyzer with the multiplexer:
GET test-index/_analyze
{
  "analyzer": "test_analyzer_multiplexer",
  "text": "jumping fox"
}
The result is:
{
  "tokens": [
    {
      "token": "jumping",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "jump",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
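To narrow down where the synonym gets lost, the same request could be repeated with the explain flag of the _analyze API, which reports the token stream after the tokenizer and after each token filter. This is only a sketch of how I would inspect it; I have not included the output here.

```console
GET test-index/_analyze
{
  "analyzer": "test_analyzer_multiplexer",
  "text": "jumping fox",
  "explain": true
}
```

Running the same request against test_analyzer should make it possible to compare the two analyzers stage by stage.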
No synonym at all. I believe this happens because the multiplexer preserves original tokens by default, but even so, where is the synonym? If I set preserve_original: false, I get the right result, but what if I need to keep the original tokens AND get synonyms while using the multiplexer?
Either this is some kind of bug or I don't fully understand how it should work.
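For reference, the variant that gives me the synonym changes only the multiplexer definition. A minimal sketch of that filter entry, assuming everything else in the index settings stays exactly as in the example:

```json
{
  "multiplexer_custom": {
    "type": "multiplexer",
    "filters": [
      "test_stemmer, test_synonym_filter"
    ],
    "preserve_original": false
  }
}
```

With this setting the original, unfiltered tokens are no longer emitted, which is exactly what I would like to avoid.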
P.S. I saw an almost identical topic, "Synonym filter not working within a Multiplexer Filter", but I believe my case is different: my synonyms don't intersect with each other, so RemoveDuplicatesTokenFilter from Lucene should (I think) work correctly. Maybe something is going wrong somewhere else?