Synonym filter not working within a Multiplexer Filter

gquaire · April 29, 2021, 9:07pm

Hello,

I'm using Elasticsearch v7.6.2. I am trying to create a custom analyzer that uses a multiplexer filter with a single branch in its "filters" list. This branch contains only a synonym filter. My goal is to keep the original tokens while also getting the synonyms. My test request looks like this:

    GET test/_analyze
    {
      "explain": false,
      "tokenizer": "whitespace",
      "filter": [
        {
          "type": "multiplexer",
          "preserve_original": true,
          "filters": ["synonym_expression"]
        }
      ],
      "text": ["gré à gré"]
    }

The synonym filter referenced by the multiplexer is configured in my index settings as:

    "synonym_expression": {
      "type": "synonym",
      "synonyms_path": "dictionaries/protectedExpression.txt"
    }

The synonym file contains this line (Solr format):

    gré à gré => greagre

If I run the _analyze query, I get this output:

    {
      "tokens" : [
        {
          "token" : "gré",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "à",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "gré",
          "start_offset" : 6,
          "end_offset" : 9,
          "type" : "word",
          "position" : 2
        }
      ]
    }

I don't see any synonym in the resulting tokens.
If I set "preserve_original" to false, I get this new output:

    {
      "tokens" : [
        {
          "token" : "greagre",
          "start_offset" : 0,
          "end_offset" : 9,
          "type" : "SYNONYM",
          "position" : 0
        }
      ]
    }

Now the synonym is in the output. I don't understand the behaviour of my analyzer. What am I doing wrong? How can I get both the original tokens and the synonyms in the output of my multiplexer filter?
Thank you in advance for your help.

Gérald

gquaire · May 4, 2021, 5:53am

Hello everyone,

OK, I finally have some idea of what is going on here, thanks to @ludovic_boutros, who helped me find the cause of the issue. Apparently, I am facing a side effect of the "remove duplicates" filter applied to the output of the multiplexer filter. In the code of the "remove duplicates" filter, we can see:

    boolean duplicate = (posIncrement == 0 && previous.contains(term, 0, length));

So the filter considers a token that is included in another one to be a duplicate. In my case the term 'gre' is included in the synonym 'greagre', so it is considered a duplicate token and is removed from the token stream. That is why the original tokens are removed: they are always included in the synonym token.
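
As a rough illustration of the reasoning above, here is a toy model in Python of the check as I describe it. It works on plain strings and is not the Lucene code; the function name `flagged_as_duplicate` is made up for this sketch, and it assumes that "contains" behaves as a substring test, which is my reading of the line quoted above.

```python
def flagged_as_duplicate(term: str, previous: str, pos_increment: int) -> bool:
    """Toy model of the check as described in this post (not Lucene's code).

    A token is flagged when it sits at the same position as the previous
    token (position increment 0) and its text is contained in that token.
    """
    return pos_increment == 0 and term in previous


# 'gre' is a substring of 'greagre', so it is flagged at the same position.
print(flagged_as_duplicate("gre", "greagre", 0))           # True
# An output that shares no text with the original token is not flagged.
print(flagged_as_duplicate("gre", "87a5c1d9a479a2cd", 0))  # False
# A token at a different position is never flagged.
print(flagged_as_duplicate("gre", "greagre", 1))           # False
```
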
So to resolve the issue, there are two options:

1. Replace the "contains" String method with "equals" in the test of the "RemoveDuplicatesTokenFilter".
2. Modify the output synonyms in my synonyms file so that they differ from the original tokens, for instance:
       gre à gre => 87a5c1d9a479a2cd
   (the CRC64 code of 'greagre'), or use some other kind of encoding.
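
For the second option, a small script can generate the rule lines for the synonyms file. This is only a sketch under a few assumptions: Python's standard library has no CRC64, so it uses an 8-byte BLAKE2b digest from `hashlib` as a stand-in (also 16 hex characters, but not the same value as the CRC64 above), and the helper names `opaque_code` and `synonym_rule` are made up for the example.

```python
import hashlib


def opaque_code(text: str) -> str:
    """Return a 16-hex-character code for text (stand-in for CRC64)."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=8).hexdigest()


def synonym_rule(expression: str) -> str:
    """Build one Solr-format rule mapping the expression to an opaque code."""
    tokens = expression.split()
    code = opaque_code("".join(tokens))
    # Guard: the whole point is that no original token appears in the output.
    if any(token in code for token in tokens):
        raise ValueError(f"code {code!r} still contains an original token")
    return f"{expression} => {code}"


print(synonym_rule("gre à gre"))
```

The guard can reject expressions made of very short hexadecimal-looking tokens (for example a lone "a"); in that case another encoding would be needed.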

I will implement the second solution, and I will open an issue ticket with the Lucene project.

I hope my explanation helps others.

Gérald
