Synonym filter not working within a Multiplexer Filter

Hello,

I'm using Elasticsearch v7.6.2. I am trying to create a custom analyzer that uses a multiplexer filter with a single filter branch. This branch contains only a synonym filter. My goal is to keep the original tokens while also getting the synonyms. My test request looks like this:

    GET test/_analyze
    {
      "explain": false,
      "tokenizer": "whitespace",
      "filter": [
        {
          "type": "multiplexer",
          "preserve_original": true,
          "filters": ["synonym_expression"]
        }
      ],
      "text": ["gré à gré"]
    }

The synonym filter referenced by the multiplexer is configured in my index settings as:

    "synonym_expression": {
      "type": "synonym",
      "synonyms_path": "dictionaries/protectedExpression.txt"
    }

The synonym file contains this line (Solr format):

    gré à gré => greagre

If I run the _analyze query, I get this output:

    {
      "tokens" : [
        {
          "token" : "gré",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "à",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "gré",
          "start_offset" : 6,
          "end_offset" : 9,
          "type" : "word",
          "position" : 2
        }
      ]
    }

I do not see any synonym in the resulting tokens.
If I set "preserve_original" to false, I get this new output:

    {
      "tokens" : [
        {
          "token" : "greagre",
          "start_offset" : 0,
          "end_offset" : 9,
          "type" : "SYNONYM",
          "position" : 0
        }
      ]
    }

Now the synonym is in the output. I don't understand the behaviour of my analyzer. What am I doing wrong? How can I get both the original tokens and the synonyms in the output of my multiplexer filter?
Thank you in advance for your help.

Gérald

Hello everyone,

OK, I finally have some idea of what is going on here, thanks to @ludovic_boutros, who helped me find the cause of the issue. Apparently, I am facing a side effect of the "remove duplicates" filter applied to the output of the multiplexer filter. In the code of the "remove duplicates" filter, we can see:

    boolean duplicate = (posIncrement == 0 && previous.contains(term, 0, length));

So the filter considers a token that is contained in another token to be a duplicate. In my case the term 'gre' is contained in the synonym 'greagre', so it is considered a duplicate token and is removed from the token stream. That is why the original tokens are removed: they are always contained in the synonym token.
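To make the reasoning concrete, here is a minimal sketch of the difference between a substring test and an exact-match test. It assumes, as this post does, that the check behaves like a substring test. It uses plain `java.lang.String` with hypothetical helper names, not Lucene classes, so it does not show what `RemoveDuplicatesTokenFilter` actually does; if `previous` turns out to be a set of already seen terms, `contains` would be an exact-membership test and this explanation would not apply.

```java
// Illustration only: plain java.lang.String, not Lucene's RemoveDuplicatesTokenFilter.
final class ContainsVsEquals {

    // Substring test: true when `term` occurs anywhere inside `previous`.
    static boolean isDuplicateBySubstring(String previous, String term) {
        return previous.contains(term);
    }

    // Exact test: true only when both terms are identical.
    static boolean isDuplicateByEquality(String previous, String term) {
        return previous.equals(term);
    }

    public static void main(String[] args) {
        String synonym = "greagre"; // output of the synonym rule
        String original = "gre";    // unaccented original token, as written in this post

        System.out.println(isDuplicateBySubstring(synonym, original)); // true
        System.out.println(isDuplicateByEquality(synonym, original));  // false
    }
}
```

Under that assumption, only the exact test keeps 'gre' alongside 'greagre'.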
So to resolve the issue, there are two options:

  • Replace the String "contains" method with "equals" in the duplicate test of the "RemoveDuplicatesTokenFilter"
  • Modify the output synonyms in my synonyms file so that they differ from the original tokens, for instance:
    gre à gre => 87a5c1d9a479a2cd
    (the CRC64 code of 'greagre') or some other kind of encoding.
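As a rough sketch of the second option, the snippet below generates a synonym rule whose output is an opaque hexadecimal key. Note the assumptions: the JDK has no built-in CRC64, so this uses `java.util.zip.CRC32` as a stand-in (the key is 8 hex characters, not the 16 shown above), and the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch only: CRC32 stands in for the CRC64 mentioned above.
final class OpaqueSynonym {

    // Hex CRC32 of the UTF-8 bytes of the expression, zero-padded to 8 characters.
    static String opaqueKey(String expression) {
        CRC32 crc = new CRC32();
        crc.update(expression.getBytes(StandardCharsets.UTF_8));
        return String.format("%08x", crc.getValue());
    }

    // Builds one Solr-format rule: "<input tokens> => <opaque key>".
    static String synonymRule(String input, String target) {
        return input + " => " + opaqueKey(target);
    }

    public static void main(String[] args) {
        // Prints a line that can be pasted into the synonyms file.
        System.out.println(synonymRule("gre à gre", "greagre"));
    }
}
```

Any encoding would do here, as long as the generated key cannot collide with a token from the original text.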

I will implement the second solution, and I will open an issue ticket with the Lucene project.

I hope this explanation helps others.

Gérald