Synonym order with unique filter breaks search

Hi, I have a weird issue with synonyms combined with a unique token filter that I cannot get my head around.

Minimal reproducible example:

Settings:

    {
      "settings": {
        "index": {
          "analysis": {
            "filter": {
              "synonym": {
                "type": "synonym_graph",
                "synonyms_path": "synonyms/synonyms.txt",
                "updateable": true
              }
            },
            "analyzer": {
              "synonym": {
                "tokenizer": "standard",
                "filter": [
                  "synonym",
                  "unique"
                ]
              }
            }
          }
        }
      }
    }

synonyms.txt

billie jo wilsson,billiejo,billie-jo,billiejo wilson,billie-jo wilson,billiejo wilsson,billie-jo wilsson,billiejoo,billie jo

Mappings:

    {
      "properties": {
        "description": {
          "type": "text",
          "index": true
        }
      }
    }

Documents:

    {
      "description": "billie-jo wilsson"
    }

Query:

    {
      "query": {
        "multi_match": {
          "query": "billie-jo",
          "fields": ["description"],
          "type": "cross_fields",
          "analyzer": "synonym",
          "operator": "AND",
          "boost": 0.4
        }
      }
    }

Running this query returns the indexed document, as expected. But after rotating the synonyms one step to the left (i.e. moving the first synonym to the end of the row):

billiejo,billie-jo,billiejo wilson,billie-jo wilson,billiejo wilsson,billie-jo wilsson,billiejoo,billie jo,billie jo wilsson

... then no hits are returned. Why is that? Does the order of the synonyms matter?

However, if I remove the unique filter from the analyzer, the query works again, even with the rotated synonyms.

Is this behaviour expected for some reason I don't understand, or is there a synonym issue here?

This was run on Elasticsearch v7.14.

In cases like this, it's always a good starting point to check what the actual analysis output looks like. This can be done using the "_analyze" endpoint:

POST /test/_analyze
{
    "field" : "description",
    "text" : "billie-jo wilsson"    
}

This shows that the input document text is split into three tokens at consecutive positions:

{
    "tokens": [
        {
            "token": "billie",
            "start_offset": 0,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "jo",
            "start_offset": 7,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "wilsson",
            "start_offset": 10,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 2
        }
    ]
}

To get the "synonym" analyser output, do:

POST /test/_analyze
{
    "analyzer" : "synonym",
    "text" : "billie-jo"
}

which shows six tokens, three at position 0 and the rest at subsequent positions:

{
    "tokens": [
        {
            "token": "billie",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "billiejo",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 0,
            "positionLength": 9
        },
        {
            "token": "billiejoo",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 0,
            "positionLength": 9
        },
        {
            "token": "jo",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "wilsson",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 2,
            "positionLength": 7
        },
        {
            "token": "wilson",
            "start_offset": 0,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 3,
            "positionLength": 6
        }
    ]
}

If you remove the "unique" filter you will see many more tokens, some of which I guess are needed to connect the token graph. The output is a bit complex to parse at a glance; if this needs further explanation, I might have to dig a bit deeper when time allows.
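
One way to see which tokens the "unique" filter drops might be the "explain" option of the "_analyze" API, which for a custom analyzer reports the token stream after the tokenizer and after each token filter. A minimal sketch, assuming the same "test" index and "synonym" analyzer as above, sent as the body of POST /test/_analyze:

```json
{
    "analyzer": "synonym",
    "text": "billie-jo",
    "explain": true
}
```

Running this once per version of the synonyms file and comparing the output of the "synonym" stage with that of the "unique" stage should show which terms are discarded in each case.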

And as a short note: synonyms shouldn't depend on their order, but the order in which tokens pass through the "unique" filter might change which ones get discarded.
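
If duplicates only need to be removed when they share a position, the "unique" filter has an "only_on_same_position" option (false by default, in which case duplicates are removed across the whole token stream). Below is a sketch of the index settings with that option enabled. The filter name "unique_same_position" is made up for illustration, and whether this restores the hit with the rotated synonyms file has not been verified here:

```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym_graph",
            "synonyms_path": "synonyms/synonyms.txt",
            "updateable": true
          },
          "unique_same_position": {
            "type": "unique",
            "only_on_same_position": true
          }
        },
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": [
              "synonym",
              "unique_same_position"
            ]
          }
        }
      }
    }
  }
}
```

With this variant, a term that appears at several positions in the synonym graph is kept at each of them, so the result should depend less on the order in which tokens arrive.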

Yes, that was my conclusion too: the unique filter removes tokens that are needed to match. But that doesn't explain why I get a hit with the first version of the synonyms file but not after shifting it one step, both with the unique filter turned on.

To me, the tokens should remain the same no matter what order the synonyms appear in, right?
