Synonyms API inconsistency in identifying synonyms [Elastic 8.13.3]

I created a synonyms set (synonyms_set_chat_testing_numbers) containing 50,000 rules, with IDs ranging from 1 to 50000.

Each ID has the synonyms {id}, {id}a.

For example:

"synonyms_set": [
    {
      "id": "1",
      "synonyms": "1, 1a"
    },
    {
      "id": "2",
      "synonyms": "2, 2a"
    }
  ]
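
For anyone wanting to reproduce the setup, a rule list of this shape can be generated with a few lines of Python. This is only a sketch of the pattern described above; the helper name build_synonyms_list is hypothetical and not taken from the original post.

```python
def build_synonyms_list(count):
    """Return `count` rules of the form {"id": "N", "synonyms": "N, Na"}."""
    return [
        {"id": str(i), "synonyms": f"{i}, {i}a"}
        for i in range(1, count + 1)
    ]


synonyms_list = build_synonyms_list(50_000)
print(synonyms_list[0])    # first rule, for ID 1
print(len(synonyms_list))  # number of rules generated
```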

Issue: For some of the numbers I tested, the analyzer did not return the expected synonyms, and the failures did not follow any pattern I could find, even though the synonyms set itself was populated correctly.

Index Settings

PUT /synonyms-set-test-new-index/
{
   "settings": {
      "index": {
        "analyze": {
          "max_token_count": "20000"
        },
        "analysis": {
          "filter": {
            "synonyms_filter": {
              "updateable": "true",
              "type": "synonym_graph",
              "lenient": "true",
              "synonyms_set": "synonyms_set_chat_testing_numbers"
            }
          },
          "analyzer": {
            "index_analyzer": {
              "filter": [
                "lowercase"
              ],
              "type": "custom",
              "tokenizer": "index_tokenizer"
            },
            "synonyms": {
              "filter": [
                "lowercase",
                "synonyms_filter"
              ],
              "type": "custom",
              "tokenizer": "index_tokenizer"
            }
          },
          "tokenizer": {
            "index_tokenizer": {
              "type": "char_group",
              "tokenize_on_chars": [
                "whitespace",
                ",",
                ".",
                "“",
                "”",
                "‘",
                "’",
                "'",
                "\"",
                """
""",
                "+",
                "@",
                ":",
                "(",
                ")",
                "[",
                "]",
                "<",
                ">",
                "{",
                "}",
                """\""",
                "/",
                "-",
                ";",
                "?",
                "*",
                "&",
                "!",
                "~",
                "$",
                "%",
                "_",
                "^",
                "#",
                "`",
                "|",
                "—",
                " ",
                "▪",
                "•",
                "‒",
                "®"
              ]
            }
          }
        }
      }
    }
}

Output from the synonyms set (for IDs 1000 and 2000):

GET /_synonyms/synonyms_set_chat_testing_numbers/1000

Output (1000 and 1000a are shown as synonyms, meaning the synonyms set is populated properly):

{
  "id": "1000",
  "synonyms": "1000, 1000a"
}
GET /_synonyms/synonyms_set_chat_testing_numbers/2000

Output (2000 and 2000a are shown as synonyms, meaning the synonyms set is populated properly):

{
  "id": "2000",
  "synonyms": "2000, 2000a"
}

Output when the _analyze API is used to check the synonyms produced:

GET synonyms-set-test-new-index/_analyze
{
  "text": "1000",
  "analyzer": "synonyms"
}

Output (1000a is identified as synonym for 1000):

{
  "tokens": [
    {
      "token": "1000a",
      "start_offset": 0,
      "end_offset": 4,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "1000",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
GET synonyms-set-test-new-index/_analyze
{
  "text": "2000",
  "analyzer": "synonyms"
}

Output (2000a is not returned as a synonym of 2000, meaning the synonyms are not being applied properly):

{
  "tokens": [
    {
      "token": "2000",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}

The same inconsistency shows up for several other IDs, with no apparent pattern.
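
To check many IDs rather than a handful, the comparison between the expected tokens and an _analyze response can be automated. The sketch below only covers that comparison, using the two responses shown above as plain dictionaries; the function name missing_synonyms is hypothetical.

```python
def missing_synonyms(analyze_response, expected_tokens):
    """Return the expected tokens that are absent from an _analyze response."""
    produced = {t["token"] for t in analyze_response.get("tokens", [])}
    return [tok for tok in expected_tokens if tok not in produced]


# Responses copied from the _analyze outputs shown above.
response_1000 = {
    "tokens": [
        {"token": "1000a", "start_offset": 0, "end_offset": 4,
         "type": "SYNONYM", "position": 0},
        {"token": "1000", "start_offset": 0, "end_offset": 4,
         "type": "word", "position": 0},
    ]
}
response_2000 = {
    "tokens": [
        {"token": "2000", "start_offset": 0, "end_offset": 4,
         "type": "word", "position": 0},
    ]
}

print(missing_synonyms(response_1000, ["1000", "1000a"]))  # []
print(missing_synonyms(response_2000, ["2000", "2000a"]))  # ['2000a']
```

In a real check, each response would presumably come from the _analyze endpoint (for example via the Python client's indices.analyze call with the index, analyzer, and text arguments), looping over every ID in the set.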

Is there any reason why such inconsistencies are being observed?
Any advice you could give to fix this would be greatly appreciated.

Hi @AbhilashB thanks for reporting this!

I have a few followup questions:

  1. What version of Elasticsearch are you using?
  2. How are you updating the synonyms? 50K is too large for a single synonym set payload. Are you updating them all individually, sending in bulk updates, etc?
  3. Are the inconsistencies repeatable? For example, does 2000 always return an incorrect answer, or is it variable?
  4. If you manually reload search analyzers does this issue still occur?
  5. Can you reproduce this with smaller datasets?

I haven't been able to duplicate this yet, but hopefully this will help narrow down what might be the issue.

Hi @Kathleen_DeRusso, thanks for addressing this so promptly. The answers to your followup questions are as follows:

  1. Elasticsearch version: 8.13.3

  2. Regarding the updating of synonyms, we are creating the synonyms set using the Python Elasticsearch client.

The connection object serves as the Elasticsearch connection instance. We are sending a PUT request in the same manner as we would execute it in the Kibana console.

    connection.transport.perform_request(
        "PUT",
        f"/_synonyms/{synonyms_set_name}",
        body={"synonyms_set": synonyms_list},
        headers={'content-type': 'application/json'}
    )

A sample synonyms_list looks like this:

[ 
    { 
      "id": "1", 
      "synonyms": "1, 1a" 
   }, 
    { 
      "id": "2", 
      "synonyms": "2, 2a" 
    } 
  ]

However, I don't think the large payload is the issue, because the synonyms set is populated as expected even at the 50K size. Every ID up to 50000 that we spot-checked returned the expected rule, meaning the synonyms set is populated properly.

GET /_synonyms/synonyms_set_chat_testing_numbers/50000

Output:

{
  "id": "50000",
  "synonyms": "50000, 50000a"
}

  3. Yes, 2000 always returns an incorrect answer for the 50K set we used.

  4. Even if we manually reload the search analyzers using
    POST index_name/_reload_search_analyzers
    the issue still occurs.

  5. I tried to reproduce this issue with smaller datasets of 1k, 2k, 5k, 10k, and 15k rules.

For 1k, 2k, 5k, and 10k, I was not able to reproduce the issue; it worked fine for all the random numbers I tested.

However, when the size was 15k:

  • we got the correct answer for ID 2000
GET synonyms-set-test-new-index-15k/_analyze
{
  "text": "2000",
  "analyzer": "synonyms"
}

output:

{
  "tokens": [
    {
      "token": "2000a",
      "start_offset": 0,
      "end_offset": 4,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "2000",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
  • but we got incorrect answers for IDs 7000, 8000, 9000, and some other numbers in the 7000-9000 range
GET synonyms-set-test-new-index-15k/_analyze
{
  "text": "9000",
  "analyzer": "synonyms"
}

Output:

{
  "tokens": [
    {
      "token": "9000",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
  • and we got the correct answer for ID 10000
GET synonyms-set-test-new-index-15k/_analyze
{
  "text": "10000",
  "analyzer": "synonyms"
}

Output:

{
  "tokens": [
    {
      "token": "10000a",
      "start_offset": 0,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "10000",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}

All of these incorrect cases produce the expected answers when the set size is 10k.

Follow-up questions:

  1. Is there a maximum size for a synonyms set beyond which the analyzer no longer produces synonyms properly?
  • We ask because synonyms were identified properly for sets of up to 10k rules, while with 15k some of the same synonyms that had worked were no longer returned.

  2. If there is such a limit, is there a configurable parameter to increase it?

@AbhilashB thanks for the additional information. I was able to reproduce this issue. You found a bug!

Unfortunately, I'm not aware of a workaround at this time, but I've logged this issue ("Large synonyms sets inconsistently return synonym results", Issue #108785 in elastic/elasticsearch on GitHub) and will bring it to the team's attention. You can follow along with this issue for updates if you're interested.

Thanks for reporting the bug!

@Kathleen_DeRusso Thank you for bringing this to the team's attention. I will follow along with the GitHub thread for updates. Any timeline, ETA, or rough estimate for when this issue will be resolved would be greatly appreciated. Can we expect any?

We can't provide an ETA unfortunately, but we have confirmed with the team that 10,000 is an undocumented hard limit at this time. The documentation will likely be updated first.

I'm interested in knowing a little about your use case - do you plan to use 50K synonyms in production? More?

@Kathleen_DeRusso Yes, we plan to use them in production.

  • As of now, we have 79K synonym rules (each rule being a comma-separated list of synonyms, e.g. "1, 1a"). This number is expected to grow slowly (maybe 1K every 3-4 months).

  • These 79K synonyms currently live in a text file, and the Elasticsearch index in prod reads its synonyms from this file.

    • Each line in this text file represents one synonym rule, so the file contains 79K lines. For example:
1, 1a
2, 2a
3, 3a
....
    • The settings for the Elasticsearch index in production are configured to read synonyms from this text file. (The settings are exactly the same as described for synonyms-set-test-new-index; the only change is "synonyms_path" instead of "synonyms_set".)
          "filter": {
            "synonyms_filter": {
              "updateable": "true",
              "type": "synonym_graph",
              "lenient": "true",
              "synonyms_path": "synonyms_file_name.txt"
            }
          }

However, we want to migrate from the synonyms text file to a synonyms set. We prefer a synonyms set for the following reasons:

  1. Easier Management: Ease of CRUD operations/management with the help of ID(unique identifier) in the synonyms set, especially when we have larger datasets, and need to update/delete only some specific ID-related data.
  2. Consistency: Using synonyms set ensures consistency across different indices, without manually using the _reload_search_analyzers command.
  3. Instantaneous Reflection: Updates to the synonyms set are immediately reflected in the respective index, providing real-time synchronization without delays.

That's when we started to explore synonym sets as an alternative to using synonym text files.

  • When we populated a synonyms set with the same synonyms that were in the text file, we observed that some search queries returned incorrect results compared with the text-file setup.

  • After that, we started debugging by creating the sample synonyms set of 50K rules with IDs ranging from 1 to 50000, just to verify whether there were any inconsistencies.

  • As discussed in the thread, there is a bug, and we requested an ETA to help us decide whether to keep using a synonyms text file or to migrate to a synonyms set in production.
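
As an illustration of the file-to-set migration described above, the lines of a synonyms file can be turned into rule objects along these lines. This is a sketch under assumptions: the helper name rules_from_lines is hypothetical, and using the 1-based line number as the rule ID is a choice made for the example, not something the synonyms API requires.

```python
def rules_from_lines(lines):
    """Turn synonyms-file lines into synonyms-set rule dictionaries.

    Blank lines and lines starting with '#' (comments in the Solr synonyms
    format) are skipped. The rule ID is the 1-based line number, which is
    an assumption made for this sketch.
    """
    rules = []
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        rules.append({"id": str(line_number), "synonyms": text})
    return rules


sample_lines = ["1, 1a\n", "2, 2a\n", "\n", "# a comment\n", "5, 5a\n"]
for rule in rules_from_lines(sample_lines):
    print(rule)
```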


I hope this provides an idea of our use case and why we plan to use the synonyms set in production, as well as the size we anticipate using in production.

Thanks for the context on scale, this is helpful input to our discussions.

The only other potential workaround in the current version is to see if you could chain multiple synonyms sets in individual filters, where each synonyms set has fewer than 10,000 synonyms.
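
A rough sketch of what that chaining could look like, expressed as Python that builds the analysis settings. The chunk size of 9,000, the set names synonyms_set_part_N, and the filter names synonyms_filter_N are all assumptions for illustration; the tokenizer definition from the original index settings is referenced by name but not repeated here, and this layout has not been verified against a live cluster.

```python
MAX_RULES_PER_SET = 9_000  # assumed safety margin below the 10,000 limit


def chunk_rules(rules, size=MAX_RULES_PER_SET):
    """Split the full rule list into consecutive chunks of at most `size`."""
    return [rules[i:i + size] for i in range(0, len(rules), size)]


def chained_analysis_settings(set_names):
    """Build one synonym_graph filter per synonyms set and chain them."""
    filters = {}
    chain = ["lowercase"]
    for n, name in enumerate(set_names):
        filter_name = f"synonyms_filter_{n}"  # hypothetical naming scheme
        filters[filter_name] = {
            "type": "synonym_graph",
            "updateable": True,
            "lenient": True,
            "synonyms_set": name,
        }
        chain.append(filter_name)
    return {
        "filter": filters,
        "analyzer": {
            "synonyms": {
                "type": "custom",
                "tokenizer": "index_tokenizer",
                "filter": chain,
            }
        },
    }


rules = [{"id": str(i), "synonyms": f"{i}, {i}a"} for i in range(1, 50_001)]
chunks = chunk_rules(rules)
set_names = [f"synonyms_set_part_{n}" for n in range(len(chunks))]
settings = chained_analysis_settings(set_names)
print(len(chunks))  # 6
print(settings["analyzer"]["synonyms"]["filter"])
```

Each chunk would then be uploaded to its own synonyms set, and the returned dictionary would go under settings.index.analysis alongside the tokenizer definition.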

@Kathleen_DeRusso Yes, that would be the approach if we are to use the synonyms set method right now and evaluate its performance, especially for chaining multiple synonym sets.

  • Concern: The primary concern is the management of multiple synonym sets and tracking which IDs are present in each set. This is crucial for updating or deleting synonym rules based on their IDs.
  • Expectation: Ideally, once the bugs are fixed, we expect the synonyms set to function in the same way as the text file, even with larger sizes. Only with these assurances and based on the bug fixes would we feel comfortable transitioning from using a synonyms text file to a synonyms set in production.
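
One way to ease the tracking concern is to derive the set name from the rule ID itself, so no separate lookup table is needed. The sketch below uses a stable hash for that; the function name set_name_for, the set-name prefix, and the choice of 12 sets are hypothetical. Bucket sizes are only approximately even, so the number of sets should leave headroom below the limit.

```python
import hashlib
from collections import Counter


def set_name_for(rule_id, n_sets, prefix="synonyms_set_part_"):
    """Map a rule ID to a synonyms set name, deterministically.

    SHA-256 is used instead of Python's built-in hash() because the
    built-in string hash is randomized between interpreter runs.
    """
    digest = hashlib.sha256(rule_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % n_sets
    return f"{prefix}{bucket}"


# The same ID always lands in the same set, so updates and deletes
# can locate the rule without extra bookkeeping.
print(set_name_for("2000", 12))

# Rough size check for 79K numeric IDs spread over 12 sets.
sizes = Counter(set_name_for(str(i), 12) for i in range(1, 79_001))
print(max(sizes.values()))
```

Note that changing the number of sets remaps most IDs, so the count would need to be chosen with future growth in mind.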

Is there a way you can potentially share more about the structure of the synonyms?
If you indeed have synonyms with a structure like 1, 1a, 2, 2a, maybe you can take a look at the word_delimiter or word_delimiter_graph filter. With generate_word_parts, text like abc1000 could be split into the tokens abc and 1000. In that case you would not need a synonym set for abc1000, abc1001; searching for abc1001 would also return results that contain abc1000.
Or maybe the pattern filter could be a good fit.
Just trying to figure out whether using something other than synonyms would also work for your use case.
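
For reference, a word_delimiter_graph setup along those lines might look roughly like the following, written as a Python dictionary of analysis settings. The filter name split_letters_and_digits and the analyzer name delimited are hypothetical, index_tokenizer refers to the tokenizer defined earlier in the thread, and whether this fits the real synonym data is an open question.

```python
# Sketch of analysis settings using word_delimiter_graph instead of synonyms.
word_delimiter_analysis = {
    "filter": {
        "split_letters_and_digits": {  # hypothetical filter name
            "type": "word_delimiter_graph",
            "generate_word_parts": True,    # emit alphabetic parts, e.g. "abc"
            "generate_number_parts": True,  # emit numeric parts, e.g. "1000"
            "split_on_numerics": True,      # split at letter/digit boundaries
        }
    },
    "analyzer": {
        "delimited": {  # hypothetical analyzer name
            "type": "custom",
            "tokenizer": "index_tokenizer",
            "filter": ["split_letters_and_digits", "lowercase"],
        }
    },
}

print(word_delimiter_analysis["analyzer"]["delimited"]["filter"])
```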
