I created a synonyms set (synonyms_set_chat_testing_numbers) containing 50,000 rules, with IDs ranging from 1 to 50000.
Each ID has the synonyms {id}, {id}a.
For example:
"synonyms_set": [
{
"id": "1",
"synonyms": "1, 1a"
},
{
"id": "2",
"synonyms": "2, 2a"
}
]
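For reference, a minimal sketch of how a request body with this shape could be generated. The helper name build_synonyms_set is hypothetical, and the "synonyms_set" key is assumed to be the one expected by the synonyms API (PUT /_synonyms/synonyms_set_chat_testing_numbers); this is not necessarily how the set was actually loaded.

```python
import json


def build_synonyms_set(first_id: int, last_id: int) -> dict:
    """Build a synonyms-set request body with one rule per ID.

    Each rule maps the ID to the pair "{id}, {id}a", as described above.
    (Hypothetical helper, for illustration only.)
    """
    rules = [
        {"id": str(i), "synonyms": f"{i}, {i}a"}
        for i in range(first_id, last_id + 1)
    ]
    return {"synonyms_set": rules}


body = build_synonyms_set(1, 50000)

# Show the rule count and the first two rules.
print(len(body["synonyms_set"]))
print(json.dumps(body["synonyms_set"][:2], indent=2))
```

The resulting dictionary can be serialized with json.dumps and sent as the request body.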
Issue: Synonyms were not identified correctly for some of the numbers I tested, and the failures did not follow any apparent pattern, even though the synonyms set itself was populated correctly.
Index Settings
PUT /synonyms-set-test-new-index/
{
"settings": {
"index": {
"analyze": {
"max_token_count": "20000"
},
"analysis": {
"filter": {
"synonyms_filter": {
"updateable": "true",
"type": "synonym_graph",
"lenient": "true",
"synonyms_set": "synonyms_set_chat_testing_numbers"
}
},
"analyzer": {
"index_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "index_tokenizer"
},
"synonyms": {
"filter": [
"lowercase",
"synonyms_filter"
],
"type": "custom",
"tokenizer": "index_tokenizer"
}
},
"tokenizer": {
"index_tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
",",
".",
"“",
"”",
"‘",
"’",
"'",
"\"",
"""
""",
"+",
"@",
":",
"(",
")",
"[",
"]",
"<",
">",
"{",
"}",
"""\""",
"/",
"-",
";",
"?",
"*",
"&",
"!",
"~",
"$",
"%",
"_",
"^",
"#",
"`",
"|",
"—",
" ",
"▪",
"•",
"‒",
"®"
]
}
}
}
}
}
}
Output from the synonyms set (for IDs 1000 and 2000):
GET /_synonyms/synonyms_set_chat_testing_numbers/1000
Output (1000 and 1000a are returned as synonyms, so the synonyms set is populated correctly):
{
"id": "1000",
"synonyms": "1000, 1000a"
}
GET /_synonyms/synonyms_set_chat_testing_numbers/2000
Output (2000 and 2000a are returned as synonyms, so the synonyms set is populated correctly):
{
"id": "2000",
"synonyms": "2000, 2000a"
}
Output when the analyze API is used to check the synonyms produced:
GET synonyms-set-test-new-index/_analyze
{
"text": "1000",
"analyzer": "synonyms"
}
Output (1000a is identified as a synonym for 1000):
{
"tokens": [
{
"token": "1000a",
"start_offset": 0,
"end_offset": 4,
"type": "SYNONYM",
"position": 0
},
{
"token": "1000",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
}
]
}
GET synonyms-set-test-new-index/_analyze
{
"text": "2000",
"analyzer": "synonyms"
}
Output (2000a is not returned as a synonym for 2000, so the synonym rule is not being applied):
{
"tokens": [
{
"token": "2000",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
}
]
}
The same inconsistency occurs for several other IDs, seemingly at random and without any discernible pattern.
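A minimal sketch of how the affected IDs could be listed from collected _analyze responses. The function names are hypothetical, and the responses below are trimmed copies of the two outputs shown above; fetching them from GET synonyms-set-test-new-index/_analyze is left out on purpose.

```python
def has_expected_synonym(analyze_response: dict, rule_id: int) -> bool:
    """Return True if both "{id}" and "{id}a" appear among the tokens."""
    tokens = {t["token"] for t in analyze_response.get("tokens", [])}
    return {str(rule_id), f"{rule_id}a"} <= tokens


def find_missing(responses: dict) -> list:
    """Return the sorted IDs whose _analyze output lacks the synonym token."""
    return sorted(
        rule_id
        for rule_id, response in responses.items()
        if not has_expected_synonym(response, rule_id)
    )


# _analyze responses keyed by the ID that was analyzed (token fields trimmed).
responses = {
    1000: {"tokens": [
        {"token": "1000a", "type": "SYNONYM", "position": 0},
        {"token": "1000", "type": "word", "position": 0},
    ]},
    2000: {"tokens": [
        {"token": "2000", "type": "word", "position": 0},
    ]},
}

print(find_missing(responses))  # [2000]
```

Running the same check over all 50,000 IDs would show whether the failing IDs really are scattered or fall into a range.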
Is there any reason why such inconsistencies are being observed?
Any advice you could give to fix this would be greatly appreciated.