Unexpected result from Synonym Filter if Char Filter is also used

Patrick_N · April 21, 2022, 10:29am

I have a field whose search analyzer uses both a char filter (it replaces every run of Chinese characters with a separator token) and a synonym filter.

PUT test
{
    "mappings": {
        "properties": {
            "description": {
                "analyzer": "standard",
                "search_analyzer": "foobar", 
                "type": "text"
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "foobar": {
                    "char_filter": [
                        "chinese_to_sep"
                    ],
                    "filter": [
                        "custom_synonyms"
                    ],
                    "tokenizer": "standard"
                }
            },
            "char_filter": {
                "chinese_to_sep": {
                    "pattern": "([\\p{IsHan}]+)",
                    "replacement": " sepsep ",
                    "type": "pattern_replace"
                }
            },
            "filter": {
              "custom_synonyms" : {
                "type" : "synonym",
                "updateable" : "true",
                "synonyms" : [
                  "苹果 => apple",
                  "香蕉 => banana",
                  "柠檬 => lemon",
                  "xyz => universe"
                ]
              }
            }
        }
    }
}

(You may ask why my synonym filter still contains Chinese words when my char filter removes all Chinese characters. The reason is that the same synonym filter is shared with other fields in the real setup, and not all of those fields strip Chinese words.)

Anyway, when I run the following _analyze request, I expect to get just two tokens ("Sun" and "sepsep").

GET test/_analyze
{
  "analyzer": "foobar",
  "text": "Sun 太陽",
  "explain": false
}

To my surprise, I got this instead:

{
  "tokens" : [
    {
      "token" : "Sun",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "apple",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "banana",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "lemon",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

It seems the synonym rules themselves are somehow "pre-processed" by the char filter, so the left-hand sides of the first three rules all become "sepsep", and "sepsep" is then mapped to all three synonyms. Is this expected, or is it a bug?
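
To make that hypothesis concrete, here is a small simulation in plain Python, outside Elasticsearch. It is only a sketch under stated assumptions: `chinese_to_sep`, `tokenize`, `build_synonym_map` and `analyze` are hypothetical helpers rather than Elasticsearch APIs, whitespace splitting stands in for the standard tokenizer, and the range `\u4e00-\u9fff` approximates `\p{IsHan}`, which Python's `re` module does not support. It compares a synonym table built from rules that first pass through the char filter with one built from the rules exactly as written.

```python
import re
from collections import defaultdict

# Approximation of the Java pattern [\p{IsHan}]+ (assumption: the CJK Unified
# Ideographs block is enough to cover the characters used in this thread).
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")

RULES = [
    "苹果 => apple",
    "香蕉 => banana",
    "柠檬 => lemon",
    "xyz => universe",
]


def chinese_to_sep(text):
    """Stand-in for the pattern_replace char filter from the index settings."""
    return HAN_RUN.sub(" sepsep ", text)


def tokenize(text):
    """Crude stand-in for the standard tokenizer: split on whitespace."""
    return text.split()


def build_synonym_map(rules, preprocess):
    """Build {left-hand side: [replacements]} after preprocessing each left side."""
    table = defaultdict(list)
    for rule in rules:
        lhs, rhs = (side.strip() for side in rule.split("=>"))
        key = " ".join(tokenize(preprocess(lhs)))
        table[key].append(rhs)
    return dict(table)


def analyze(text, table):
    """Char filter, then tokenizer, then single-token synonym replacement."""
    tokens = []
    for token in tokenize(chinese_to_sep(text)):
        tokens.extend(table.get(token, [token]))
    return tokens


# Hypothesis: the rules go through the char filter before the table is built.
filtered_rules = build_synonym_map(RULES, chinese_to_sep)
# What I expected: the rules are used exactly as written.
literal_rules = build_synonym_map(RULES, lambda s: s)

print(filtered_rules)
print(analyze("Sun 太陽", filtered_rules))
print(analyze("Sun 太陽", literal_rules))
```

In this simulation, only the table built from char-filtered rules reproduces the four tokens in the response above. The table built from the literal rules yields the two tokens I expected. This is consistent with the hypothesis but does not by itself prove how Elasticsearch builds its synonym map.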

system · May 19, 2022, 10:29am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
