Unexpected result from Synonym Filter if Char Filter is also used

I have a field whose search analyzer uses both a char filter (it converts runs of Chinese characters to a separator text) and a synonym filter.

PUT test
{
    "mappings": {
        "properties": {
            "description": {
                "analyzer": "standard",
                "search_analyzer": "foobar", 
                "type": "text"
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "foobar": {
                    "char_filter": [
                        "chinese_to_sep"
                    ],
                    "filter": [
                        "custom_synonyms"
                    ],
                    "tokenizer": "standard"
                }
            },
            "char_filter": {
                "chinese_to_sep": {
                    "pattern": "([\\p{IsHan}]+)",
                    "replacement": " sepsep ",
                    "type": "pattern_replace"
                }
            },
            "filter": {
              "custom_synonyms" : {
                "type" : "synonym",
                "updateable" : "true",
                "synonyms" : [
                  "苹果 => apple",
                  "香蕉 => banana",
                  "柠檬 => lemon",
                  "xyz => universe"
                ]
              }
            }
        }
    }
}

(You may ask why my synonym filter still contains Chinese words when my char filter removes all Chinese characters. In the real setup that synonym filter is shared with other fields, and not all of those fields strip Chinese words.)

Anyway, when I run this query, I expect to get just two tokens ("Sun" and "sepsep").

GET test/_analyze
{
  "analyzer": "foobar",
  "text": "Sun 太陽",
  "explain": false
}

To my surprise, I got this instead:

{
  "tokens" : [
    {
      "token" : "Sun",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "apple",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "banana",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "lemon",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

It seems the synonym rules themselves are somehow "pre-processed" by the char filter, so the left-hand sides of the first 3 rules all become "sepsep", and "sepsep" is then mapped to all three synonyms. Is this expected, or is it a bug?
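
To make my guess concrete, here is a rough Python simulation of what I think is happening. It is only a sketch under several assumptions and not how Elasticsearch is actually implemented: the `tokenize` helper is a crude stand-in for the standard tokenizer, the regex covers only the CJK Unified Ideographs block because Python's `re` has no `\p{IsHan}`, and the function names are made up for illustration. The key assumption is that each side of every synonym rule is passed through the same char filter and tokenizer as the query text.

```python
import re

# Assumption: approximates the Java pattern ([\p{IsHan}]+) with the
# CJK Unified Ideographs block only; Python's re has no \p{IsHan}.
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")

RULES = [
    "苹果 => apple",
    "香蕉 => banana",
    "柠檬 => lemon",
    "xyz => universe",
]


def chinese_to_sep(text):
    """Hypothetical stand-in for the pattern_replace char filter."""
    return HAN_RUN.sub(" sepsep ", text)


def tokenize(text):
    """Crude stand-in for the standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


def build_synonym_map(rules):
    """Parse rules, running BOTH sides through the char filter and tokenizer.

    This is the behaviour I suspect, not something I have confirmed.
    """
    synonyms = {}
    for rule in rules:
        lhs, rhs = (side.strip() for side in rule.split("=>"))
        key = " ".join(tokenize(chinese_to_sep(lhs)))
        value = " ".join(tokenize(chinese_to_sep(rhs)))
        synonyms.setdefault(key, []).append(value)
    return synonyms


def analyze(text, synonyms):
    """Apply char filter, tokenizer, then single-token synonym replacement."""
    tokens = []
    for token in tokenize(chinese_to_sep(text)):
        tokens.extend(synonyms.get(token, [token]))
    return tokens


synonym_map = build_synonym_map(RULES)
print(synonym_map)
print(analyze("Sun 太陽", synonym_map))
```

Under these assumptions the three Chinese rules collapse onto the single key "sepsep", and analyzing "Sun 太陽" yields "Sun", "apple", "banana", "lemon", which is the same token list the _analyze call returned.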
