ICU transform filters slowing down indexing: how to avoid duplicate transliterations?

I was investigating slow Bulk API indexing, as discussed in Bulk index slowing down as index size increases.

It turns out the root cause seems to be the slow ICU transform filters once we start indexing Chinese data.
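
For anyone investigating something similar: the nodes hot threads API is one way to check whether the write threads are spending their CPU time inside the ICU transliteration, for example:

GET /_nodes/hot_threads?type=cpu&threads=5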

We have mappings such as this:

{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "ascii_keyword": {
          "tokenizer": "keyword",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin", "trim" ]
        },
        "ascii_keyword_reverse": {
          "tokenizer": "keyword",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin", "trim", "reverse" ]
        },
        "standard_ascii": {
          "tokenizer": "standard",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin" ]
        },
        "standard_ascii_reverse": {
          "tokenizer": "standard",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin", "reverse" ]
        }
      },
      "normalizer": {
        "lowercase_ascii": {
          "type": "custom",
          "filter":  [ "lowercase", "no_accent_latin" ]
        }
      },
      "char_filter": {
        "multi_space_char_filter": ...,
        "apostrophe_remove": ...
      },
      "filter" : {
        "no_accent_latin" : {
          "type" : "icu_transform",
          "id" : "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
        }
      }
    }
  },
  "mappings" : {
    "dynamic" : "strict",
    "properties" : {
      "searchableName" : {
        "type" : "text",
        "fields": {
          "ascii_keyword": { "type": "text", "analyzer": "ascii_keyword" },
          "ascii_keyword_reverse": { "type": "text", "analyzer": "ascii_keyword_reverse" },
          "standard_ascii": { "type": "text", "analyzer": "standard_ascii" },
          "standard_ascii_reverse": { "type": "text", "analyzer": "standard_ascii_reverse" },
          "sorted_latin": { "type": "keyword", "normalizer": "lowercase_ascii" },
          ...
        }
      }
    }
  }
}
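
Each analyzer can also be exercised in isolation against representative Chinese text with the _analyze API; timing such calls client-side gives a rough per-analyzer cost (the index name below is just a placeholder):

POST /my-index/_analyze
{
  "analyzer": "ascii_keyword",
  "text": "北京市朝阳区"
}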

Am I correct to assume that, because the searchableName sub-fields in the example above use 5 different analyzers/normalizers that all rely on icu_transform, we run the same transliteration on the same text 5 times? Can we somehow optimize this? Can filters somehow reuse the intermediate results of other filters?
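
One workaround I can think of (sketched below, with hypothetical field and analyzer names): run the Any-Latin transform once per document outside the analysis chain, e.g. client-side with ICU, index the result into a separate pre-transliterated field, and drop no_accent_latin from all the analyzers so they only do the cheap lowercase/trim/reverse work:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "folded_keyword": {
          "tokenizer": "keyword",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "trim" ]
        },
        "folded_keyword_reverse": {
          "tokenizer": "keyword",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "trim", "reverse" ]
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "searchableNameLatin": {
        "type": "text",
        "analyzer": "folded_keyword",
        "fields": {
          "reverse": { "type": "text", "analyzer": "folded_keyword_reverse" }
        }
      }
    }
  }
}

That pushes the cost onto the indexing client and duplicates data in the source document, though, so I am not sure it is the right trade-off.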
