ICU transform filters slowing down indexing: how to avoid duplicate transliterations?

Pyppe · May 27, 2020, 1:28pm

I was investigating slow Bulk API indexing, as discussed in "Bulk index slowing down as index size increases".

The root cause seems to be the ICU transform filters, which are slow once we start indexing Chinese data.

We have mappings such as this:

{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "ascii_keyword": {
          "tokenizer": "keyword",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin", "trim" ]
        },
        "ascii_keyword_reverse": {
          "tokenizer": "keyword",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin", "trim", "reverse" ]
        },
        "standard_ascii": {
          "tokenizer": "standard",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin" ]
        },
        "standard_ascii_reverse": {
          "tokenizer": "standard",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin", "reverse" ]
        }
      },
      "normalizer": {
        "lowercase_ascii": {
          "type": "custom",
          "filter":  [ "lowercase", "no_accent_latin" ]
        }
      },
      "char_filter": {
        "multi_space_char_filter": ...,
        "apostrophe_remove": ...
      },
      "filter" : {
        "no_accent_latin" : {
          "type" : "icu_transform",
          "id" : "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
        }
      }
    }
  },
  "mappings" : {
    "dynamic" : "strict",
    "properties" : {
      "searchableName" : {
        "type" : "text",
        "fields": {
          "ascii_keyword": { "type": "text", "analyzer": "ascii_keyword" },
          "ascii_keyword_reverse": { "type": "text", "analyzer": "ascii_keyword_reverse" },
          "standard_ascii": { "type": "text", "analyzer": "standard_ascii" },
          "standard_ascii_reverse": { "type": "text", "analyzer": "standard_ascii_reverse" },
          "sorted_latin": { "type": "keyword", "normalizer": "lowercase_ascii" },
          ...
        }
      }
    }
  }
}

Am I correct to assume that, because the searchableName subfields in the example above use five different analyzers/normalizers that all apply icu_transform, the same text is transliterated five separate times? Can we optimize this somehow? Can filters reuse the intermediate results of other filters?
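To make the question concrete, here is a rough sketch of one workaround I could imagine: do the folding once on the client before the bulk request, and index the result into a separate field whose analyzers no longer need icu_transform. This is only an illustration under assumptions. The field name searchableNameFolded is hypothetical and would have to be added to the strict mapping. The stdlib code only approximates the "NFD; [:Nonspacing Mark:] Remove; NFC" part of the transform ID. The Any-Latin step is not covered and would need an ICU binding.

```python
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=100_000)
def fold_accents(text: str) -> str:
    """Approximate 'NFD; [:Nonspacing Mark:] Remove; NFC' using the stdlib.

    Assumption: this does NOT perform the Any-Latin transliteration step;
    that part would require an ICU binding on the client side.
    """
    # Decompose characters into base letters plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks (Unicode general category Mn).
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left.
    return unicodedata.normalize("NFC", stripped)


def build_document(name: str) -> dict:
    """Build a document that carries the folded form alongside the original.

    'searchableNameFolded' is a hypothetical field, not part of the mapping above.
    Lowercasing happens before folding, mirroring the filter order in the analyzers.
    """
    return {
        "searchableName": name,
        "searchableNameFolded": fold_accents(name.lower()),
    }


if __name__ == "__main__":
    print(build_document("Crème Brûlée"))
```

With something like this, the expensive step would run once per distinct value (and not at all for values already in the cache), and the subfields of the folded field could use plain lowercase/reverse analyzers. Whether that is actually faster than the five in-cluster transforms is something I have not measured.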

