I was investigating slow Bulk API indexing, as discussed in the thread "Bulk index slowing down as index size increases".
The root cause seems to be the ICU transform filters, which are slow once we start indexing Chinese data.
Our index settings and mappings look like this:
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "ascii_keyword": {
          "tokenizer": "keyword",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin", "trim" ]
        },
        "ascii_keyword_reverse": {
          "tokenizer": "keyword",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin", "trim", "reverse" ]
        },
        "standard_ascii": {
          "tokenizer": "standard",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin" ]
        },
        "standard_ascii_reverse": {
          "tokenizer": "standard",
          "char_filter": [ "multi_space_char_filter", "apostrophe_remove" ],
          "filter": [ "lowercase", "no_accent_latin", "reverse" ]
        }
      },
      "normalizer": {
        "lowercase_ascii": {
          "type": "custom",
          "filter": [ "lowercase", "no_accent_latin" ]
        }
      },
      "char_filter": {
        "multi_space_char_filter": ...,
        "apostrophe_remove": ...
      },
      "filter" : {
        "no_accent_latin" : {
          "type" : "icu_transform",
          "id" : "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
        }
      }
    }
  },
  "mappings" : {
    "dynamic" : "strict",
    "properties" : {
      "searchableName" : {
        "type" : "text",
        "fields": {
          "ascii_keyword": { "type": "text", "analyzer": "ascii_keyword" },
          "ascii_keyword_reverse": { "type": "text", "analyzer": "ascii_keyword_reverse" },
          "standard_ascii": { "type": "text", "analyzer": "standard_ascii" },
          "standard_ascii_reverse": { "type": "text", "analyzer": "standard_ascii_reverse" },
          "sorted_latin": { "type": "keyword", "normalizer": "lowercase_ascii" },
          ...
        }
      }
    }
  }
}
Am I correct to assume that, because the searchableName sub-fields in the example above use 5 different analyzers/normalizers that all include icu_transform, we transliterate the same text 5 separate times? Can we optimize this somehow? Can filters reuse the intermediate results of other filters?
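
To make the count of 5 concrete, here is a small Python sketch (not part of our actual setup) that walks a trimmed copy of the definition above and lists the searchableName sub-fields whose analyzer or normalizer chain contains a filter of type icu_transform. The helper name icu_transform_subfields is made up for illustration, and the elided char filters and remaining sub-fields ("...") are left out. It only inspects the configuration; it does not measure how often Elasticsearch actually runs the transform.

```python
# Trimmed copy of the index definition above. The elided char filters and
# the remaining sub-fields ("...") are intentionally left out.
index_body = {
    "settings": {"analysis": {
        "analyzer": {
            "ascii_keyword": {"tokenizer": "keyword",
                              "filter": ["lowercase", "no_accent_latin", "trim"]},
            "ascii_keyword_reverse": {"tokenizer": "keyword",
                                      "filter": ["lowercase", "no_accent_latin", "trim", "reverse"]},
            "standard_ascii": {"tokenizer": "standard",
                               "filter": ["lowercase", "no_accent_latin"]},
            "standard_ascii_reverse": {"tokenizer": "standard",
                                       "filter": ["lowercase", "no_accent_latin", "reverse"]},
        },
        "normalizer": {
            "lowercase_ascii": {"type": "custom",
                                "filter": ["lowercase", "no_accent_latin"]},
        },
        "filter": {
            "no_accent_latin": {"type": "icu_transform",
                                "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"},
        },
    }},
    "mappings": {"properties": {"searchableName": {"type": "text", "fields": {
        "ascii_keyword": {"type": "text", "analyzer": "ascii_keyword"},
        "ascii_keyword_reverse": {"type": "text", "analyzer": "ascii_keyword_reverse"},
        "standard_ascii": {"type": "text", "analyzer": "standard_ascii"},
        "standard_ascii_reverse": {"type": "text", "analyzer": "standard_ascii_reverse"},
        "sorted_latin": {"type": "keyword", "normalizer": "lowercase_ascii"},
    }}}},
}


def icu_transform_subfields(body, field):
    """Return the sub-fields of `field` whose chain includes an icu_transform filter."""
    analysis = body["settings"]["analysis"]
    # Names of all custom token filters declared with type icu_transform.
    icu_filters = {name for name, cfg in analysis.get("filter", {}).items()
                   if cfg.get("type") == "icu_transform"}
    # Analyzers and normalizers are looked up by name in one merged table.
    chains = {**analysis.get("analyzer", {}), **analysis.get("normalizer", {})}
    hits = []
    sub_fields = body["mappings"]["properties"][field].get("fields", {})
    for sub_name, sub_cfg in sub_fields.items():
        chain_name = sub_cfg.get("analyzer") or sub_cfg.get("normalizer")
        chain = chains.get(chain_name, {})
        if icu_filters & set(chain.get("filter", [])):
            hits.append(sub_name)
    return hits


result = icu_transform_subfields(index_body, "searchableName")
print(len(result), result)
```

For the five sub-fields shown, every chain references no_accent_latin, which is where my assumption of 5 separate transliterations per value comes from.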