200% CPU - Elasticsearch 5 index creation very slow with a huge synonyms list

While migrating an Elasticsearch index from 2.3.3 to 5.1.1, we noticed that index creation time rose from less than 20 seconds to 17 minutes. This configuration is from a development environment in a Vagrant box.

An overview of the settings and mappings looks like this:

{
"index": "your_index_name",
"settings": {
        "index.requests.cache.enable": true,
        "index.unassigned.node_left.delayed_timeout": "5m",
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "char_filter": {
                #YOUR CHAR FILTERS
            },
            "analyzer": {
                "your_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "char_filter": ["custom_pattern_1", "custom_pattern_2", "custom_pattern_3"],
                    "filter": ["lowercase", "massive_synonym_list_filter", "long_synonym_list_filter"]
                }
            },
            "filter": {
                "long_synonym_list_filter": {
                    "type": "keep",
                    "keep_words": ["list-of-25k-words"],
                    "keep_words_case": false
                },
                "massive_synonym_list_filter": {
                    "tokenizer": "keyword",
                    "type": "synonym",
                    "synonyms": ["list-of-40k-synonyms"]
                }
            }
        }
    },
"mappings": {
        ...
        #YOUR MAPPING
        ...
    }
}

If we take the massive synonym list out of the equation, the index is created very quickly, but unfortunately we do need the synonym lists we were benefiting from in ES 2.3.3.

I'm using the latest official Elasticsearch PHP client provided by Elastic, and I'm not storing the words and synonyms in a file but passing them as arrays within the analysis filter settings.

EDIT
Here is a hot threads dump taken while creating the index at 200% CPU usage: http://pastebin.com/5tJJNGBC

Many thanks!

It looks like the strict settings checks are taking their toll. Not sure if there is anything to do to speed things up when specifying so many synonyms inline. Try putting them into a file instead - hopefully that'll bring back the speed.
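As a rough sketch of the file-based approach suggested above, the snippet below writes an inline synonym list to a file, one rule per line, and builds a filter definition that points at it through `synonyms_path`. The helper names, the sample rules and the path `analysis/synonyms.txt` are illustrative assumptions, not taken from this thread; the path is assumed to be resolved relative to the Elasticsearch config directory, and the file would have to exist on every node.

```python
import json
import os
import tempfile


def write_synonyms_file(synonyms, path):
    """Write one synonym rule per line (Solr format, e.g. "tv, television")."""
    with open(path, "w", encoding="utf-8") as handle:
        for rule in synonyms:
            handle.write(rule.strip() + "\n")
    return path


def file_based_synonym_filter(relative_path):
    """Build a synonym filter definition that reads its rules from a file.

    relative_path is assumed to be relative to the Elasticsearch config directory.
    """
    return {
        "type": "synonym",
        "tokenizer": "keyword",
        "synonyms_path": relative_path,
    }


if __name__ == "__main__":
    rules = ["tv, television", "laptop, notebook"]  # stand-in for the 40k list
    target = os.path.join(tempfile.mkdtemp(), "synonyms.txt")
    write_synonyms_file(rules, target)
    settings = {
        "analysis": {
            "filter": {
                # Hypothetical location of the file on each node.
                "massive_synonym_list_filter": file_based_synonym_filter(
                    "analysis/synonyms.txt"
                )
            }
        }
    }
    print(json.dumps(settings, indent=2))
```

The keep filter could be treated the same way with `keep_words_path`, so that neither large list travels inside the index settings.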

Hello Clinton! I'm working on that as well, and I'm a bit concerned about having to store that in files - it'd mean a full cluster restart is required every time we add a synonym, right?

No - closing and opening the index would work. It's no worse than specifying them inline.
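The close/open cycle mentioned here could be scripted roughly as follows. This is a minimal sketch using only the Python standard library against the `_close` and `_open` index endpoints; the base URL `http://localhost:9200`, the absence of authentication and the function names are assumptions. The `send` parameter is injectable so the sequence can be exercised without a running cluster.

```python
import urllib.request


def reload_steps(base_url, index):
    """Return the (method, url) pairs for closing and then reopening an index."""
    base = base_url.rstrip("/")
    return [
        ("POST", f"{base}/{index}/_close"),
        ("POST", f"{base}/{index}/_open"),
    ]


def reload_analyzers(base_url, index, send=None):
    """Close and reopen the index so its analyzers are rebuilt on open.

    `send(method, url)` defaults to a plain HTTP call; pass a stub to dry-run.
    """
    if send is None:
        def send(method, url):
            request = urllib.request.Request(url, method=method)
            with urllib.request.urlopen(request) as response:
                return response.status
    return [send(method, url) for method, url in reload_steps(base_url, index)]
```

For example, `reload_analyzers("http://localhost:9200", "your_index_name")` would issue the two requests in order. Keep in mind that the index cannot serve searches or accept writes while it is closed.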

All right, that's good news. Thanks!

Note that you have to copy your synonyms file to all nodes. New indices will load the latest version of the file, but later changes are not applied to existing indices.

I opened pull request #22249 on elastic/elasticsearch ("Speed up filter and prefix settings operations", by s1monw) to improve the situation. From what I can tell, it brings back the same performance we had in 2.x. That said, I think settings are not the right place for such a massive list of synonyms. It's simply not the right data structure for this. You should definitely use a file; the file has the same semantics as the index settings. But please keep this in mind: if you change your synonym list and you use your token filter for indexing, you are basically corrupting your index. For correct results you have to reindex every time you change the list. If it's used only for searching, the situation is different.

"Note that you have to copy your synonyms file to all nodes. New indices will load the latest version of the file, but later changes are not applied to existing indices."

If your index references a file, it will pick it up every time an index is allocated on a node. This can be even trickier, since some of the shards might have a new synonym file while others don't, e.g. if you relocate a shard to a node where no shard of that index exists at the time the relocation happens.
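One way to guard against shards seeing different versions of the file is to compare a checksum of the synonyms file from each node before relying on it. The sketch below is only an illustration: how the digests are collected from the nodes (SSH, configuration management and so on) is left out, and the function names and node names are hypothetical.

```python
import hashlib
from collections import Counter


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def nodes_out_of_sync(digests_by_node):
    """Given {node_name: digest}, list the nodes that differ from the majority."""
    if not digests_by_node:
        return []
    majority, _count = Counter(digests_by_node.values()).most_common(1)[0]
    return sorted(
        node for node, digest in digests_by_node.items() if digest != majority
    )
```

Each node would run `file_digest` on its local copy, and the collected results would be passed to `nodes_out_of_sync`. An empty result suggests that every node holds the same file.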

Hello Simon,

Many thanks for your response and for the GitHub issue. I'm going to work through the recommendations in this thread.

Much appreciated guys!

Thanks all, we're looking forward to the fix then!

We're aware of the index corruption issue; we will reindex most times, but sometimes it's just not really worth it.
We also use some synonyms only on the query side, and having to deal with files for that is quite annoying (so is opening/closing the index, but I guess there's no better solution).
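For the query-side case, a field can use one analyzer at index time and a different `search_analyzer` that adds the synonym filter. The sketch below builds such an index body as a plain dictionary. The analyzer names, the filter name `query_synonyms`, the type `your_type` and the field `title` are all placeholders rather than anything from this thread.

```python
import json


def build_index_body(synonyms_path):
    """Index body in which synonyms are applied only when analyzing queries."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; rules are read from a file.
                    "query_synonyms": {
                        "type": "synonym",
                        "tokenizer": "keyword",
                        "synonyms_path": synonyms_path,
                    }
                },
                "analyzer": {
                    "indexing_analyzer": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": ["lowercase"],
                    },
                    "search_with_synonyms": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": ["lowercase", "query_synonyms"],
                    },
                },
            }
        },
        "mappings": {
            "your_type": {
                "properties": {
                    "title": {
                        "type": "text",
                        "analyzer": "indexing_analyzer",
                        "search_analyzer": "search_with_synonyms",
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body("analysis/synonyms.txt"), indent=2))
```

Because the indexed terms never pass through the synonym filter in this layout, a changed list only affects how queries are analyzed, which matches the point made earlier that search-only use is a different situation from index-time use.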

I use synonyms through a reference mapper that gets fields from another index, which eliminates the need for synonym files: jprante/elasticsearch-analysis-reference on GitHub, a reference mechanism for including content from other documents during the Elasticsearch analysis field mapping phase. It works when the index is rebuilt regularly. It does not work with synonyms on the client side for query expansion. Query expansion can come at a huge cost.
