200% CPU - Elasticsearch 5 index creation very slow with a huge synonyms list

teseo · December 14, 2016, 4:38pm

While migrating an Elasticsearch index from 2.3.3 to 5.1.1, we noticed that index creation time has risen from under 20 seconds to 17 minutes. This configuration is from a development environment in a Vagrant box.

An overview of the settings and mappings looks like this:

{
"index": "your_index_name",
"settings": {
        "index.requests.cache.enable": true,
        "index.unassigned.node_left.delayed_timeout": "5m",
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "char_filter": {
                #YOUR CHAR FILTERS
            },
            "analyzer": {
                "your_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "char_filter": ["custom_pattern_1", "custom_pattern_2", "custom_pattern_3"],
                    "filter": ["lowercase", "massive_synonym_list_filter", "long_synonym_list_filter"]
                }
            },
            "filter": {
                "long_synonym_list_filter": {
                    "type": "keep",
                    "keep_words": ["list-of-25k-words"],
                    "keep_words_case": false
                },
                "massive_synonym_list_filter": {
                    "tokenizer": "keyword",
                    "type": "synonym",
                    "synonyms": ["list-of-40k-synonyms"]
                }
            }
        }
    },
"mappings": {
        ...
        #YOUR MAPPING
        ...
    } }

If we take the massive synonym list out of the equation, the index is created very quickly, but unfortunately we do need the synonym lists we were benefiting from in ES 2.3.3.

I'm using the latest official Elasticsearch PHP client provided by Elastic. I'm not using a file to store the words and synonyms; I'm adding them as an array within the analysis filter settings.

EDIT
Here is a hot threads dump taken while creating the index at 200% CPU usage: http://pastebin.com/5tJJNGBC

Many thanks!

Clinton_Gormley · December 16, 2016, 5:10pm

It looks like the strict settings checks are taking their toll. Not sure if there is anything to do to speed things up when specifying so many synonyms inline. Try putting them into a file instead - hopefully that'll bring back the speed.
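
A minimal sketch of the file-based approach suggested here, written in Python rather than the PHP client used in the thread. `synonyms_path` and `keep_words_path` are the file options of the `synonym` and `keep` token filters, resolved relative to each node's config directory. The file names and helper names below are assumptions made for illustration.

```python
from pathlib import Path


def write_word_file(path, lines):
    """Write one entry per line as UTF-8 and return the resulting Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def file_based_filters(synonyms_file, keep_words_file):
    """Build the "filter" section with file references instead of inline lists.

    Both arguments are paths relative to the Elasticsearch config directory.
    The filter names are the ones used in the settings overview above.
    """
    return {
        "long_synonym_list_filter": {
            "type": "keep",
            "keep_words_path": keep_words_file,
            "keep_words_case": False,
        },
        "massive_synonym_list_filter": {
            "type": "synonym",
            "tokenizer": "keyword",
            "synonyms_path": synonyms_file,
        },
    }
```

The returned dictionary would replace the inline `"filter"` block in the index settings, with the generated files copied to the same relative path on every node.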

fguery · December 16, 2016, 5:39pm

Hello Clinton! I'm working on that as well, and I'm a bit concerned about having to store that in files. It would mean a full cluster restart is required every time we add a synonym, right?

Clinton_Gormley · December 16, 2016, 5:53pm

No, closing and reopening the index would work. It's no worse than specifying them inline.
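
The close and reopen cycle can be sketched against the REST API as follows. The `_close` and `_open` endpoints are part of Elasticsearch's index API; the base URL, index name, and function names are assumptions for illustration, and error handling is left out.

```python
import urllib.request


def reload_steps(base_url, index):
    """Return the (method, url) pairs that close and then reopen an index."""
    base = base_url.rstrip("/")
    return [
        ("POST", f"{base}/{index}/_close"),
        ("POST", f"{base}/{index}/_open"),
    ]


def reload_analyzers(base_url, index, timeout=30):
    """Close and reopen the index, as suggested in the reply above.

    The index does not serve reads or writes while it is closed.
    """
    for method, url in reload_steps(base_url, index):
        request = urllib.request.Request(url, method=method)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()
```

For example, `reload_analyzers("http://localhost:9200", "your_index_name")` would issue both requests in order against a local node.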

fguery · December 16, 2016, 6:50pm

All right, that's good news. Thanks!

xavierfacq · December 16, 2016, 10:15pm

Note that you have to copy your synonyms file to all nodes. New indices will load the latest version of the file, but later changes are not applied to existing indices.

s1monw · December 19, 2016, 8:23am

I opened pull request #22249, "Speed up filter and prefix settings operations" (elastic/elasticsearch on GitHub), to improve the situation. From what I can tell, it brings back the same performance we had in 2.x. That said, I think settings are not the right place for such a massive list of synonyms; it's simply not the right data structure for this. You should definitely use a file, which has the same semantics as the index settings. But please keep this in mind: if you change your synonym list and you use your token filter for indexing, you are basically corrupting your index. For correct results, you have to reindex every time you change the list. If it's used for searching, the situation is different.

xavierfacq wrote: "Note that you have to copy your synonyms file to all nodes. New indices will load the latest version of the file, but later changes are not applied to existing indices."

If your index references a file, it will pick it up every time an index is allocated on a node. This can be even trickier, since some of the shards might have a new synonym file while others don't, e.g. if you relocate a shard to a node where no shard of that index exists at the time the relocation happens.
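
One way to notice the drift described above is to compare a checksum of the synonym file across nodes before relying on it. This is only a sketch: how the per-node digests are collected (SSH, a deployment tool, and so on) is outside its scope, and the node names and function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_nodes_by_digest(digests_by_node):
    """Group node names by the digest of their local synonym file."""
    groups = defaultdict(list)
    for node, digest in sorted(digests_by_node.items()):
        groups[digest].append(node)
    return dict(groups)


def files_in_sync(digests_by_node):
    """Return True when every node reports the same digest."""
    return len(set(digests_by_node.values())) <= 1
```

If `files_in_sync` returns False, `group_nodes_by_digest` shows which nodes hold which version of the file.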

teseo · December 19, 2016, 9:30am

Hello Simon,

Many thanks for your response and for the GitHub pull request. I'm going to work through the recommendations in this thread.

Much appreciated, guys!

fguery · December 19, 2016, 9:56am

Thanks all, we're looking forward to the fix then!

We're aware of the index corruption issue; we will reindex most of the time, but sometimes it's just not worth it.
We also use some synonyms only on the query side, and having to deal with files for that is quite annoying (so is closing and reopening the index, but I guess there's no better solution).
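
A query-side-only setup can be sketched by giving a field separate index-time and search-time analyzers through the `analyzer` and `search_analyzer` mapping parameters. The analyzer names, filter name, field name, and mapping type below are assumptions for illustration, and the `standard` tokenizer stands in for whatever the real analyzer uses.

```python
def query_side_synonym_body(synonyms_file, field="title", doc_type="doc"):
    """Build an index body whose synonyms are applied at search time only.

    Indexed tokens never pass through the synonym filter, so changing the
    synonym list does not invalidate what is already in the index.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "query_synonyms": {
                        "type": "synonym",
                        "synonyms_path": synonyms_file,
                    }
                },
                "analyzer": {
                    "index_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    },
                    "query_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "query_synonyms"],
                    },
                },
            }
        },
        "mappings": {
            # Elasticsearch 5.x mappings are keyed by a mapping type name.
            doc_type: {
                "properties": {
                    field: {
                        "type": "text",
                        "analyzer": "index_analyzer",
                        "search_analyzer": "query_analyzer",
                    }
                }
            }
        },
    }
```

With this layout an updated list still has to be picked up by the index, but no reindex should be needed, because stored tokens were produced without the synonym filter.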

jprante · December 19, 2016, 10:56am

I use synonyms via a reference mapper that gets fields from another index, which eliminates the need for synonym files: jprante/elasticsearch-analysis-reference on GitHub, a reference mechanism for including content from other documents during the Elasticsearch analysis field mapping phase. It works when the index is rebuilt regularly. It does not work with synonyms on the client side for query expansion. Query expansion can come with a huge cost.
