While migrating an Elasticsearch index from 2.3.3 to 5.1.1, we noticed that creating the index went from less than 20 seconds to 17 minutes. This configuration is from a development environment in a Vagrant box.
An overview of the settings & mappings would be like this:
If we take the massive synonyms list out of the equation, the index gets created very quickly, but unfortunately we do need the synonym lists we were benefiting from in ES 2.3.3.
I'm using the latest official Elasticsearch PHP client provided by Elastic. I'm not using a file to store the words and synonyms; I'm adding them as an array within the analysis filter settings.
It looks like the strict settings checks are taking their toll. Not sure if there is anything to do to speed things up when specifying so many synonyms inline. Try putting them into a file instead - hopefully that'll bring back the speed.
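For reference, the two ways of supplying synonyms discussed here can be sketched as index-creation bodies. This is a minimal Python sketch, not the original poster's PHP code: the filter name my_synonyms, the analyzer name with_synonyms, the sample rules and the path analysis/synonyms.txt are all made up for illustration. The synonym token filter accepts either synonyms (inline rules) or synonyms_path (a file path relative to the node's config directory).

```python
import json


def synonym_settings(inline_rules=None, synonyms_path=None):
    """Build an index-creation body with a synonym token filter.

    Pass exactly one of:
      inline_rules  -- list of Solr-format rules, stored in the index settings
      synonyms_path -- file path relative to the node's config directory
    """
    if (inline_rules is None) == (synonyms_path is None):
        raise ValueError("pass either inline_rules or synonyms_path")

    synonym_filter = {"type": "synonym"}
    if inline_rules is not None:
        synonym_filter["synonyms"] = list(inline_rules)
    else:
        synonym_filter["synonyms_path"] = synonyms_path

    return {
        "settings": {
            "analysis": {
                # "my_synonyms" and "with_synonyms" are hypothetical names.
                "filter": {"my_synonyms": synonym_filter},
                "analyzer": {
                    "with_synonyms": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_synonyms"],
                    }
                },
            }
        }
    }


# Inline variant: every rule becomes part of the index settings.
inline_body = synonym_settings(inline_rules=["tv, television", "laptop, notebook"])

# File variant: the settings only carry a path (hypothetical file name).
file_body = synonym_settings(synonyms_path="analysis/synonyms.txt")

print(json.dumps(file_body, indent=2))
```

With the file variant the settings stay small no matter how long the synonym list gets, which is why it sidesteps the settings checks mentioned above.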
Hello Clinton! I'm working on that as well, and I'm a bit concerned about having to store that in files. It'd mean a full cluster restart is required every time we add a synonym, right?
Note that you have to copy your synonyms file to all nodes. New indices will load the latest version of the file; later changes are not applied to existing indices.
I opened pull request #22249 ("Speed up filter and prefix settings operations", elastic/elasticsearch on GitHub) to improve the situation. From what I can tell it brings back the same performance we had in 2.x. That said, I think settings are not the right place for such a massive list of synonyms. It's simply not the right data structure for this. You should totally use a file; the file has the same semantics as the index settings. But please keep this in mind: if you change your synonym list and you use your token filter for indexing, you are basically corrupting your index. For correct results you have to reindex every time you change the list. If it's only used for searching, the situation is different.
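To illustrate the indexing versus searching distinction: a field can be indexed with one analyzer and searched with another, so the synonym filter only runs on the query side and changing the list does not invalidate what is already indexed. A minimal Python sketch of such an index body, assuming 5.x-style mappings; the mapping type doc, the field title, the analyzer and filter names and the file path are hypothetical.

```python
def search_time_synonym_index(synonyms_path):
    """Index body where synonyms are applied only at search time."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; synonyms come from a file.
                    "query_synonyms": {
                        "type": "synonym",
                        "synonyms_path": synonyms_path,
                    }
                },
                "analyzer": {
                    # Used when documents are indexed: no synonym filter.
                    "plain": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    },
                    # Used when queries are analyzed: same chain plus synonyms.
                    "plain_with_synonyms": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "query_synonyms"],
                    },
                },
            }
        },
        "mappings": {
            "doc": {  # hypothetical mapping type name
                "properties": {
                    "title": {  # hypothetical field
                        "type": "text",
                        "analyzer": "plain",
                        "search_analyzer": "plain_with_synonyms",
                    }
                }
            }
        },
    }


body = search_time_synonym_index("analysis/synonyms.txt")
```

Because the indexed tokens never contain synonyms in this setup, a changed list only affects how later queries are expanded.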
If your index references a file, it will pick it up every time an index is allocated on a node. This can be even more tricky, since some of the shards might have a new synonym file while others don't, i.e. if you relocate a shard to a node where no shard of that index exists at the time the relocation happens.
Thanks all, we're looking forward to the fix then!
We're aware of the corruption of the index. We will reindex most of the time, but sometimes it's just not really worth it.
We also use some synonyms only on the query side, and having to deal with files for that is quite annoying (so is opening/closing the index, but I guess there's no better solution).
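For anyone following along, the open/close dance for inline query-side synonyms looks roughly like this. Analysis settings are not dynamic, so the index has to be closed before they can be updated. The Python sketch below only builds the three REST calls and does not send them; the index name, filter name and rules are hypothetical.

```python
import json


def synonym_update_steps(index, filter_name, rules):
    """Return the (method, path, body) calls needed to replace an inline
    synonym list on an existing index: close, update settings, reopen."""
    new_settings = {
        "analysis": {
            "filter": {
                filter_name: {"type": "synonym", "synonyms": list(rules)}
            }
        }
    }
    return [
        ("POST", "/%s/_close" % index, None),
        ("PUT", "/%s/_settings" % index, json.dumps(new_settings)),
        ("POST", "/%s/_open" % index, None),
    ]


# Hypothetical index and filter names, for illustration only.
steps = synonym_update_steps("products", "query_synonyms", ["tv, television"])
for method, path, body in steps:
    print(method, path, body or "")
```

The index is unavailable for searches between the close and the open, which is the annoying part.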