Elasticsearch memory requirements / tuning for zero down time indexing


(sarya) #1

We are trying to implement a zero down time indexing model in our Elasticsearch 1.5 cluster (five shards and one replica). We have multiple data sets of 250 million documents each, with about 700 fields per document, which we are trying to refresh monthly. However, we have noticed that indexing speed has been deteriorating: the run time has more than doubled and now takes a week, compared to 48 hours the first time we indexed the data.
We have a 4-node cluster, each node with 4 CPUs and 30 GB RAM, of which 15 GB is allocated to Elasticsearch. We are looking for performance tuning tips to speed up the indexing process.
Do we need more RAM, more CPU per node, or more nodes?
Thanks
Subra


(eliasah) #2

Zero down time (re)indexing can't be done no matter what hardware configuration you have!


(sarya) #3

We have achieved minimal down time using aliases. Our problem is how to improve the performance of re-indexing, rather than the down time.


(eliasah) #4

I have already answered that question, you can't do a zero down-time re-indexing!

You may consider re-indexing into a new index and switching the alias once re-indexing has finished, but that's the only thing you can do for now.
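
For reference, the alias switch could look roughly like the sketch below. It assumes a node reachable at http://localhost:9200; the alias name `products` and the index names `products_2015_05` and `products_2015_06` are made up for illustration. Both actions go in a single `_aliases` request, so the alias moves from the old index to the new one atomically.

```python
import json
import urllib.request


def alias_swap_body(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves `alias` in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def post_json(url, body):
    """POST `body` as JSON and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical names; replace with your own alias and indices.
    body = alias_swap_body("products", "products_2015_05", "products_2015_06")
    print(json.dumps(body, indent=2))
    # Uncomment to run against a real cluster (assumed URL):
    # post_json("http://localhost:9200/_aliases", body)
```

Searches keep hitting the alias throughout, so clients never need to know which physical index is behind it.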


(Christian Dahlqvist) #5

Heavy indexing can use a lot of CPU and can also result in a lot of disk IO as segments are merged. Monitor your nodes when performing indexing in order to identify what is the limiting factor for your use case.

One way to improve indexing throughput is to set the number of replicas to 0 during indexing and then increase it when indexing has completed. This does however reduce the reliability of the cluster, as only a single copy of each shard is kept, which may or may not be an acceptable trade-off. Another common way to increase indexing throughput is to increase the refresh interval, in order to force larger segments to be created from the start and thereby reduce the merging activity.
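
As a rough sketch of the two settings changes described above: both `index.number_of_replicas` and `index.refresh_interval` can be updated through the index `_settings` endpoint. The base URL, the index name `products_2015_06`, and the `30s` value are assumptions for illustration only; the replica count of 1 and the `1s` refresh interval used afterwards match the cluster described in the question and the Elasticsearch default.

```python
import json
import urllib.request


def bulk_load_settings(refresh_interval="30s"):
    """Settings to apply before heavy indexing: no replicas, slower refresh."""
    return {"index": {"number_of_replicas": 0,
                      "refresh_interval": refresh_interval}}


def serving_settings(replicas=1, refresh_interval="1s"):
    """Settings to restore once indexing has completed."""
    return {"index": {"number_of_replicas": replicas,
                      "refresh_interval": refresh_interval}}


def put_settings(base_url, index, settings):
    """PUT the settings to <base_url>/<index>/_settings and return the reply."""
    req = urllib.request.Request(
        "{}/{}/_settings".format(base_url, index),
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(json.dumps(bulk_load_settings(), indent=2))
    print(json.dumps(serving_settings(), indent=2))
    # Against a real cluster (assumed URL and hypothetical index name):
    # put_settings("http://localhost:9200", "products_2015_06", bulk_load_settings())
    # ... run the bulk indexing job ...
    # put_settings("http://localhost:9200", "products_2015_06", serving_settings())
```

Raising the replica count back to 1 afterwards triggers a copy of each shard to another node, so allow time for that recovery before switching traffic to the new index.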

A good discussion about performance tuning for indexing can be found here.

Another option might be to spin up a separate temporary cluster, possibly in cloud, to perform the indexing on and then transfer the indexed data to the production cluster using the snapshot and restore mechanism. This might allow you to use a cluster specifically tuned for pure indexing with fast SSD disks and lots of CPU and reduce the indexing time significantly while at the same time minimising the load on the production cluster.
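
A minimal sketch of that snapshot and restore flow, written as the sequence of REST calls involved. It assumes a shared-filesystem (`fs`) repository that both clusters can reach; the repository name, path, snapshot name, and index name are all hypothetical. A cloud setup would more likely use a repository plugin instead of `fs`.

```python
import json


def snapshot_restore_plan(repo, location, snapshot, index):
    """Return (cluster, method, path, body) steps for moving one index."""
    repo_body = {"type": "fs", "settings": {"location": location}}
    return [
        # 1. Register the repository on the temporary indexing cluster.
        ("indexing", "PUT", "/_snapshot/{}".format(repo), repo_body),
        # 2. Snapshot only the freshly built index and wait until it is done.
        ("indexing", "PUT",
         "/_snapshot/{}/{}?wait_for_completion=true".format(repo, snapshot),
         {"indices": index}),
        # 3. Register the same repository on the production cluster.
        ("production", "PUT", "/_snapshot/{}".format(repo), repo_body),
        # 4. Restore the index from the snapshot on production.
        ("production", "POST",
         "/_snapshot/{}/{}/_restore".format(repo, snapshot),
         {"indices": index}),
    ]


if __name__ == "__main__":
    # Hypothetical names and path, for illustration only.
    plan = snapshot_restore_plan("reindex_repo", "/mnt/es_backups",
                                 "snap_2015_06", "products_2015_06")
    for cluster, method, path, body in plan:
        print(cluster, method, path, json.dumps(body))
```

The restored index must not clash with an open index of the same name on production, which fits the approach of building each refresh under a new index name and then switching an alias to it.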


(sarya) #6

Christian,
Can we change the refresh interval while indexing is in progress, or would that cause any faults?
Thanks


(Christian Dahlqvist) #7

The refresh interval is a dynamic setting that can be changed at any time.

