How to improve bulk indexing of huge amount of docs


(Rohanil Raje) #1

Currently, I am using Elasticsearch 5.6 as a data store for my go application. I have a million documents spread in 12 indices. Each index has 5 shards and 2 replicas and they all are running on 4 nodes. First, I load those docs from Elasticsearch and then process them and then index them in bulk with 10000 docs/batch rate. I run 3 workers which have one async goroutine per worker at a time. Those goroutines are per index so they send bulk index requests per index. That means a worker sends around 100,000 docs in a goroutine. The docs are sent in batches, each goroutine sends almost 10 batches. This entire stuff takes more than a minute. Most of the time is taken in bulk indexing.

My current Elasticsearch is running with 6GB RAM and 3.5GB heap size. I tried to tune Elasticsearch to improve indexing speed by increasing index buffer size to 20% ie. 700MB. I disabled indexing for fields which don't need indexing. I optimised numeric fields types in mappings. I disabled _all field. I changed index codec (compression method) to best_compression. After doing all this, there is not much improvement.

So I would like to get ideas to improve bulk indexing performance to finish all process within a minute. Will it be improved if I add more RAM and heap size to Elasticsearch? Any other settings/tuning?


(system) #2

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.