We have a number of periodic indexing tasks that bulk-index a few (let's say 10) million documents daily, partitioned into pages of 5K documents (we have 200 tasks, each handling one 5K-document page).
When many pages (more than 80) are pushed to Elasticsearch for indexing simultaneously, Elasticsearch's index queue gets flooded and both search and indexing performance degrade cluster-wide. Increasing the index queue_size gives some improvement, but it seems like a band-aid.
Unfortunately our task queue does not let us throttle the indexing tasks on the application side.
So, is there a way within Elasticsearch to throttle indexing tasks (without dropping the index requests, of course)?
Latency is not a big problem in our case; indexing taking longer is acceptable.
We have around 10 nodes in the cluster (Elasticsearch v2), each acting as a master-eligible data node.
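For illustration, each task does something along these lines (a minimal sketch with the Python elasticsearch client; the endpoint, index name, doc type, and document shape are all assumptions):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://es-node-1:9200"])  # assumed endpoint

def index_page(page_of_docs, index_name="daily_docs", doc_type="doc"):
    """Send one 5K-document page to Elasticsearch as a bulk request."""
    actions = (
        {
            "_op_type": "index",
            "_index": index_name,
            "_type": doc_type,   # mapping types still exist in Elasticsearch 2.x
            "_id": doc["id"],    # assumes each document carries its own id
            "_source": doc,
        }
        for doc in page_of_docs
    )
    # helpers.bulk splits the actions into chunks (500 per request by default)
    # and returns (number indexed, list of per-item errors).
    return helpers.bulk(es, actions, raise_on_error=False)
```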
Queues are useful for handling variable load. When they fill up, the subsequent rejections are a form of backpressure that clients should use to throttle themselves. If for some reason throttling is not acceptable, it means the cluster is underprovisioned for the load and needs additional capacity.
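For example, a client can treat those 429 rejections as its signal to slow down: retry the rejected items with exponential backoff instead of piling on more requests. Here is a minimal sketch along those lines with the Python elasticsearch client; the endpoint, index name, retry count, and delays are assumptions to be tuned:

```python
import time

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node-1:9200"])  # assumed endpoint

def index_page_with_backoff(docs, index_name="daily_docs", doc_type="doc",
                            max_retries=5, initial_delay=2.0):
    """Index one page, re-sending only rejected items, with exponential backoff."""
    pending = list(docs)
    delay = initial_delay
    for _ in range(max_retries + 1):
        # Build the bulk body for whatever is still pending.
        body = []
        for doc in pending:
            body.append({"index": {"_index": index_name, "_type": doc_type,
                                   "_id": doc["id"]}})
            body.append(doc)
        resp = es.bulk(body=body)

        # Status 429 on an item means the thread-pool queue was full; that is
        # the cluster's backpressure signal, not a fatal error. (A real client
        # would also inspect non-429 item failures.)
        rejected = [doc for doc, item in zip(pending, resp["items"])
                    if item["index"].get("status") == 429]
        if not rejected:
            return
        pending = rejected
        time.sleep(delay)   # back off before retrying the rejected items
        delay *= 2
    raise RuntimeError("%d documents still rejected after %d retries"
                       % (len(pending), max_retries))
```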
Let’s think through, though, what it would mean for Elasticsearch to throttle itself without the clients backing off. It means Elasticsearch has to buffer all of these requests. Eventually it will run out of capacity to do that and will have to start rejecting requests, so all we have done is move the backpressure problem. You might say: hold on, it won’t run out of capacity if I put a giant disk behind it and let Elasticsearch spill the requests to disk. But then we have just reinvented a persistent queue, and we already have a solution for that: Logstash.
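To make that concrete, here is a toy sketch in Python (nothing Elasticsearch-specific, just an arbitrary bounded buffer) showing that any finite buffer only relocates the rejection point:

```python
import queue

# An arbitrary finite capacity; however large we make it, it is still finite.
pending_requests = queue.Queue(maxsize=100)

def accept_request(request):
    """Buffer a request instead of indexing it immediately."""
    try:
        pending_requests.put_nowait(request)
    except queue.Full:
        # The buffer is exhausted, so we are back to rejecting requests:
        # the backpressure has moved, not disappeared.
        raise RuntimeError("buffer full, request rejected")
```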
So here’s where I am on this: if you’re overwhelming Elasticsearch, you either need to make your cluster bigger or apply throttling on the client side.
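If your tasks share a worker process, one way to apply that throttling client side is to cap the number of pages in flight with a semaphore (a sketch; the limit of 10 concurrent bulk requests is an assumption to tune):

```python
import threading

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://es-node-1:9200"])   # assumed endpoint
in_flight = threading.BoundedSemaphore(10)      # at most 10 concurrent bulk pages

def index_page_throttled(actions):
    """Block until a slot frees up, then send the page as one bulk request."""
    with in_flight:  # tasks wait here instead of flooding the cluster's queue
        helpers.bulk(es, actions)
```

Since you said longer indexing times are acceptable, having tasks block like this simply trades latency for cluster stability.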