I have read many threads about ProcessClusterEventTimeoutException timeouts. We are facing the same issue while creating an index. Based on my understanding, the real problem is too many updates to the cluster state causing this timeout. Too many shards is likely not the issue, IMO, because we have multiple clusters with over 15k shards each. Some clusters are happy, while others that are relatively busier face this particular problem.
We don't use dynamic mapping and have turned off automatic index creation. Common operations that update the cluster state in our use case involve:
Create Index
Delete Index
My understanding is that cluster state updates are single-threaded. Hence, if there are too many delete-index / create-index calls in a short period of time, it can lead to this timeout. Let me know if that's not the case.
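One way to observe this queue, as a rough sketch (the pending tasks endpoint is a standard cluster API), is:

# Cluster state update tasks currently queued on the elected master
GET /_cluster/pending_tasks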
In this thread, I am looking for alternatives / workarounds. It would be great to get some input from an Elastic member:
Instead of updating the cluster state via multiple calls, use a bulk-style API. For instance, instead of deleting 1 index at a time, delete 10 indices in one request. This would make only one call to update the cluster state on the ES master instead of 10 (hopefully, under the hood, it does not end up as 10 cluster state updates). A sketch of this follows the list below.
Increase the master timeout for create-index / delete-index operations (or, in general, any operation that involves a cluster state update).
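As a concrete sketch of the first idea, with placeholder index names, the hope is that a single request like this results in one cluster state update rather than ten:

# One request deleting ten indices, instead of ten separate delete requests
DELETE /logs-000001,logs-000002,logs-000003,logs-000004,logs-000005,logs-000006,logs-000007,logs-000008,logs-000009,logs-000010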
15k shards is a lot of shards, and this is likely the source of your problem. Many cluster state updates spend time going through the shards and checking that they're assigned to the right nodes, and with 15k shards I expect this will be taking quite some time. It depends on a few other variables too, so I would expect some variation across different clusters. You can investigate the time-consuming part of the update by looking at the hot threads API while the master is processing your requests.
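For example, something along these lines should show what the master is busy with (the _master node filter selects the elected master):

# Hot threads on the elected master while it is processing your create/delete requests
GET /_nodes/_master/hot_threads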
Creating an index involves multiple cluster state updates: first the index is created, then the primaries are assigned (possibly in multiple steps), then the replicas start to rebuild from the primaries (also possibly in multiple steps) and finally each shard copy is marked as started when it is ready to serve searches.
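As a rough way to watch this lifecycle (the index name below is just a placeholder), you can follow the per-shard states while the index comes up:

# Shard states for a newly created index progress from UNASSIGNED/INITIALIZING to STARTED
GET /_cat/shards/my-new-index?v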
Some cluster state updates are processed in bulk whether you submit them in bulk or not. For instance, the master processes all pending shard-started tasks at once. Creating an index is not a bulked operation. Deleting a set of indices is also not a bulked operation, but you can still delete multiple indices in a single step with
DELETE /index1,index2,...
You can set the ?master_timeout= and ?timeout= parameters on create-index and delete-index requests. The first of these determines how long the task will wait in the master's queue before being rejected, and the second determines how long it will wait to be applied on every node.
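For example, with a placeholder index name and arbitrary values, both parameters can be set like this:

# Allow the create-index task to wait longer in the master's queue, and wait longer for all nodes to apply it
PUT /my-index?master_timeout=2m&timeout=2m

# The same parameters are accepted on delete-index requests
DELETE /my-index?master_timeout=2m&timeout=2m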
Thank you so much for your response. A few follow-up questions:
For my understanding, why are 15k shards too many? We have 5 large data nodes and 3 dedicated master nodes in the cluster. The data nodes have 128G of memory each, with 30G configured as heap. Each index has 3 shards and 1 replica, well under the recommended 50G-per-shard limit.
Also, if 15k shards were too many, shouldn't we see consistent behavior across our other clusters (we have 25+ clusters)?
A linear scan through all shards on a periodic basis seems expensive. Is this a scheduled background process in the cluster, or is it only triggered under certain conditions (changes that require a cluster state update, cluster health changes, etc.)?
This is almost impossible for us since the timeouts happen randomly and we have no control over when. Hot threads would only be feasible if the issue kept happening and we got a chance to debug it live, or if we had a way to reproduce it.
We use Telegraf metrics for Elasticsearch. Would monitoring thread pool metrics indicate the frequency of cluster state updates? index_queue seems to be one of them.
Sorry, I wasn't clear with my question earlier. What I was wondering is: if we use the API you shared to delete multiple indices, instead of calling the API to delete one index at a time, would it reduce the number of cluster state updates? For instance, would one API call deleting 10 indices be more efficient than 10 API calls deleting one index each?
The rule of thumb is to limit yourself to around 20 shards per GB of heap, which for 5x30GB nodes would recommend around 3000 shards. Here is a more detailed explanation:
As I said, it depends on other factors too.
No, it only occurs when necessary, during certain cluster state updates (including those that create or delete indices).
Yes, a delete operation on multiple indices will delete them all in a single cluster state update.
I have used ES in production for two years and also contributed a bit to the code base. Hence, I would love to understand the reason behind it. I am hearing two contrasting things:
It depends on multiple factors
Somehow, keep the number of shards to around 20 per GB of heap space
Ok great, as a contributor you can see much more detail on the sorts of things on which this depends in org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator and in particular the methods allocateUnassigned(), moveShards() and balanceByWeights() in BalancedShardsAllocator.Balancer.
The shards-per-GB thing also depends on multiple factors as described in the article I shared. A "rule of thumb" is not a hard limit, it is a guideline intended to apply in enough cases to be useful. You are exceeding this guideline by a factor of 5.