I have read many threads about ProcessClusterEventTimeoutException timeouts. We are facing the same issue while creating an index. Based on my understanding, the real issue is too many updates to cluster state causing this timeout. Too many shards is likely not the issue IMO, because we have multiple clusters with over 15k shards each: some clusters are happy, while others that are relatively busier face this particular problem.
We don't use dynamic mapping and have turned off auto index creation. The common operations that update cluster state in our use case are:
- Create Index
- Delete Index
My understanding is that cluster state updates are processed single-threaded on the master. Hence, if there are too many delete index / create index calls in a short period of time, they can lead to this timeout. Let me know if that's not the case.
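As a side note, I believe this backlog can be observed via the pending tasks API: if create-index / delete-index tasks sit in the queue for longer than the master timeout, that would match the single-threaded-queue explanation. A minimal sketch, assuming a node is reachable on localhost:9200:

```
# List the cluster state update tasks currently queued on the elected master
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'

# Each entry shows its priority, its source (e.g. a create-index task),
# and time_in_queue, which indicates how backed up the queue is
```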
In this thread, I am looking for alternatives / workarounds. It would be great to get some input from an Elastic member:
- Instead of updating cluster state via multiple calls, batch the calls. For instance, instead of deleting 1 index at a time, delete 10 indices in one request. This makes one call to the ES master vs 10 calls (hopefully, under the hood, it does not still end up as 10 cluster state updates). See the first sketch after this list.
- Increase the master timeout for create/delete index operations (or, in general, any operation that involves a cluster state update). See the second sketch after this list.
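For the first point, I believe the delete index API already accepts a comma-separated list of indices, so a batched delete can be expressed as a single request. A minimal sketch, with hypothetical index names, assuming a node on localhost:9200 (hopefully this is submitted as one cluster state update task rather than ten; confirmation from an Elastic member would be appreciated):

```
# Delete ten indices in one request instead of ten separate DELETE calls
curl -s -XDELETE 'localhost:9200/myindex-001,myindex-002,myindex-003,myindex-004,myindex-005,myindex-006,myindex-007,myindex-008,myindex-009,myindex-010'
```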
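For the second point, a sketch of what I have in mind, assuming the create/delete index APIs honor the master_timeout query parameter (the index name is hypothetical):

```
# Raise the master timeout from the default 30s for a create index call
curl -s -XPUT 'localhost:9200/myindex-011?master_timeout=120s' \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 1}}'

# Same for a delete index call
curl -s -XDELETE 'localhost:9200/myindex-011?master_timeout=120s'
```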
Thank you in advance.