Failed to process cluster event Exception

I am currently running Elasticsearch in cluster mode with 3 nodes and approximately 15,000 indices. Each node has 64GB of memory and a 16-core CPU.

My logs are flooded with the message below whenever I try to perform operations on any index through the Java client.

Exception: Failed to execute query: ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [IndexName]) within 30s]
	at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349) [elasticsearch-2.4.1.jar:2.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_72]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_72]
	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_72]

Can I get some help on this?

Of the three nodes, do you have any dedicated master nodes?

Based on the log snippet you have shown, you are executing a put-mapping request. Where is it being executed from, and how is it being applied to the cluster?

Hi @mujtabahussain
All nodes are master and data nodes; there is no dedicated master. I am using the Java client to create indices on node1.
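Roughly, the index creation looks like this (a minimal sketch; the cluster name, address, index name, and mapping are placeholders rather than my real values):

    import java.net.InetAddress;

    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class CreateIndexSketch {
        public static void main(String[] args) throws Exception {
            Settings settings = Settings.settingsBuilder()
                    .put("cluster.name", "my-cluster")   // placeholder cluster name
                    .build();

            TransportClient client = TransportClient.builder().settings(settings).build()
                    .addTransportAddress(new InetSocketTransportAddress(
                            InetAddress.getByName("node1"), 9300));
            try {
                // Creating the index (and putting mappings) is a cluster-state update that
                // the master has to apply; that is the step timing out after 30s in the logs.
                client.admin().indices().prepareCreate("indexname")
                        .setSettings(Settings.settingsBuilder()
                                .put("index.number_of_shards", 1)
                                .put("index.number_of_replicas", 2))
                        .addMapping("doc", "{\"doc\":{\"properties\":{\"field1\":{\"type\":\"string\"}}}}")
                        .get();
            } finally {
                client.close();
            }
        }
    }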

Some other log entries are:

[2017-10-10 17:31:59,733][DEBUG][action.admin.indices.create] [vm14005] [indexname] failed to create
ProcessClusterEventTimeoutException[failed to process cluster event (create-index [indexname], cause [api]) within 30s]
	at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index1][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.translog           ] [vm14002] [index2][0] interval [5s], flush_threshold_ops [2147483647], flush_threshold_size [512mb], flush_threshold_period [30m]
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index3][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index4][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index5][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index6][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index7][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index8][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index9][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index10][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index11][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index12][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes

Can you quickly try to create that index and its associated mappings with the REST API and see if you still get the same error? That should isolate whether the issue is in the Java client or in the setup of the cluster.
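For example, something along these lines goes through the HTTP API on port 9200 and takes the transport client out of the picture (a sketch; host, index name, settings, and mapping are placeholders):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RestCreateIndexSketch {
        public static void main(String[] args) throws Exception {
            // Same kind of index definition, but sent over HTTP instead of the transport client.
            String body = "{"
                    + "\"settings\":{\"number_of_shards\":1,\"number_of_replicas\":2},"
                    + "\"mappings\":{\"doc\":{\"properties\":{\"field1\":{\"type\":\"string\"}}}}"
                    + "}";

            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://node1:9200/indexname").openConnection(); // placeholder host and index
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            // 200 with {"acknowledged":true} means the index was created; a timeout here
            // points at the cluster rather than the client.
            System.out.println("HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }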

With the Java client I am not seeing any exception. From the client side everything looks good, but the Elasticsearch logs are full of these exceptions.

I am also seeing around 6000 pending tasks. Is this causing the issue?
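I am reading that count with roughly this (a sketch; the client is the same transport client as above):

    import org.elasticsearch.action.admin.cluster.tasks.PendingClusterTasksResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.cluster.service.PendingClusterTask;

    public class PendingTasksSketch {
        // "client" is the same TransportClient as in the earlier snippet.
        static void printPendingTasks(Client client) {
            PendingClusterTasksResponse response = client.admin().cluster()
                    .preparePendingClusterTasks()
                    .get();

            System.out.println("pending tasks: " + response.getPendingTasks().size());
            for (PendingClusterTask task : response.getPendingTasks()) {
                // Each entry is a queued cluster-state update, e.g. a put-mapping or create-index.
                System.out.println(task.getSource());
            }
        }
    }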

Is that the default 5 shards? If so that's 75000 shards, or 25000 per node.

If that's accurate then that's going to be the entire problem, so reducing that (use _shrink, or delete some old indices) is the best idea.
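You can confirm the totals from the cluster health response; with the Java client that is roughly (a sketch, reusing the same client as above):

    import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
    import org.elasticsearch.client.Client;

    public class ShardCountSketch {
        // "client" is the same TransportClient as in the earlier snippets.
        static void printShardCounts(Client client) {
            ClusterHealthResponse health = client.admin().cluster()
                    .prepareHealth()
                    .get();

            // Active shards include replicas; primaries are reported separately.
            System.out.println("active primary shards: " + health.getActivePrimaryShards());
            System.out.println("active shards (incl. replicas): " + health.getActiveShards());
            System.out.println("data nodes: " + health.getNumberOfDataNodes());
        }
    }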


@warkolm No, there are three shards per index: one primary and two replicas. How can I solve this without deleting indices? All nodes are master and data nodes.

So you have 45000 shards, 15000 per node?

Yes, that's correct.

That's still way too many, by a factor of at least 50.

So how can I solve this? By adding new nodes? By optimizing indexing?

Reduce your shard count. Adding more nodes will help, but aiming for 300-500 shards per node means 90 nodes, which is probably not what you want.

Why do you have 2 replicas?
Are you using time based indices?
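For example, dropping from two replicas to one would cut your total shard count by a third without touching any data; with the 2.x Java client that is roughly (a sketch; the index pattern is a placeholder):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.Settings;

    public class ReduceReplicasSketch {
        // "client" is the same TransportClient as in the earlier snippets.
        static void dropToOneReplica(Client client) {
            // Applies to every index matching the (placeholder) pattern; the master simply
            // removes the extra replica copies, no reindexing is needed.
            client.admin().indices().prepareUpdateSettings("logs-*")
                    .setSettings(Settings.settingsBuilder()
                            .put("index.number_of_replicas", 1))
                    .get();
        }
    }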

Yes, those are time-based indices. I have around 75 different categories, and each customer has around 20 indices. These indices are rolled over when they hit 20GB.

Where'd the 20GB size come from?

That is the limit I am maintaining for each index. I periodically check the size of the index, and if it exceeds the limit I create another index and move the write alias to the new index.
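The rollover itself is just an alias swap, roughly like this (a sketch; index and alias names are placeholders):

    import org.elasticsearch.client.Client;

    public class RolloverSketch {
        // "client" is the same TransportClient as in the earlier snippets.
        static void moveWriteAlias(Client client, String oldIndex, String newIndex) {
            // Atomically moves the write alias from the full index to the freshly created one.
            client.admin().indices().prepareAliases()
                    .removeAlias(oldIndex, "write-alias")   // placeholder alias name
                    .addAlias(newIndex, "write-alias")
                    .get();
        }
    }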

But how and why did you pick that size? Did you do load testing or did you just pick a number?

I picked this magic number for a couple of reasons:

  1. To distribute aggregations across indices.
  2. To make sure data is available when the total index size is small, so a purge won't delete all records.

This was as part of a load test, and I ended up in this scenario.

Not sure I follow you here. You only have one primary shard, so how does that distribute across the index?

What do you mean by the last part of that? If you are using time based indices why wouldn't the data be available if the index exists?

That's good to hear!

But the problem is that you have fixed one problem and created another through oversharding. As things stand, you are going to be wasting a lot of resources just managing shards, and that is likely what is causing this timeout.

If you double your shard size you halve the number of shards you need; that's a great start, and it'd be interesting to see what that means for your performance.

Also, How many shards should I have in my Elasticsearch cluster? | Elastic Blog is a great read on this topic.

@warkolm Maybe I am going in the wrong direction, as you suggested. Thank you for your suggestions. One last quick question: how much data can I store in one index in terms of size? Will it impact queries and aggregations?