Failed to process cluster event Exception

Pradeep_Gowda · October 10, 2017, 11:29pm

I am currently running Elasticsearch as a 3-node cluster with around 15000 indices. Each node has 64GB of memory and a 16-core CPU.

My logs are flooded with the message below whenever I try to do some operation on any index through the Java client.

Exception: Failed to execute query: ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [IndexName]) within 30s]
	at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349) [elasticsearch-2.4.1.jar:2.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_72]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_72]
	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_72]

Can I get some help on this?

mujtabahussain · October 11, 2017, 12:18am

Out of the three nodes, do you have dedicated master nodes?

Based on the log snippet you have shown, you are executing a PUT mapping event. Where is it being executed from and how is it applying to the cluster?

Pradeep_Gowda · October 11, 2017, 12:34am

Hi @mujtabahussain
All are master and data nodes. There is no dedicated master. I am using the Java client to create an index on node1.

Some other logs are

[2017-10-10 17:31:59,733][DEBUG][action.admin.indices.create] [vm14005] [indexname] failed to create
ProcessClusterEventTimeoutException[failed to process cluster event (create-index [indexname], cause [api]) within 30s]
	at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index1][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.translog           ] [vm14002] [index2][0] interval [5s], flush_threshold_ops [2147483647], flush_threshold_size [512mb], flush_threshold_period [30m]
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index3][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index4][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index5][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index6][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index7][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index8][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index9][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index10][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index11][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes
[2017-10-10 17:31:52,084][DEBUG][index.shard              ] [vm14002] [index12][0] updating index_buffer_size from [13mb] to [13mb]; IndexWriter now using [0] bytes

mujtabahussain · October 11, 2017, 12:36am

Can you quickly try and create that index and associated mappings with the REST API Client and see if you still get the same error? This should isolate whether the issue is in the Java Client or the setup of the cluster.

Pradeep_Gowda · October 11, 2017, 1:24am

On the Java client, I am not seeing any exception. From the client side everything looks good, but in the Elasticsearch log there are tons of exceptions.

Pradeep_Gowda · October 11, 2017, 1:43am

I am also seeing around 6000 pending tasks. Is this causing the issue?
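A rough sketch of how the pending task queue could be inspected: the `_cluster/pending_tasks` endpoint returns the cluster state updates queued on the master, each with a `source` and a `time_in_queue_millis`. The base URL, the helper names, and the sample data below are made up for illustration; only the endpoint and the field names come from Elasticsearch.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_pending_tasks(base_url="http://localhost:9200"):
    """Fetch the master's pending task queue (base_url is an assumption)."""
    with urlopen(base_url + "/_cluster/pending_tasks") as resp:
        return json.load(resp)


def summarize_pending_tasks(response):
    """Count queued tasks by kind and report the longest wait in milliseconds."""
    tasks = response.get("tasks", [])
    # "source" looks like "create-index [name], cause [api]";
    # the text before the first space identifies the kind of task.
    kinds = Counter(task["source"].split(" ", 1)[0] for task in tasks)
    longest = max((task["time_in_queue_millis"] for task in tasks), default=0)
    return kinds, longest


# Made-up sample in the shape of a pending_tasks response, not real cluster output.
sample = {"tasks": [
    {"insert_order": 101, "priority": "URGENT",
     "source": "create-index [indexname], cause [api]", "time_in_queue_millis": 31000},
    {"insert_order": 102, "priority": "HIGH",
     "source": "put-mapping [IndexName]", "time_in_queue_millis": 30500},
    {"insert_order": 103, "priority": "HIGH",
     "source": "put-mapping [IndexName]", "time_in_queue_millis": 12000},
]}

kinds, longest = summarize_pending_tasks(sample)
print(kinds.most_common(), longest)
```

If the longest wait is above 30000 ms, that task has already been queued longer than the 30s limit named in the exception.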

warkolm · October 11, 2017, 3:02am

Is that the default 5 shards? If so that's 75000 shards, or 25000 per node.

If that's accurate then that's going to be entirely the problem, so reducing that (use _shrink, or delete some old indices) is the best idea.

Pradeep_Gowda · October 11, 2017, 3:07am

@warkolm No, there are three shards. One primary and two replicas. How can I solve this without deleting? All nodes are master and data nodes.

warkolm · October 11, 2017, 3:07am

So you have 45000 shards, 15000 per node?

Pradeep_Gowda · October 11, 2017, 3:13am

Yes, that's correct.

warkolm · October 11, 2017, 3:14am

That's still way too many. By a factor of at least 50.

Pradeep_Gowda · October 11, 2017, 3:16am

So how can I solve this? Adding new nodes? Optimizing indexing?

warkolm · October 11, 2017, 3:21am

Reduce your shard count. Adding more nodes will help, but aiming for 300-500 shards per node means 90 nodes, which is probably not what you want.

Why do you have 2 replicas?
Are you using time based indices?
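The shard and node arithmetic in this exchange can be reproduced with a few lines. This is only a sketch using the numbers quoted in the thread (15000 indices, 1 primary and 2 replicas each, 3 nodes, a target of 300-500 shards per node); the function names are made up for illustration.

```python
import math


def total_shards(indices, primaries_per_index, replicas_per_primary):
    """Every primary shard plus all of its replica copies."""
    return indices * primaries_per_index * (1 + replicas_per_primary)


def nodes_needed(shards, shards_per_node):
    """Smallest node count that keeps each node at or under the target."""
    return math.ceil(shards / shards_per_node)


shards = total_shards(indices=15000, primaries_per_index=1, replicas_per_primary=2)
print(shards)                     # 45000 shards in the cluster
print(shards // 3)                # 15000 shards on each of the 3 nodes
print(nodes_needed(shards, 500))  # 90 nodes at 500 shards per node
print(nodes_needed(shards, 300))  # 150 nodes at 300 shards per node
```

At the lower end of the target, 15000 shards per node against 300 is the factor of 50 mentioned above.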

Pradeep_Gowda · October 11, 2017, 3:26am

Yes, those are time-based indices. I have around 75 different categories and each customer has around 20 indices. These indices are rolled over when they hit 20GB.

warkolm · October 11, 2017, 3:37am

Where'd the 20GB size come from?

Pradeep_Gowda · October 11, 2017, 3:39am

That is the limit I maintain for each index. Periodically I check the size of the index, and if it exceeds the limit I create another index and move the write alias to the new index.
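A minimal sketch of that check-and-swap step, assuming the swap is done through the `_aliases` API with a `remove` and an `add` action in one request. The alias and index names are hypothetical, and treating 20GB as 20 * 1024^3 bytes is an assumption.

```python
import json

# The 20GB limit from the post, assumed here to mean 20 * 1024^3 bytes.
SIZE_LIMIT_BYTES = 20 * 1024**3


def needs_rollover(index_size_bytes, limit_bytes=SIZE_LIMIT_BYTES):
    """True once the index has grown past the size limit."""
    return index_size_bytes > limit_bytes


def alias_swap_body(alias, old_index, new_index):
    """Request body for POST /_aliases that moves the alias in a single call."""
    return {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]}


# Hypothetical names; the new index has to be created before the alias is moved.
if needs_rollover(21 * 1024**3):
    body = alias_swap_body("customer1-write", "customer1-000001", "customer1-000002")
    print(json.dumps(body, indent=2))
```

Each index creation and each alias change is a cluster state update that the elected master has to process, so this scheme adds to the same queue that is timing out.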

warkolm · October 11, 2017, 3:40am

But how and why did you pick that size? Did you do load testing or did you just pick a number?

Pradeep_Gowda · October 11, 2017, 3:52am

I picked this magic number for multiple reasons:

To distribute aggregation across indices.
To make sure data is available when the total index size is small and a purge won't delete all records.

This was part of a load test, and I ended up in this scenario.

warkolm · October 11, 2017, 3:58am

Not sure I follow you here. You only have one primary shard, so how does that distribute across the index?

What do you mean by the last part of that? If you are using time based indices why wouldn't the data be available if the index exists?

That's good to hear!

But the problem is you have fixed one problem and created another due to over sharding. As things stand, you are going to be wasting a lot of resources on just managing shards, and it's likely causing this timeout.

If you double your shard size you halve the number of shards needed. That's a great start, and it'd be interesting to see what that means for your performance.

Also, the Elastic Blog post "How many shards should I have in my Elasticsearch cluster?" is a great read on this topic.
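To make the halving concrete, here is a rough sketch. It assumes, purely for illustration, that every one of the 15000 indices is already at the 20GB limit (an upper bound, not a measured figure) and that each index keeps a single primary shard; the function name is made up.

```python
import math


def shard_count(total_primary_gb, rollover_gb, replicas):
    """Shards needed if each index holds one primary of at most rollover_gb."""
    indices = math.ceil(total_primary_gb / rollover_gb)
    return indices * (1 + replicas)


# Hypothetical upper bound: 15000 indices * 20GB each.
total_gb = 15000 * 20

print(shard_count(total_gb, rollover_gb=20, replicas=2))  # 45000, the current layout
print(shard_count(total_gb, rollover_gb=40, replicas=2))  # 22500, shard size doubled
print(shard_count(total_gb, rollover_gb=40, replicas=1))  # 15000, and one replica dropped
```

Even the last figure is 5000 shards per node on 3 nodes, so doubling is a start rather than the whole fix.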

Pradeep_Gowda · October 11, 2017, 4:04am

@warkolm Maybe I am going in the wrong direction, as you suggested. Thank you for your suggestion. One last quick question: how much data can I store in one index in terms of size? Will it impact query and aggregation?
