Failed to process cluster event Exception

That depends on the shard count. That blog provides a lot of great info.

Potentially. You seem to be OK with testing, so look at setting up Rally with different shard counts/sizes.

The initial idea was to have one primary and two replicas to handle failover. I will go through the documents and get back to you with the test results.

Is your data important enough and the failure risk high enough to warrant two replicas though? I don't know your environment and requirements, but I figured I'd ask.

Having two replicas isn't a bad idea, you just need to be aware of the tradeoffs beyond just having 50% more data/shards/indices in your cluster :slight_smile:

Yes, the data is very important and failover must be handled. Adding an extra node is not a problem. I just need to know how far I can go in terms of scaling with respect to index count and aggressive indexing and aggregating.

Hi @warkolm
Coming back to the main problem, why am I seeing lots of pending tasks on each node? Let's consider my current setup with no replicas.
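
For reference, I'm checking the queue with the pending tasks API, roughly like this (a sketch using the Python requests library against a node on localhost:9200):

```python
# Summarise the cluster's pending task queue by source.
from collections import Counter

import requests

resp = requests.get("http://localhost:9200/_cluster/pending_tasks")
tasks = resp.json().get("tasks", [])

print(f"{len(tasks)} pending tasks")
for source, count in Counter(t["source"] for t in tasks).most_common(10):
    print(f"{count:6d}  {source}")
```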

Have a look at hot threads, it might add more context.

But it's likely because you have too many shards.
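
Something like this will dump hot threads from every node so you can see what they are actually busy with (a sketch, assuming the Python requests library and a node reachable on localhost:9200):

```python
# Fetch hot threads from all nodes; the API returns plain text, not JSON.
import requests

resp = requests.get(
    "http://localhost:9200/_nodes/hot_threads",
    params={"threads": 3, "interval": "500ms"},
)
print(resp.text)
```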

Most of them are shard initialization and indexing tasks. Some list the reason as NODE_LEFT, but I haven't seen any node get disconnected. Do you recommend increasing the default timeout of 30s? And which settings should I look into for scaled deployments like mine?

You should reduce your shard count.

I know I keep repeating that, but it's important. Each time a shard needs to be moved or changes state, a node leaves or joins, an index is created or updated, or a mapping is created or updated, the cluster master needs to update the cluster state.

In 5.X cluster states are propagated as deltas, whereas in 2.X we'd apply the change and then push the full state out to each node. So 5.X is much more efficient with this.

But when you have 45000 shards, that means the node routing/mapping state is huge. Irrespective of applying cluster state changes as just the deltas, iterating through a routing state of that size just to move even a single shard is a big job.

Changing settings here only ignores the problem. Reducing your shard count for this many nodes is the real solution.
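
If it helps, a quick way to see where those shards actually sit before deciding what to consolidate (a sketch, assuming the Python requests library and a node reachable on localhost:9200):

```python
# Count shards per node and per index using the cat shards API.
from collections import Counter

import requests

resp = requests.get(
    "http://localhost:9200/_cat/shards",
    params={"h": "index,node", "format": "json"},
)
shards = resp.json()

per_node = Counter(s["node"] or "UNASSIGNED" for s in shards)
per_index = Counter(s["index"] for s in shards)

print("shards per node:", dict(per_node))
print("top 10 indices by shard count:")
for index, count in per_index.most_common(10):
    print(f"{count:6d}  {index}")
```

Indices that come out of that with far more shards than they need are candidates for reindexing into fewer primaries (or, on 5.x, the shrink API).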

To give you more context, look at this: https://twitter.com/fdevillamil/status/917044462622330880/photo/1

  • 39899 shards
  • 186 nodes

They also have dedicated master nodes, with relatively powerful infrastructure.

You have 3 nodes and 45000 shards. There's a fair bit of a difference there.

@warkolm Thank you very much. I will definitely look into reducing the number of shards for my use case and into testing it on 5.x. Will get back to you if I have any queries.

I have a quick question on this. How did we come up with 300-500 shards? Does this number still hold when each shard is running at max capacity (2.1 billion records)? If I reduce the document count per shard to 1 billion, can I have 600 to 1000 shards? How do I find out how many shards I can have on a node?

The recommendation around number of shards per node in a cluster described in the blog post Mark linked to earlier in this thread is based on what we see in the field across multiple types of deployments. It is meant as a guideline and is not an absolute limit.

What drives this is the memory overhead that each shard/segment comes with, which is not linear. Larger shards/segments typically have a lower proportion of overhead relative to the data volume than smaller ones. Having twice the number of shards for the same number of documents is therefore likely to result in more overhead.
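
If you want to put a rough number on that overhead for your own cluster, the segments memory reported in node stats is a reasonable first approximation (a sketch, assuming the Python requests library and a node reachable on localhost:9200):

```python
# Report per-node heap spent on segment metadata via the node stats API.
import requests

resp = requests.get("http://localhost:9200/_nodes/stats/indices/segments")
for node_id, node in resp.json()["nodes"].items():
    seg = node["indices"]["segments"]
    mb = seg["memory_in_bytes"] / (1024 * 1024)
    print(f"{node['name']:20s} segments memory: {mb:8.1f} MB "
          f"({seg['count']} segments)")
```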

Exactly how much overhead your cluster can handle depends on how much heap querying and indexing require, and this will depend a lot on the use case. I know of users running with more shards than what we recommended in the blog post, but also of other users who had to set the limit lower.

The only way to accurately find out what the limit is for your use case is to test and benchmark with your data and work load.
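
If you don't want to jump straight into a full Rally setup, even a crude experiment along these lines shows the trend when you vary the shard count (a rough sketch, not a real benchmark; it assumes the Python requests library, a throwaway index name, and a 5.x node on localhost:9200):

```python
# Create a test index with a given shard count, bulk-index synthetic
# documents, and time it. Repeat with different NUM_SHARDS values.
import json
import time

import requests

ES = "http://localhost:9200"
INDEX = "shard-test"        # hypothetical throwaway index
NUM_SHARDS = 5              # vary this between runs
BATCHES, BATCH_SIZE = 20, 1000

requests.delete(f"{ES}/{INDEX}")  # ignore the 404 on the first run
requests.put(f"{ES}/{INDEX}", json={
    "settings": {"number_of_shards": NUM_SHARDS, "number_of_replicas": 0}
})

start = time.time()
for _ in range(BATCHES):
    lines = []
    for i in range(BATCH_SIZE):
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps({"field": "value", "n": i}))
    # 5.x-style typed bulk endpoint; newer versions use /{index}/_bulk
    requests.post(
        f"{ES}/{INDEX}/doc/_bulk",
        data="\n".join(lines) + "\n",
        headers={"Content-Type": "application/x-ndjson"},
    )

elapsed = time.time() - start
print(f"{NUM_SHARDS} shards: {BATCHES * BATCH_SIZE} docs in {elapsed:.1f}s")
```

Rally will give you far more realistic numbers (real documents, proper warmup, latency percentiles), so treat something like this only as a sanity check.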
