Though we only have 31 shards in the test cluster, we observe a lot of shard-failed tasks:
225 add_listener
1 cluster_reroute(async_shard_fetch)
1 delayed_allocation_reroute
6 master
106945 shard-failed
1 zen-disco-node-failed
21 zen-disco-node-join
Killing a node normally does not reproduce the problem: in that scenario recovery is quick, with a reasonable number of shard-failed tasks.
@ywelsch
We use a plugin to reject bulk requests when JVM old generation usage is high. The test cluster no longer hangs, but we observed an interesting issue, which may be the real problem: 5 minutes after running the tc command above, many transport_server_worker threads on the master are blocked in the ShardFailedTransportHandler.messageReceived() method, and calls to the master transport address sometimes take 10+ seconds.
We suspect that the master of the production cluster, whose log was provided earlier, was also blocked by this problem, and that this is why a new master was elected.
That's definitely an interesting theory. The stack trace indicates that the volume of "shard failed" events handled on the master appears to overwhelm it and block network threads. The logs from the production cluster are inconclusive: they show that the master node has trouble reconnecting to the rest of the cluster, but there are also minutes going by without any log information. In case this happens again in production, can you take stack dumps / heap dumps of the master?
Can you also provide more information on all the steps you took to reproduce this, in particular the ones which led to the 106945 shard-failed events in the test cluster? Thanks.
We will share the heap dump of the master next time.
Our new test cluster environment:
Elasticsearch cluster:
3 machines, each with 4 data nodes: 12 cores ×4, 64 GB ×4 memory
A separate machine is used as a dedicated master
Allocation awareness: cluster.routing.allocation.awareness.attributes: ip
Elasticsearch plugin:
Rejects bulk requests and shard bulk requests when JVM old generation usage is greater than 85% (a rough sketch of such a check follows below).
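A minimal sketch of the kind of old-generation check such a plugin might perform, assuming the standard JMX memory pool beans; the pool-name matching below is an illustrative assumption, not the actual plugin code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Sketch only: decide whether an incoming (shard) bulk request should be rejected
// based on old-generation heap usage. The 85% threshold comes from the description
// above; the "Old Gen" pool-name matching is an assumption for illustration.
public final class OldGenBackpressure {
    private static final double OLD_GEN_THRESHOLD = 0.85;

    public static boolean shouldRejectBulk() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && pool.getName().contains("Old Gen")) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null && usage.getMax() > 0
                        && (double) usage.getUsed() / usage.getMax() > OLD_GEN_THRESHOLD) {
                    return true; // caller would answer the bulk with a rejection
                }
            }
        }
        return false;
    }
}
```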
Steps to reproduce:
We use the tc command to simulate a hardware failure and reproduce the problem:
Start the cluster
Create an index with 12 shards, each with 1 replica
Do heavy indexing on the data nodes: run multiple concurrent bulk requests against all data nodes continuously, each with 5k documents; the bulks consume about 50% of the CPU on every data machine (see the sketch after this list).
Use the tc command to randomly drop packets on one data machine.
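For illustration only, a stripped-down version of what the bulk load in step 3 could look like; the host, index name, mapping type, and document body are assumptions rather than the actual test code, and in the real test several such loops run concurrently against every data node:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative bulk load generator: continuously POSTs NDJSON bodies of 5k documents
// to the _bulk endpoint. Host, index name, type and document content are placeholders.
public final class BulkLoad {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9200/_bulk");
        while (true) { // run until killed; the test keeps the indexing pressure constant
            StringBuilder body = new StringBuilder();
            for (int i = 0; i < 5000; i++) {
                body.append("{\"index\":{\"_index\":\"test\",\"_type\":\"doc\"}}\n");
                body.append("{\"field\":\"value-").append(i).append("\"}\n");
            }
            byte[] payload = body.toString().getBytes(StandardCharsets.UTF_8);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-ndjson");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(payload);
            }
            conn.getResponseCode(); // drain the response; error handling omitted for brevity
            conn.disconnect();
        }
    }
}
```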
The code does not seem to deduplicate shard-failed tasks. It would be nice if you could provide a patch that deduplicates those tasks, so that we can check whether the theory is right.
@ywelsch We found that duplicate checking is based on task identity in TaskBatcher.java, which leads to a huge number of duplicated shard-failed tasks in _cat/pending_tasks. After changing the IdentityHashMap to a ConcurrentHashMap and overriding the equals method of ShardEntry, the number of shard-failed tasks drops sharply. Now the test cluster can recover to green.
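To make the identity-versus-equality point concrete, here is a small self-contained demo; the ShardEntryKey class is a simplified stand-in, not the real ShardEntry or TaskBatcher code:

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Two "shard failed" entries for the same shard and allocation arrive as distinct
// objects, so an identity-based map keeps both, while a map that relies on a
// value-based equals/hashCode collapses the duplicate onto the first entry.
public final class DedupDemo {
    static final class ShardEntryKey {
        final String shardId;      // e.g. "[index][3]"
        final String allocationId; // the allocation being failed

        ShardEntryKey(String shardId, String allocationId) {
            this.shardId = shardId;
            this.allocationId = allocationId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof ShardEntryKey)) return false;
            ShardEntryKey other = (ShardEntryKey) o;
            return shardId.equals(other.shardId) && allocationId.equals(other.allocationId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(shardId, allocationId);
        }
    }

    public static void main(String[] args) {
        ShardEntryKey a = new ShardEntryKey("[index][3]", "alloc-1");
        ShardEntryKey b = new ShardEntryKey("[index][3]", "alloc-1"); // same failure, new object

        Map<ShardEntryKey, String> byIdentity = new IdentityHashMap<>();
        byIdentity.put(a, "task-1");
        byIdentity.put(b, "task-2");
        System.out.println(byIdentity.size()); // 2 -> duplicates pile up

        Map<ShardEntryKey, String> byEquality = new ConcurrentHashMap<>();
        byEquality.put(a, "task-1");
        byEquality.put(b, "task-2");
        System.out.println(byEquality.size()); // 1 -> duplicates collapse
    }
}
```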
Is checking for duplicates based on task identity required for some cluster state update tasks, or is it just a bug?
Task identity is currently used to establish the relation between requests and responses (for each request there is exactly one response). Deduplicating the shard-failed events can and should be done on the sender side (ShardStateAction), which is something that has been on my TODO list already. This will not only reduce the load on the master, but also reduce the number of messages that have to flow between the data nodes and the master node. I also want to check whether there's anything that can be done to make TaskBatcher more efficient when dealing with a large number of tasks. I'll be off for 2 days now, but will open an issue and start working on a fix by the end of the week.
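A sketch of how sender-side deduplication could work in principle, assuming a per-node tracker of in-flight notifications; this is not the actual ShardStateAction change, and the key format and method names are placeholders:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// At most one "shard failed" notification per shard/allocation key is in flight at a
// time; later callers for the same key skip sending and wait for the first outcome.
public final class ShardFailedDeduplicator {
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Returns true if this caller should actually send the request to the master. */
    public boolean tryAcquire(String shardAndAllocationKey) {
        return inFlight.add(shardAndAllocationKey);
    }

    /** Called when the master acknowledges or rejects the failure. */
    public void release(String shardAndAllocationKey) {
        inFlight.remove(shardAndAllocationKey);
    }
}
```

Usage would follow the pattern: if tryAcquire returns true, send the shard-failed request and call release from the response listener; otherwise an identical notification is already in flight and its result can be reused.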