Looks like there was a connectivity issue that prevented the nodes from communicating with each other for some time, and the cluster went into a not-available state.
There are 3 master, 6 client and 156 data nodes. There are close to 60k shards, most of them active (indexing and searching happening). After multiple restart attempts of the master nodes, they are not forming a cluster.
What I have seen is that one of the master nodes becomes the active master, all the other nodes bombard it with requests/pings, and it just dies. In the last hour, 2 of the master nodes have become the active master 3 times, each for a span of 2 - 5 minutes.
Hi @mosiddi, glad to hear you got your cluster up and running again. When you hit that java.lang.OutOfMemoryError the JVM would normally write a heap dump. Would it be possible for you to share that? It'd help greatly in determining what exactly went wrong and whether we can improve things in future. I can organise a way to share it privately if that'd help.
@DavidTurner - Unfortunately we lost the dumps during the master node migration, where we were adding new master nodes and deleting the old ones. Wish we had taken a backup somewhere. Sorry again.
@DavidTurner - I got hold of the latest dump and I see this as the prime suspect -
Problem Suspect 1
One instance of "org.elasticsearch.transport.TransportService" loaded by "sun.misc.Launcher$AppClassLoader @ 0x6ea660000" occupies 3,152,038,736 (84.85%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node" loaded by "".
Ok, first up, your master nodes only have ~4GB of heap? That seems rather small for a cluster with 100+ nodes.
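For reference, on Elasticsearch 5.x and later the heap size is set in config/jvm.options on each node. The values below are just an example of a larger heap, not a sizing recommendation for this cluster:

```
# config/jvm.options on a master node -- example values only
-Xms8g
-Xmx8g
```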
The clientHandlers map contains a handler for every in-flight request: requests that the node has sent but for which it has not yet received a response. 3GB does seem like a lot for this. My first question would be whether this is an enormous number of tiny entries, or whether there's a smaller number of entries that are themselves enormous. The entries themselves should be of type TransportService$RequestHolder, which has an action field that tells us the type of each request.
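To illustrate the diagnostic question, here is a simplified Python sketch (not the actual Elasticsearch source, which is Java): the map is keyed by request ID, each entry records the request's action, and grouping retained size by action shows whether the heap is dominated by many tiny entries or a few enormous ones. The example sizes are made up:

```python
# Hypothetical model of the in-flight request map described above.
from collections import Counter
from dataclasses import dataclass


@dataclass
class RequestHolder:
    action: str          # e.g. "cluster:monitor/nodes/stats"
    payload_bytes: int   # approximate retained size of the entry


# requestId -> holder: one entry per request sent but not yet answered
client_handlers: dict[int, RequestHolder] = {
    1: RequestHolder("internal:discovery/zen/fd/ping", 300),
    2: RequestHolder("cluster:monitor/nodes/stats", 200_000_000),
    3: RequestHolder("cluster:monitor/nodes/stats", 135_000_000),
}

# Sum retained size per action to find which request type dominates
by_action: Counter = Counter()
for holder in client_handlers.values():
    by_action[holder.action] += holder.payload_bytes

print(by_action.most_common(1)[0][0])  # prints "cluster:monitor/nodes/stats"
```

In a real heap dump, the equivalent step is grouping the TransportService$RequestHolder instances by their action field in a tool such as Eclipse MAT.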
Can you dig further into the client handlers and tell us a bit more about the in-flight requests?
Out of 510 there are 5 which are outliers (the rest are all ~200 to ~500 bytes). Of these 5, four are from the same data node and one is from a different node. They are all of type cluster:monitor/nodes/stats.
3 of them showed retained heap >200MB and 2 of them >135MB.
Those numbers don't add up to 3GB, which suggests that something large is being shared between these handlers. This can be quite tricky to pin down and I'm not sure it'll be very simple to explain the process. Would it be possible for you to share the heap dump privately with me and the Elastic team at the following link?
Thanks. It looks like those cluster:monitor/nodes/stats requests are the majority of the problem. Each one is trying to collect a few kB of stats about every shard on every node, and there look to be quite a few in flight, so it is all adding up. I would recommend:
- sending fewer of these stats requests
- sending them to a coordinating node rather than to the master
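A rough back-of-envelope, using the figures mentioned in this thread (~60k shards, "a few kB" of stats per shard, and the 5 outlier requests seen in the dump; exact values are assumptions), shows how quickly these responses add up:

```python
# Back-of-envelope for cluster:monitor/nodes/stats memory usage.
# All inputs are approximations taken from this thread, not measurements.
shards = 60_000          # total shards in the cluster
bytes_per_shard = 4_000  # "a few kB" of stats per shard (assumption)
in_flight = 5            # outlier requests observed in the heap dump

per_request = shards * bytes_per_shard  # ~240 MB per full stats response
total = per_request * in_flight         # ~1.2 GB across the outliers

print(f"per request: {per_request / 1e6:.0f} MB, total: {total / 1e9:.1f} GB")
```

A single full stats response lands in the same ballpark as the >135-200MB retained heaps observed above, so even a handful in flight can exhaust a ~4GB heap.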