Cluster is down and master nodes are not coming up

Looks like there was a connectivity issue that prevented the nodes from communicating with each other for some time, and the cluster went into an unavailable state.

There are 3 master, 6 client, and 156 data nodes. There are close to 60k shards, most of them active (indexing and search happening). After multiple restart attempts of the master nodes, they are not forming a cluster.

Some of the errors seen -

[es-m02-rm] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/172.16.0.53:58051}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.close(...)(Unknown Source) ~[?:?]

[es-m01-rm] fatal error in thread [elasticsearch[es-m01-rm][generic][T#184]], exiting
java.lang.OutOfMemoryError: Java heap space

Any inputs on how to fix this?

What version are you on?

Have you considered splitting your cluster up?

The version is 6.2 and the cluster is PROD, so I'm looking for something that can help right now. Splitting is a good suggestion, and I think we should do that once the cluster is back up.

Should I do this for now -

  • Stop ES on all nodes
  • Start master nodes
  • Let them form the cluster
  • Start query nodes
  • Add data nodes in batches

What I have seen is that one of the master nodes becomes the active master, all nodes bombard it with requests/pings, and it just dies. So far, 2 of the master nodes have become the active master 3 times, each for a span of 2-5 minutes, in the last hour.
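
For the batched restart, here is a minimal sketch of waiting for the cluster to stabilise before starting the next batch, so the newly elected master is not flooded by every node at once. The localhost:9200 endpoint, the WaitForNodes class name, and the crude string check on number_of_nodes are assumptions for illustration, not the exact procedure used.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Polls GET /_cluster/health until the cluster reports the expected number of nodes,
// then exits so the next batch of nodes can be started.
public class WaitForNodes {
    public static void main(String[] args) throws Exception {
        int expectedNodes = Integer.parseInt(args[0]); // e.g. 3 after the masters, 9 after the client nodes, ...
        while (true) {
            URL url = new URL("http://localhost:9200/_cluster/health");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                for (String line; (line = in.readLine()) != null; ) {
                    body.append(line);
                }
            }
            // Crude check on the compact JSON response; a JSON parser would be more robust.
            if (body.toString().contains("\"number_of_nodes\":" + expectedNodes + ",")) {
                System.out.println("Cluster has " + expectedNodes + " nodes; safe to start the next batch.");
                return;
            }
            Thread.sleep(10_000L); // wait 10 seconds before polling again
        }
    }
}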

Closing the loop - the above steps worked.

Hi @mosiddi, glad to hear you got your cluster up and running again. When you hit that java.lang.OutOfMemoryError the JVM would normally write a heap dump. Would it be possible for you to share that? It'd help greatly in determining what exactly went wrong and whether we can improve things in future. I can organise a way to share it privately if that'd help.

Sure, let me check and get back.


@DavidTurner - Unfortunately we lost the dumps during the master node migration, where we were adding new master nodes and deleting the old ones. I wish we had taken a backup somewhere. Sorry again.

@DavidTurner - I got hold of the latest dump and I see this as the prime suspect -

Problem Suspect 1

One instance of "org.elasticsearch.transport.TransportService" loaded by "sun.misc.Launcher$AppClassLoader @ 0x6ea660000" occupies 3,152,038,736 (84.85%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".

Keywords
sun.misc.Launcher$AppClassLoader @ 0x6ea660000
java.util.concurrent.ConcurrentHashMap$Node
org.elasticsearch.transport.TransportService

Looks like most of the heap was used by the transport service.

Let me know if there is any specific information I can share.
Thanks
Imran

Ok, first up, your master nodes only have ~4GB of heap? That seems rather small for a cluster with 100+ nodes.

The clientHandlers map contains a handler for every in-flight request: requests that the node has sent but for which it has not yet received a response. 3GB does seem like a lot for this. My first question would be whether this is an enormous number of tiny entries, or whether there's a smaller number of entries that are themselves enormous. The entries themselves should be of type TransportService$RequestHolder, which has an action field that tells us the type of each request.
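
As a rough illustration of that structure, here is a much-simplified sketch, not the actual Elasticsearch source: the transport layer keeps one holder per outstanding request, keyed by request id, and the entry is only removed once a response or failure comes back, so a request that never completes stays on the heap along with whatever its handler retains. The class and method names other than clientHandlers, RequestHolder, and action are made up for the sketch.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of how a transport service tracks in-flight requests.
// Not the Elasticsearch implementation, just the general shape described above.
class SketchTransportService {

    static class RequestHolder {
        final String action;  // e.g. "internal:discovery/zen/fd/ping" or "cluster:monitor/nodes/stats"
        final Object handler; // runs when the response arrives; can retain a lot of heap in the meantime

        RequestHolder(String action, Object handler) {
            this.action = action;
            this.handler = handler;
        }
    }

    private final AtomicLong requestIdGenerator = new AtomicLong();

    // One entry per request that has been sent but not yet answered.
    private final Map<Long, RequestHolder> clientHandlers = new ConcurrentHashMap<>();

    long sendRequest(String action, Object handler) {
        long requestId = requestIdGenerator.incrementAndGet();
        clientHandlers.put(requestId, new RequestHolder(action, handler));
        // ... send the request over the wire ...
        return requestId;
    }

    void handleResponseOrFailure(long requestId) {
        // Only at this point is the entry (and everything it retains) released.
        clientHandlers.remove(requestId);
    }
}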

Can you dig further into the client handlers and tell us a bit more about the in-flight requests?


My first question would be whether this is an enormous number of tiny entries, or whether there's a smaller number of entries that are themselves enormous.

There are 510 of them.

The entries themselves should be of type TransportService$RequestHolder, which has an action field that tells us the type of each request.

Most of them have this action: 'internal:discovery/zen/fd/ping'

Can you dig further into the client handlers and tell us a bit more about the in-flight requests?

All of them are from different data nodes.

Ok, 510 requests is about 3 per node, so that's not ridiculous. Yet 3GB over 510 requests averages out at about 6MB per request (3,152,038,736 bytes ÷ 510 ≈ 6.2MB).

Hmm, I would expect these to be quite small. How much are these requests retaining? Are there other requests that are less numerous but which retain more heap?

Out of the 510, there are 5 outliers (the rest are all ~200 to ~500 bytes). Of these 5, 4 are from the same data node and one is from a different node. They are all of type cluster:monitor/nodes/stats.

3 of them showed retained heap >200MB and 2 of them >135MB.

Those numbers don't add up to 3GB, which suggests that something large is being shared between these handlers. This can be quite tricky to pin down and I'm not sure it'll be very simple to explain the process. Would it be possible for you to share the heap dump privately with me and the Elastic team at the following link?

https://upload-staging.elstc.co/u/24d127fd-0634-4f5c-9a20-2aee259f5cfa

Done!

Thanks. It looks like those cluster:monitor/nodes/stats requests are the majority of the problem. Each one is trying to collect a few kB of stats about every shard on every node, and there look to be quite a few in flight, so it is all adding up. I would recommend:

  • sending fewer of these stats requests
  • sending them to a coordinating node rather than to the master
  • limiting the stats that are returned to cut down on how expensive these requests are (see the sketch below).
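
To illustrate the last two points, a hedged sketch of a cheaper nodes-stats call sent to a coordinating (client) node rather than the master. The host es-client01 and the jvm,os metric selection are placeholders; pick whichever metrics you actually need from the /_nodes/stats/<metrics> API.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Requests only the jvm and os metric groups via GET /_nodes/stats/<metrics>,
// and targets a coordinating node instead of the elected master.
public class LimitedNodesStats {
    public static void main(String[] args) throws Exception {
        // "es-client01" is a placeholder for one of the coordinating-only (client) nodes.
        URL url = new URL("http://es-client01:9200/_nodes/stats/jvm,os");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
        }
    }
}

Leaving out the indices metric (or narrowing it) should avoid collecting the per-shard statistics described above, which is the expensive part of these requests.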

Thank you, this is great information. Let me see what changes we can make in our cluster.

