Cluster is down and master nodes are not coming up

mosiddi · May 2, 2019, 10:48pm

It looks like a connectivity issue left the nodes unable to communicate with each other for some time, and the cluster became unavailable.

There are 3 master, 6 client and 156 data nodes. There are close to 60k shards, most of them active (indexing and search are happening). After multiple attempts to restart the master nodes, they are still not forming a cluster.

Some of the errors seen -

[es-m02-rm] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/172.16.0.53:58051}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.close(...)(Unknown Source) ~[?:?]

[es-m01-rm] fatal error in thread [elasticsearch[es-m01-rm][generic][T#184]], exiting
java.lang.OutOfMemoryError: Java heap space

Any input on how to fix this?

warkolm · May 2, 2019, 11:13pm

What version are you on?

Have you considered splitting your cluster up?

mosiddi · May 2, 2019, 11:26pm

The version is 6.2 and the cluster is PROD, so I'm looking for something that can help right now. Splitting is a good suggestion, and I think we should do that once the cluster is back up.

mosiddi · May 2, 2019, 11:35pm

Should I do this for now -

Stop ES on all nodes
Start master nodes
Let them form the cluster
Start query nodes
Add data nodes in batches

What I have seen is that one of the master nodes becomes the active master, all the nodes bombard it with requests/pings, and it just dies. So far, 2 of the master nodes have become active master 3 times, for spans of 2 to 5 minutes, in the last hour.
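The staged restart described above could be scripted roughly as follows. This is a minimal sketch, not a tested runbook: the node names, the batch size, and the `start_node` and `wait_for_nodes` callables are hypothetical placeholders supplied by the caller; only the batching logic is concrete. In practice `wait_for_nodes` might poll the cluster health API (`GET _cluster/health?wait_for_nodes=>=N`) so that each batch of data nodes joins before the next one is started.

```python
from typing import Callable, Iterator, List, Sequence


def batches(nodes: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` nodes."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(nodes), size):
        yield list(nodes[start:start + size])


def staged_start(
    masters: Sequence[str],
    clients: Sequence[str],
    data: Sequence[str],
    batch_size: int,
    start_node: Callable[[str], None],
    wait_for_nodes: Callable[[int], None],
) -> None:
    """Start masters, then client nodes, then data nodes in batches.

    `start_node` and `wait_for_nodes` are hypothetical hooks, for example
    an SSH command and a cluster-health poll.
    """
    started = 0
    stages = ([list(masters)], [list(clients)], batches(data, batch_size))
    for stage in stages:
        for batch in stage:
            for node in batch:
                start_node(node)
            started += len(batch)
            # Block until the cluster reports this many nodes before adding more.
            wait_for_nodes(started)
```

Keeping the batch size small limits how many nodes try to join and ping the newly elected master at the same time.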

mosiddi · May 3, 2019, 10:30pm

Closing the loop - the above steps worked.

DavidTurner · May 4, 2019, 10:06am

Hi @mosiddi, glad to hear you got your cluster up and running again. When you hit that java.lang.OutOfMemoryError the JVM would normally write a heap dump. Would it be possible for you to share that? It'd help greatly in determining what exactly went wrong and whether we can improve things in future. I can organise a way to share it privately if that'd help.

mosiddi · May 5, 2019, 5:28am

Sure, let me check and get back.

mosiddi · May 16, 2019, 10:08pm

@DavidTurner - Unfortunately we lost the dumps during the master node migration, when we were adding new master nodes and deleting the old ones. I wish we had taken a backup somewhere. Sorry again.

mosiddi · May 28, 2019, 7:54pm

@DavidTurner - I got hold of the latest dump and I see this as the prime suspect -

Problem Suspect 1

One instance of "org.elasticsearch.transport.TransportService" loaded by "sun.misc.Launcher$AppClassLoader @ 0x6ea660000" occupies 3,152,038,736 (84.85%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "" .

Keywords
sun.misc.Launcher$AppClassLoader @ 0x6ea660000
java.util.concurrent.ConcurrentHashMap$Node
org.elasticsearch.transport.TransportService

Looks like most of the heap was used by the transport service.

Let me know if there is any specific information I can share.
Thanks
Imran

DavidTurner · May 28, 2019, 9:36pm

Ok, first up, your master nodes only have ~4GB of heap? That seems rather small for a cluster with 100+ nodes.

The clientHandlers map contains a handler for every in-flight request: requests that the node has sent but for which it has not yet received a response. 3GB does seem like a lot for this. My first question would be whether this is an enormous number of tiny entries, or whether there's a smaller number of entries that are themselves enormous. The entries themselves should be of type TransportService$RequestHolder, which has an action field that tells us the type of each request.

Can you dig further into the client handlers and tell us a bit more about the in-flight requests?

mosiddi · May 28, 2019, 9:53pm

> My first question would be whether this is an enormous number of tiny entries, or whether there's a smaller number of entries that are themselves enormous.

There are 510 of them.

> The entries themselves should be of type TransportService$RequestHolder, which has an action field that tells us the type of each request.

Most of them have this action: 'internal:discovery/zen/fd/ping'

> Can you dig further into the client handlers and tell us a bit more about the in-flight requests?

All of them are from different data nodes.

DavidTurner · May 28, 2019, 9:58pm

Ok, 510 requests is about 3 per node, so that's not ridiculous. Yet 3GB over 510 requests averages out at roughly 6MB per request.
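For reference, the arithmetic behind these two estimates can be reproduced from the figures quoted earlier in the thread. This is only a back-of-the-envelope sketch; it assumes all 165 nodes (3 master, 6 client, 156 data) are counted and that megabytes are decimal.

```python
# Figures quoted in the thread.
retained_bytes = 3_152_038_736   # heap retained by TransportService
in_flight_requests = 510
total_nodes = 3 + 6 + 156        # master + client + data nodes


def per_request_mb(total_bytes: int, requests: int) -> float:
    """Average retained heap per request, in decimal megabytes."""
    return total_bytes / requests / 1_000_000


requests_per_node = in_flight_requests / total_nodes
average_mb = per_request_mb(retained_bytes, in_flight_requests)
print(f"{requests_per_node:.1f} requests per node, {average_mb:.1f} MB per request")
```

An average of several megabytes per request is far more than a fault-detection ping should retain, which is why the distribution of sizes matters more than the mean.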

Hmm, I would expect these to be quite small. How much are these requests retaining? Are there other requests that are less numerous but which retain more heap?

mosiddi · May 28, 2019, 10:12pm

Out of the 510 there are 5 outliers (all the rest are ~200 to ~500 bytes). Of these 5, 4 are from the same data node and one is from a different node. They are all of type cluster:monitor/nodes/stats.

3 of them showed retained heap >200MB and 2 of them >135MB.
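A rough tally of these figures against the total retained by the handler map, as a sketch only: the thread gives lower bounds (">200MB", ">135MB") rather than exact sizes, so the result is an approximation that assumes the outliers sit near those thresholds.

```python
# Lower bounds reported for the five outliers, in MB (">200MB" x3, ">135MB" x2).
outlier_floor_mb = [200, 200, 200, 135, 135]

# The remaining entries are reported as at most ~500 bytes each.
small_entries = 510 - len(outlier_floor_mb)
small_total_mb = small_entries * 500 / 1_000_000

accounted_mb = sum(outlier_floor_mb) + small_total_mb
map_total_mb = 3_152_038_736 / 1_000_000   # total quoted in the heap report

print(f"accounted for (at least): {accounted_mb:.1f} MB")
print(f"retained in total:        {map_total_mb:.1f} MB")
print(f"gap (at most):            {map_total_mb - accounted_mb:.1f} MB")
```

If the outliers are close to the reported thresholds, a large part of the total is not attributed to any single entry.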

DavidTurner · May 29, 2019, 10:40am

Those numbers don't add up to 3GB, which suggests that something large is being shared between these handlers. This can be quite tricky to pin down and I'm not sure it'll be very simple to explain the process. Would it be possible for you to share the heap dump privately with me and the Elastic team at the following link?

https://upload-staging.elstc.co/u/24d127fd-0634-4f5c-9a20-2aee259f5cfa

mosiddi · May 29, 2019, 5:38pm

Done!

DavidTurner · May 29, 2019, 7:14pm

Thanks. It looks like those cluster:monitor/nodes/stats requests are the majority of the problem. Each one is trying to collect a few kB of stats about every shard on every node, and there look to be quite a few in flight, so it is all adding up. I would recommend:

sending fewer of these stats requests
sending them to a coordinating node rather than to the master
limiting the stats that are returned to cut down on how expensive these requests are.
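As an illustration of the second and third suggestions, a monitoring client could build its node stats requests so that they target a coordinating node and name only the metric groups it needs. This is a minimal sketch: the host name `es-client-01` and the helper `node_stats_url` are hypothetical, and the sketch only constructs the URL for the node stats endpoint (`_nodes/stats/{metric}`) without sending anything.

```python
from typing import Iterable, Optional
from urllib.parse import quote, urlunsplit


def node_stats_url(
    host: str,
    metrics: Iterable[str],
    node_id: Optional[str] = None,
    port: int = 9200,
) -> str:
    """Build a node stats URL restricted to the given metric groups."""
    metric_list = ",".join(quote(m, safe="") for m in metrics)
    if not metric_list:
        raise ValueError("name at least one metric to avoid fetching everything")
    parts = ["_nodes"]
    if node_id is not None:
        # Optionally restrict the request to specific nodes.
        parts.append(quote(node_id, safe=""))
    parts += ["stats", metric_list]
    return urlunsplit(("http", f"{host}:{port}", "/" + "/".join(parts), "", ""))


# Hypothetical coordinating (client) node, asking only for JVM and OS stats.
print(node_stats_url("es-client-01", ["jvm", "os"]))
```

Polling such a URL at a longer interval would also address the first suggestion, since fewer requests would be in flight at any one time.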

mosiddi · May 29, 2019, 7:34pm

Thank you, this is great information. Let me see what changes we can make in our cluster.

system · June 26, 2019, 7:34pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
