Cluster is down and master nodes are not coming up

(Imran Siddique) #1

It looks like a connectivity issue prevented the nodes from communicating with each other for some time, and the cluster became unavailable.

There are 3 master, 6 client and 156 data nodes. There are close to 60k shards, most of them active (indexing and search happening). After multiple restart attempts, the master nodes are still not forming a cluster.

Some of the errors seen -

[es-m02-rm] send message failed [channel: NettyTcpChannel{localAddress=, remoteAddress=/}]
java.nio.channels.ClosedChannelException: null
at$AbstractUnsafe.close(...)(Unknown Source) ~[?:?]

[es-m01-rm] fatal error in thread [elasticsearch[es-m01-rm][generic][T#184]], exiting
java.lang.OutOfMemoryError: Java heap space

Any input on how to fix this?

(Mark Walkom) #2

What version are you on?

Have you considered splitting your cluster up?

(Imran Siddique) #3

The version is 6.2 and the cluster is PROD, so I am looking for something that can help right now. Splitting is a good suggestion, and I think we should do that once the cluster is back up.

(Imran Siddique) #4

Should I do this for now -

  • Stop ES on all nodes
  • Start master nodes
  • Let them form the cluster
  • Start query nodes
  • Add data nodes in batches
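
A minimal sketch of how the steps above could be laid out as an ordered plan, assuming Python is available on an admin host. It does not start or stop anything; it only groups the nodes into stages. The node names other than the two masters seen in the logs, the batch size of 20, and the helper names `batches` and `restart_plan` are all hypothetical.

```python
from typing import Iterator, List, Tuple

Step = Tuple[str, List[str]]


def batches(nodes: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` node names."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(nodes), size):
        yield nodes[start:start + size]


def restart_plan(masters: List[str], clients: List[str],
                 data_nodes: List[str], batch_size: int) -> List[Step]:
    """Return the staged restart as an ordered list of (action, nodes)."""
    plan: List[Step] = [
        ("stop Elasticsearch", masters + clients + data_nodes),
        ("start masters and wait for an elected master", masters),
        ("start client (query) nodes", clients),
    ]
    # Data nodes rejoin in small groups so the new master is not flooded.
    for group in batches(data_nodes, batch_size):
        plan.append(("start data batch and wait for it to join", group))
    return plan


if __name__ == "__main__":
    # Hypothetical names; only es-m01-rm and es-m02-rm appear in the logs.
    masters = [f"es-m{i:02d}-rm" for i in range(1, 4)]
    clients = [f"es-c{i:02d}" for i in range(1, 7)]
    data = [f"es-d{i:03d}" for i in range(1, 157)]
    steps = restart_plan(masters, clients, data, batch_size=20)
    for number, (action, nodes) in enumerate(steps, 1):
        print(f"{number:2d}. {action}: {len(nodes)} node(s)")
```

Between stages, one option is to poll the cluster health API with its `wait_for_nodes` parameter before moving on; the right batch size depends on how much load the elected master can absorb, so 20 is only a placeholder.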

What I have seen is that one of the master nodes becomes the active master, all the nodes bombard it with requests/pings, and it just dies. So far 2 of the master nodes have become active master 3 times, for spans of 2 - 5 minutes, in the last hour.

(Imran Siddique) #5

Closing the loop - the above steps worked.

(David Turner) #6

Hi @mosiddi, glad to hear you got your cluster up and running again. When you hit that java.lang.OutOfMemoryError the JVM would normally write a heap dump. Would it be possible for you to share that? It'd help greatly in determining what exactly went wrong and whether we can improve things in future. I can organise a way to share it privately if that'd help.

(Imran Siddique) #7

Sure, let me check and get back.

(Imran Siddique) #8

@DavidTurner - Unfortunately we lost the dumps during the master node migration process, where we were adding new master nodes and deleting old master nodes. I wish we had taken a backup somewhere. Sorry again.