Help for removing a crashed node?

boreal · March 3, 2016, 11:46pm

Hi,

I have a two-node cluster, and the master node crashed while running an update on an index.
The number of replicas is set to 1. I want to remove the sick node and then add a new node. I saw a page on how to shut down a node, but I didn't see a complete step-by-step guide for removing a sick node, so I wanted to ask here:

1. After I run "shutdown" on the master node, can I just halt the elasticsearch process on that node?
2. If I run "shutdown" on the master node, will the other node start throwing errors because of the number of replicas that is set?
3. Can I delete the problematic index on the healthy node as if nothing had happened?

Thanks a lot!
b

warkolm · March 4, 2016, 1:27am

If you shut down the bad node you can just replace it. As you have replicas your data will be safe, and the cluster will copy part of it over to the new node when it joins.
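
Before and after swapping the node, it can help to look at the cluster health API. With two nodes and one replica, the surviving node would be expected to report yellow (replicas unassigned) until the new node joins, then return to green. A minimal sketch, assuming the node answers on http://localhost:9200 without authentication (with Shield installed, credentials would be needed); `fetch_cluster_health` and `summarize_health` are hypothetical helper names, not part of any Elasticsearch client:

```python
import json
from urllib.request import urlopen


def fetch_cluster_health(base_url="http://localhost:9200"):
    """Call GET /_cluster/health and return the parsed JSON body.

    base_url is an assumption; adjust host, port, and auth for your setup.
    """
    with urlopen(base_url + "/_cluster/health") as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_health(health):
    """Turn a _cluster/health response into a short, readable verdict."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    nodes = health.get("number_of_nodes", 0)
    if status == "green":
        return "green: all primaries and replicas allocated on %d node(s)" % nodes
    if status == "yellow":
        return "yellow: primaries allocated, %d shard copies unassigned" % unassigned
    return "red or unknown: at least one primary shard is not allocated"
```

Calling `summarize_health(fetch_cluster_health())` on the healthy node before stopping the bad one shows whether every primary is available there; anything other than green or yellow is a reason to pause first.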

boreal · March 4, 2016, 7:27am

Thanks for your advice.

Well, the sick node seems to have recovered by itself. However, now the healthy node cannot connect to the previously sick node anymore. I'm getting the errors below.

Node 1 (previously had OOM error)

[2016-03-04 01:18:46,067][WARN ][shield.transport.netty ] [node-1] exception caught on transport layer [[id: 0x5d126fb9, /MYIPForNode2:38674 :> /MyIPForNode1:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (ff,f4,ff,fd)
at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:64)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)

Node 2 (healthy node)

[2016-03-04 01:18:26,842][INFO ][discovery.zen ] [node-2] failed to send join request to master [{node-1}{MUm8WoRyS9Gi5IdIncCe-w}{MyIPForNode1}{MyIPForNode1:9300}{master=true}], reason [RemoteTransportException[[node-1][MyIPForNode1:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[node-2][MyIPForNode2:9300] connect_timeout[30s]]; ]

Also, on the previously sick node I deleted the index that was problematic. However, it seems that requests against that deleted index are still happening, as I see these exceptions in the log.

[2016-03-04 00:38:03,813][INFO ][rest.suppressed ] /.marvel-es-data/cluster_info/_search Params: {index=.marvel-es-data, type=cluster_info}
[.marvel-es-data] IndexNotFoundException[no such index]

Any idea as to how to stop this? I am thinking maybe I should still shut down the previously bad node anyway.

Thanks in advance!

warkolm · March 4, 2016, 7:45am

Are you on Windows? That looks like an IP Windows assigns to itself when it cannot get a valid one.
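
On the node-1 error: the four bytes in the StreamCorruptedException can be decoded by hand. Transport messages start with the ASCII bytes "ES", and SizeHeaderFrameDecoder rejects anything else. The values ff, f4, ff, fd happen to match telnet command codes from RFC 854, so one possible reading (an assumption, not something the log proves) is that a non-Elasticsearch client such as a telnet session connected to port 9300. A small sketch with a hypothetical helper, `describe_header`:

```python
# Telnet command bytes as defined in RFC 854.
TELNET_COMMANDS = {
    0xF4: "IP",    # Interrupt Process
    0xFB: "WILL",
    0xFC: "WONT",
    0xFD: "DO",
    0xFE: "DONT",
    0xFF: "IAC",   # Interpret As Command
}


def describe_header(hex_bytes):
    """Describe the leading bytes quoted in the exception, e.g. 'ff,f4,ff,fd'."""
    raw = bytes(int(part, 16) for part in hex_bytes.split(","))
    if raw[:2] == b"ES":
        return "looks like an Elasticsearch transport message"
    # Fall back to the hex value for bytes that are not telnet commands.
    names = [TELNET_COMMANDS.get(b, "0x%02x" % b) for b in raw]
    return "not a transport message; bytes read as: " + " ".join(names)


print(describe_header("ff,f4,ff,fd"))
# prints: not a transport message; bytes read as: IAC IP IAC DO
```

If that reading is right, this warning is separate from the connect_timeout that node-2 reports, which points at node-1 being unable to open a connection back to node-2 on port 9300.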

boreal · March 4, 2016, 7:51am

No, this is Ubuntu launched through Softlayer. Thanks!

I at least found out why updating the index caused the OOM: it was the "Mapping Explosion" problem mentioned here:

I deleted the entire index that had too many mappings, but the effect still seems to be continuing.
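
For anyone checking whether another index has the same problem, a rough way to spot a mapping explosion is to count the fields in the output of GET /_mapping. This is only a sketch: it assumes the 2.x response layout (index, then "mappings", then type, then "properties"), it ignores multi-fields, and `count_fields` and `fields_per_type` are hypothetical helper names:

```python
def count_fields(properties):
    """Count leaf fields in a mapping 'properties' dict, recursing into objects."""
    total = 0
    for field in properties.values():
        sub = field.get("properties")
        if sub:
            # Object field: count its children instead of the object itself.
            total += count_fields(sub)
        else:
            total += 1
    return total


def fields_per_type(mapping_response):
    """Map 'index/type' to its field count for a parsed GET /_mapping response."""
    counts = {}
    for index, body in mapping_response.items():
        for type_name, type_map in body.get("mappings", {}).items():
            key = index + "/" + type_name
            counts[key] = count_fields(type_map.get("properties", {}))
    return counts
```

A type whose count runs into many thousands, or keeps growing as documents arrive, is a likely candidate for the problem.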
