Help for removing a crashed node?


#1

Hi,

I have a two-node cluster, and the master node crashed while running an update on an index.
The number of replicas is set to 1. I want to remove the sick node and then add a new node. I saw a page on how to shut down a node, but I didn't see a complete step-by-step guide for removing a sick node, so I wanted to ask here:

  1. After I run "shutdown" on the master node, can I just halt the Elasticsearch process on that node?
  2. If I run "shutdown" on the master node, will the other node start throwing errors because of the number of replicas that is set?
  3. Can I delete the problematic index on the healthy node as if nothing had happened?

Thanks a lot!
b


(Mark Walkom) #2

If you shut down the bad node, you can just replace it. Because you have replicas, your data will be safe, and part of it will be copied over to the new node when it joins.
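
To see what the replica setting means in practice before and after the bad node is stopped, the response of the cluster health API (`GET _cluster/health`) can be inspected. The following is a minimal Python sketch, not an official procedure. It assumes a node reachable over HTTP at the hypothetical address `http://localhost:9200` without authentication (if a security plugin is enabled, credentials would be needed); `fetch_health` and `summarize_health` are names made up for illustration, and the sample response is invented.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url="http://localhost:9200"):
    """Fetch GET _cluster/health from a node (base_url is an assumption)."""
    with urlopen(base_url + "/_cluster/health", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_health(health):
    """Turn a _cluster/health response into a one-line verdict."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    nodes = health.get("number_of_nodes", 0)
    if status == "green":
        return "green: all primaries and replicas assigned on %d node(s)" % nodes
    if status == "yellow":
        return "yellow: primaries available, %d replica shard(s) unassigned" % unassigned
    return "red or unknown: at least one primary shard may be unavailable"


if __name__ == "__main__":
    # Invented response for a two-node cluster after one node is stopped.
    sample = {"status": "yellow", "number_of_nodes": 1, "unassigned_shards": 5}
    print(summarize_health(sample))
```

With one replica and only one node left, a yellow status is the expected state, since a replica is never placed on the same node as its primary. The status should return to green once the replacement node has joined and the replicas have been copied.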


#3

Thanks for your advice.

Well, the sick node seems to have recovered by itself. However, the healthy node can no longer connect to the previously sick node. I'm getting the errors below.

Node 1 (previously had the OOM error)

[2016-03-04 01:18:46,067][WARN ][shield.transport.netty ] [node-1] exception caught on transport layer [[id: 0x5d126fb9, /MYIPForNode2:38674 :> /MyIPForNode1:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (ff,f4,ff,fd)
at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:64)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)

Node 2 (healthy node)

[2016-03-04 01:18:26,842][INFO ][discovery.zen ] [node-2] failed to send join request to master [{node-1}{MUm8WoRyS9Gi5IdIncCe-w}{MyIPForNode1}{MyIPForNode1:9300}{master=true}], reason [RemoteTransportException[[node-1][MyIPForNode1:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[node-2][MyIPForNode2:9300] connect_timeout[30s]]; ]
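
A note on the Node 1 exception: the frame decoder named in the stack trace expects every transport message to start with the two ASCII bytes `ES`. The bytes it reports, `ff,f4,ff,fd`, match the Telnet command bytes defined in RFC 854 (IAC IP followed by IAC DO), which is what a telnet client typically sends on Ctrl-C. One possible reading, not confirmed anywhere in this thread, is that port 9300 was probed with telnet from Node 2. The Python sketch below only illustrates that byte check; `classify_transport_prefix` is a hypothetical helper and not part of Elasticsearch.

```python
# Telnet command bytes as defined in RFC 854.
IAC = 0xFF
TELNET_COMMANDS = {
    0xF4: "IP (interrupt process)",
    0xFB: "WILL",
    0xFC: "WONT",
    0xFD: "DO",
    0xFE: "DONT",
}


def classify_transport_prefix(data):
    """Guess what kind of client sent the first bytes seen on the transport port."""
    if data[:2] == b"ES":
        return "elasticsearch transport header"
    if len(data) >= 2 and data[0] == IAC and data[1] in TELNET_COMMANDS:
        return "telnet command: IAC " + TELNET_COMMANDS[data[1]]
    return "unknown"


if __name__ == "__main__":
    # The four bytes quoted in the StreamCorruptedException above.
    logged = bytes.fromhex("fff4fffd")
    print(classify_transport_prefix(logged))
```

If that reading is right, this warning is separate from the join failure in the Node 2 log, which reports a connect timeout from Node 1 back to Node 2 on port 9300.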

Also, on the previously sick node, I deleted the problematic index. However, it seems that requests to that deleted index are still coming in, as I see these exceptions in the log.

[2016-03-04 00:38:03,813][INFO ][rest.suppressed ] /.marvel-es-data/cluster_info/_search Params: {index=.marvel-es-data, type=cluster_info}
[.marvel-es-data] IndexNotFoundException[no such index]

Any idea as to how to stop this? I am thinking maybe I should still shut down the previously bad node anyway.

Thanks in advance!


(Mark Walkom) #4

Are you on Windows? That looks like an IP Windows assigns to itself when it cannot get a valid one.


#5

No, this is Ubuntu launched through Softlayer. Thanks!

At least I found out why updating the index caused the OOM -- it was the "Mapping Explosion" problem mentioned here:

I deleted the entire index that had too many mappings, but the effect still seems to be continuing.
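
To gauge how large a mapping has grown, the field definitions returned by the get-mapping API can be counted. Here is a minimal Python sketch, assuming the mapping for a single type has already been loaded into a dict; `count_fields` and the sample mapping are made up for illustration and are not taken from my cluster.

```python
def count_fields(mapping):
    """Recursively count leaf fields under 'properties', plus multi-fields under 'fields'."""
    total = 0
    for definition in mapping.get("properties", {}).values():
        if "properties" in definition:
            total += count_fields(definition)  # object field: count its children
        else:
            total += 1  # leaf field
        total += len(definition.get("fields", {}))  # multi-fields such as "raw"
    return total


if __name__ == "__main__":
    # Invented mapping: user.name (+ user.name.raw), user.age, message.
    sample = {
        "properties": {
            "user": {
                "properties": {
                    "name": {"type": "string", "fields": {"raw": {"type": "string"}}},
                    "age": {"type": "integer"},
                }
            },
            "message": {"type": "string"},
        }
    }
    print(count_fields(sample))  # 4
```

A count that keeps climbing as documents are indexed would suggest that document keys are being turned into new field names.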

