New clean ES data node cannot find master during cluster discovery


#1

Hi guys, appreciate your attention on this thread. This morning I terminated elastic service on one data node of my cluster, which contains altogether 2 data nodes and 2 master-only nodes plus one kibana node. After I unzipped ES v2.2.0 on this node and started (the cluster is on v2.1), I found out replica shards cannot get init and synced.
Attaching logs as follows:
[2016-02-05 12:10:53,858][INFO ][node ] [es-data1] initialized
[2016-02-05 12:10:53,858][INFO ][node ] [es-data1] starting ...
[2016-02-05 12:10:53,923][INFO ][transport ] [es-data1] publish_address {10.241.32.77:9300}, bound_addresses {10.241.32.77:9300}
[2016-02-05 12:10:53,931][INFO ][discovery ] [es-data1] ODW-ES/aGDNdgejSYGjouHahgDTeQ
[2016-02-05 12:11:15,180][DEBUG][action.admin.indices.stats] [es-data1] [indices:monitor/stats] failed to execute operation for shard [[tractor-test][1], node[aGDNdgejSYGjouHahgDTeQ], [R], v[6], s[INITIALIZING], a[id=dbbohATyQ5-wFrTcRIVF-Q], unassigned_info[[reason=NODE_LEFT], at[2016-02-05T02:11:05.251Z], details[node_left[fYC-MCjzRaOJy65SjVeKWA]]]]
[tractor-test][[tractor-test][1]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IndexNotFoundException[no such index];
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: [tractor-test] IndexNotFoundException[no such index]
at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:310)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:97)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
... 7 more

There are 2 lines of config about gateway and discovery which I don't think will bother the work of node. And everything else follows default:
gateway.recover_after_nodes: 4
discovery.zen.ping.unicast.hosts: ["elasticsearch", "elasticsearchbk"]

And I did spot the log about it joining the cluster on master nodes:
[2016-02-05 12:10:56,967][INFO ][cluster.service ] [elasticsearchbk] added {{es-data1}{aGDNdgejSYGjouHahgDTeQ}{10.241.32.77}{10.241.32.77:9300}{master=false},}, reason: zen-disco-join(join from node[{es-data1}{aGDNdgejSYGjouHahgDTeQ}{10.241.32.77}{10.241.32.77:9300}{master=false}])

I also spotted those transport connections had been established between 9300 ports
elasticsearchbk ODW [:~] 1010$ sudo netstat -anp | grep 11779 | grep 32.77
tcp 0 0 ::ffff:10.241.32.149:36179 ::ffff:10.241.32.77:9300 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:9300 ::ffff:10.241.32.77:58373 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:36169 ::ffff:10.241.32.77:9300 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:36170 ::ffff:10.241.32.77:9300 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:9300 ::ffff:10.241.32.77:58367 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:9300 ::ffff:10.241.32.77:58374 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:36175 ::ffff:10.241.32.77:9300 ESTABLISHED 11779/java

So all the proofs I collected point to the fact that it's joined the cluster, and started interacting with other nodes. But what's wrong, especially the weird "NODE_LEFT" in error debug information? It seems too contradictory with those proofs to make me ease off upon this problem.

Regardless of this irritating issue, allow me to bring forth you guys a since blessing on the incoming 2016, as it's now the time for traditional new year in China.


(Nik Everett) #2

The master eligible nodes should be 3 or maybe 5. Never 2.

These are fine.

I'm not really sure about this. The first log line is the node's name. I haven't read enough source here to know for sure why this message came. It might be the new node trying to talk to a ghost of itself. The nodes are pinged to check for going down every 30 seconds I believe the master may not have noticed that old data node went down. Did you kill -9 the data node when you shut it down by chance? This is allowed but it is more likely to cause other parts of the network to complain.

One thing: when you unzipped the new version did you keep the old data directory? You need to do that. The package install does it for you so it is much less troublesome.


#3

Hi nik,
Thanks for the reply. I didn't use -9 to force close the process as it's the last choice I would make. In fact, I did a fresh installation of v2.2.0 after the new node failed to sync with old data, as the cluster is supposed to copy and put replica shards to new empty node normally after init.
Further, I kicked up a new virtual machine and was trying to bring up a fresh v2.1 node to see if shards could get allocated correctly, but got the same debug error as above.


#4

Well I think I just met the old Replica shard stuck at initializing problem. Now the error log looks like this:

[2016-02-06 11:54:27,216][DEBUG][action.admin.indices.stats] [es-data1] [indices:monitor/stats] failed to execute operation for shard [[tractor-test][0], node[Ikk03GBuRR-j56vNDWgkdg], [R], v[11], s[INITIALIZING], a[id=DMw727TBR6eTKTVeuddqaA], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-02-06T03:44:46.097Z]]]

Now there're 2 replica shards initializing with total number 30+, the rest of them remaining in UNASSIGNED.
Maybe I could try wiping those replica and create them then.


#5

The issue was resolved after installing license plugin. The startup process went smoothly without any warning/error.
May I suggest putting some warning msg in logs when license plugin is absent? It really costs lots of time and energy to dig without any useful logs on this.


(system) #6