Hi all, thanks for your attention on this thread. This morning I stopped the Elasticsearch service on one data node of my cluster, which consists of 2 data nodes and 2 master-only nodes, plus one Kibana node. After unpacking ES v2.2.0 on that node and starting it back up (the rest of the cluster is on v2.1), I found that the replica shards cannot initialize and sync.
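For reference, these are the kinds of checks that show the stuck replicas (a quick sketch, assuming HTTP is on the default port 9200; 10.241.32.77 is the data node's address, per the logs below):

# Cluster-level view: the cluster stays yellow while replicas are unassigned
curl -s 'http://10.241.32.77:9200/_cluster/health?pretty'

# Per-shard view: the replicas show up as INITIALIZING / UNASSIGNED here
curl -s 'http://10.241.32.77:9200/_cat/shards?v'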
The relevant logs from the upgraded node follow:
[2016-02-05 12:10:53,858][INFO ][node ] [es-data1] initialized
[2016-02-05 12:10:53,858][INFO ][node ] [es-data1] starting ...
[2016-02-05 12:10:53,923][INFO ][transport ] [es-data1] publish_address {10.241.32.77:9300}, bound_addresses {10.241.32.77:9300}
[2016-02-05 12:10:53,931][INFO ][discovery ] [es-data1] ODW-ES/aGDNdgejSYGjouHahgDTeQ
[2016-02-05 12:11:15,180][DEBUG][action.admin.indices.stats] [es-data1] [indices:monitor/stats] failed to execute operation for shard [[tractor-test][1], node[aGDNdgejSYGjouHahgDTeQ], [R], v[6], s[INITIALIZING], a[id=dbbohATyQ5-wFrTcRIVF-Q], unassigned_info[[reason=NODE_LEFT], at[2016-02-05T02:11:05.251Z], details[node_left[fYC-MCjzRaOJy65SjVeKWA]]]]
[tractor-test][[tractor-test][1]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IndexNotFoundException[no such index];
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: [tractor-test] IndexNotFoundException[no such index]
at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:310)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:97)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
... 7 more
There are only two non-default lines in the config, concerning gateway and discovery, which I don't think should interfere with the node's operation; everything else is left at the defaults:
gateway.recover_after_nodes: 4
discovery.zen.ping.unicast.hosts: ["elasticsearch", "elasticsearchbk"]
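To confirm that nothing else deviates from the defaults, the effective settings can be dumped through the nodes info API (a sketch; _local restricts the output to the node that receives the request, and port 9200 is an assumption):

# Dump the settings the node actually started with
curl -s 'http://10.241.32.77:9200/_nodes/_local/settings?pretty'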
And I did spot the log entry on the master node about it joining the cluster:
[2016-02-05 12:10:56,967][INFO ][cluster.service ] [elasticsearchbk] added {{es-data1}{aGDNdgejSYGjouHahgDTeQ}{10.241.32.77}{10.241.32.77:9300}{master=false},}, reason: zen-disco-join(join from node[{es-data1}{aGDNdgejSYGjouHahgDTeQ}{10.241.32.77}{10.241.32.77:9300}{master=false}])
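Cluster membership can also be confirmed from any node with the cat nodes API (again a sketch, assuming port 9200); es-data1 should appear in the list with master=false:

curl -s 'http://10.241.32.77:9200/_cat/nodes?v'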
I also verified that transport connections had been established on the 9300 ports between the nodes:
elasticsearchbk ODW [:~] 1010$ sudo netstat -anp | grep 11779 | grep 32.77
tcp 0 0 ::ffff:10.241.32.149:36179 ::ffff:10.241.32.77:9300 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:9300 ::ffff:10.241.32.77:58373 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:36169 ::ffff:10.241.32.77:9300 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:36170 ::ffff:10.241.32.77:9300 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:9300 ::ffff:10.241.32.77:58367 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:9300 ::ffff:10.241.32.77:58374 ESTABLISHED 11779/java
tcp 0 0 ::ffff:10.241.32.149:36175 ::ffff:10.241.32.77:9300 ESTABLISHED 11779/java
So all the evidence I have collected points to the node having joined the cluster and having started interacting with the other nodes. Then what is wrong, and in particular, what is that strange "NODE_LEFT" reason in the debug output? It contradicts the evidence above too much for me to let this problem go.
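For completeness, the per-shard unassigned reason can also be pulled out explicitly with the cat shards and cat recovery APIs (a sketch; the column list is just the set I find useful, and the unassigned.reason column is available in 2.x):

# Which shards are unassigned or initializing, and why
curl -s 'http://10.241.32.77:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node'

# Whether replica recovery is making any progress at all
curl -s 'http://10.241.32.77:9200/_cat/recovery?v'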
Setting this irritating issue aside, allow me to offer you all a sincere blessing for the incoming 2016, as it is now the time of the traditional New Year in China.