Nodes silently leaving and returning?

We recently upgraded to 2.4.2 and are now seeing nodes mysteriously leaving and coming back. That is, the cluster becomes yellow for a while and then goes back to green.

In the log on the master node I see this:
[2016-12-21 07:20:41,821][WARN ][cluster.action.shard ] [es-150e.foo.bar] [reference_2015-04-01_2][5] received shard failed for target shard [[reference_2015-04-01_2][5], node[-wygWN2DQG6FibS7MoW25g], [R], v[183], s[STARTED], a[id=Qpn3IcmGQPWMlotMpyZPYg]], indexUUID [MNw2RvCPSTeEnNNJYoUIxw], message [failed to perform indices:data/write/bulk[s] on replica on node {es-247d.foo.bar}{-wygWN2DQG6FibS7MoW25g}{10.0.69.125}{10.0.69.125:9300}{aws_availability_zone=us-east-1d, index_set=reference_partitioned, max_local_storage_nodes=1, master=false}], failure [NodeDisconnectedException[[es-247d.foo.bar][10.0.69.125:9300][indices:data/write/bulk[s][r]] disconnected]]
NodeDisconnectedException[[es-247d.foo.bar][10.0.69.125:9300][indices:data/write/bulk[s][r]] disconnected]
[2016-12-21 07:20:41,831][INFO ][cluster.routing.allocation] [es-150e.foo.bar] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[reference_2015-04-01_2][5]] ...]).
[2016-12-21 07:54:50,472][INFO ][cluster.routing.allocation] [es-150e.foo.bar] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[reference_2015-04-01_2][5]] ...]).
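To see which shard is still recovering while the cluster is yellow, you can poll the health and cat APIs. Below is a minimal sketch (assuming the cluster is reachable on localhost:9200 without authentication; adjust the endpoint for your setup):

```python
# Minimal health check: report cluster status and any shard copies that are not STARTED.
# Assumes the cluster is reachable on localhost:9200 without authentication.
import requests

BASE = "http://localhost:9200"  # adjust to your cluster

health = requests.get(BASE + "/_cluster/health").json()
print("status:", health["status"],
      "- unassigned:", health["unassigned_shards"],
      "- initializing:", health["initializing_shards"])

# _cat/shards prints one line per shard copy: index shard prirep state docs store ip node
for line in requests.get(BASE + "/_cat/shards").text.splitlines():
    if " STARTED " not in line:
        print(line)
```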

The strange thing is that I do not see any relevant log entries on the other node involved (in this case es-247d). During the time the cluster was yellow, that node only logged a few lines like this:
[2016-12-21 07:28:20,360][WARN ][index.fielddata ] [es-247d.foo.bar] [reference_2015-09-22_1] failed to find format [compressed] for field [attributes.document_position], will use default
This is a separate problem which we will fix.

But can somebody explain to me why the cluster went yellow for a while? There were no interruptions in network traffic or spikes in CPU usage. We have also seen the same behavior at other times, involving other nodes.

Check the logs on the node that left; there should be something before/during/after the time of the drop-out.
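For example, something along these lines can pull the entries around the drop-out time. The log path is an assumption (adjust it to your installation), and the timestamp parsing matches the [YYYY-MM-DD HH:MM:SS,mmm] prefix in the entries above:

```python
# Print log lines from a window around the time the master reported the failure.
# LOG_PATH and the window size are assumptions; adjust them to your setup.
from datetime import datetime, timedelta

LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"  # hypothetical path
center = datetime(2016, 12, 21, 7, 20, 41)
window = timedelta(minutes=5)

with open(LOG_PATH, errors="replace") as f:
    for line in f:
        try:
            ts = datetime.strptime(line[1:20], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # continuation lines (e.g. stack traces) have no timestamp
        if abs(ts - center) <= window:
            print(line, end="")
```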

As I wrote in my original post, there are no relevant log messages on the node that the master considered to be disconnected.

This turned out to be caused by a network hardware issue on one of the nodes. About 5% of all TCP connections to/from that node failed. The tricky thing was that a lot of other nodes also unexpectedly left the ES cluster, but all problems stopped when we decommissioned that one node.
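For anyone who runs into something similar: a rough way to measure that kind of failure rate is to repeatedly open plain TCP connections to the node's transport port. A sketch, using the address from the log above; the attempt count and timeout are arbitrary:

```python
# Rough probe of the TCP connection failure rate to a node's transport port.
# The address comes from the log entries above; attempt count and timeout are arbitrary.
import socket

HOST, PORT = "10.0.69.125", 9300
ATTEMPTS = 200
failures = 0

for _ in range(ATTEMPTS):
    try:
        with socket.create_connection((HOST, PORT), timeout=2):
            pass  # connection established, then closed immediately
    except OSError:
        failures += 1

print(f"{failures}/{ATTEMPTS} connection attempts failed "
      f"({100.0 * failures / ATTEMPTS:.1f}%)")
```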
