Nodes silently leaving and returning?

We recently upgraded to 2.4.2 and are now seeing nodes mysteriously leaving and coming back. That is, the cluster becomes yellow for a while and then goes back to green.

In the log on the master node I see this:
[2016-12-21 07:20:41,821][WARN ][cluster.action.shard ] [es-150e.foo.bar] [reference_2015-04-01_2][5] received shard failed for target shard [[reference_2015-04-01_2][5], node[-wygWN2DQG6FibS7MoW25g], [R], v[183], s[STARTED], a[id=Qpn3IcmGQPWMlotMpyZPYg]], indexUUID [MNw2RvCPSTeEnNNJYoUIxw], message [failed to perform indices:data/write/bulk[s] on replica on node {es-247d.foo.bar}{-wygWN2DQG6FibS7MoW25g}{10.0.69.125}{10.0.69.125:9300}{aws_availability_zone=us-east-1d, index_set=reference_partitioned, max_local_storage_nodes=1, master=false}], failure [NodeDisconnectedException[[es-247d.foo.bar][10.0.69.125:9300][indices:data/write/bulk[s][r]] disconnected]]
NodeDisconnectedException[[es-247d.foo.bar][10.0.69.125:9300][indices:data/write/bulk[s][r]] disconnected]
[2016-12-21 07:20:41,831][INFO ][cluster.routing.allocation] [es-150e.foo.bar] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[reference_2015-04-01_2][5]] ...]).
[2016-12-21 07:54:50,472][INFO ][cluster.routing.allocation] [es-150e.foo.bar] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[reference_2015-04-01_2][5]] ...]).
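To see which shard is still recovering while the cluster is yellow, you can poll the health and cat APIs. Below is a minimal sketch (assuming the cluster is reachable on localhost:9200 without authentication; adjust the endpoint for your setup):

```python
# Minimal health check: report cluster status and any shard copies that are not STARTED.
# Assumes the cluster is reachable on localhost:9200 without authentication.
import requests

BASE = "http://localhost:9200"  # adjust to your cluster

health = requests.get(BASE + "/_cluster/health").json()
print("status:", health["status"],
      "- unassigned:", health["unassigned_shards"],
      "- initializing:", health["initializing_shards"])

# _cat/shards prints one line per shard copy: index shard prirep state docs store ip node
for line in requests.get(BASE + "/_cat/shards").text.splitlines():
    if " STARTED " not in line:
        print(line)
```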

The strange thing is that I do not see any relevant log entries on the other node involved (in this case es-247d). During the time the cluster was yellow, that node only logged a few lines like this:
[2016-12-21 07:28:20,360][WARN ][index.fielddata ] [es-247d.foo.bar] [reference_2015-09-22_1] failed to find format [compressed] for field [attributes.document_position], will use default
This is a separate problem which we will fix.

But can somebody explain to me why the cluster went yellow for a while? There were no interruptions in network traffic or spikes in CPU usage. We have also seen the same behavior at other times, involving other nodes.

Check the logs on the node that left; there should be something before/during/after the time of the drop-out.
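For example, something along these lines can pull the entries around the drop-out time. The log path is an assumption (adjust it to your installation), and the timestamp parsing matches the [YYYY-MM-DD HH:MM:SS,mmm] prefix in the entries above:

```python
# Print log lines from a window around the time the master reported the failure.
# LOG_PATH and the window size are assumptions; adjust them to your setup.
from datetime import datetime, timedelta

LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"  # hypothetical path
center = datetime(2016, 12, 21, 7, 20, 41)
window = timedelta(minutes=5)

with open(LOG_PATH, errors="replace") as f:
    for line in f:
        try:
            ts = datetime.strptime(line[1:20], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # continuation lines (e.g. stack traces) have no timestamp
        if abs(ts - center) <= window:
            print(line, end="")
```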

As I wrote in my original post, there are no relevant log messages on the node that the master considered to be disconnected.

This turned out to be caused by a network hardware issue on one of the nodes. About 5% of all TCP connections to/from that node failed. The tricky thing was that a lot of other nodes also unexpectedly left the ES cluster, but all problems stopped when we decommissioned that one node.
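For anyone who runs into something similar: a rough way to measure that kind of failure rate is to repeatedly open plain TCP connections to the node's transport port. A sketch, using the address from the log above; the attempt count and timeout are arbitrary:

```python
# Rough probe of the TCP connection failure rate to a node's transport port.
# The address comes from the log entries above; attempt count and timeout are arbitrary.
import socket

HOST, PORT = "10.0.69.125", 9300
ATTEMPTS = 200
failures = 0

for _ in range(ATTEMPTS):
    try:
        with socket.create_connection((HOST, PORT), timeout=2):
            pass  # connection established, then closed immediately
    except OSError:
        failures += 1

print(f"{failures}/{ATTEMPTS} connection attempts failed "
      f"({100.0 * failures / ATTEMPTS:.1f}%)")
```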
