ES 5.1.1: Cluster loses a node randomly every few hours. Error: Message not fully read (response) for requestId

Hi all,

I'm wondering if anybody else has seen this, or can help me understand what exactly is going on.
We have a cluster with 5 data nodes, 3 master nodes and 1 client node, running version 5.1.1.

The data nodes are usually fairly busy, indexing around 5000 documents/s for primary shards.
They seem to be keeping up with load though.

However, every few hours (at no fixed interval or time of day) the cluster loses one of the nodes (not always the same one). As expected, the replicas are promoted to primary and take over.
The affected node rejoins almost immediately, the newly unassigned replica shards are restarted on it, and everything is OK until a few hours later, when the cycle repeats.

In the master node log I've found the following (an excerpt; please let me know if you need more information):

[2017-01-11T00:27:15,881][WARN ][o.e.t.n.Netty4Transport ] [an-uk-gs2-02-esmaster-02] exception caught on transport layer [[id: 0x09c28985, L:/192.168.172.13:56516 - R:192.168.172.211/192.168.172.211:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [27957359], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@26c93b00], error [false]; resetting

Caused by: java.lang.IllegalStateException: No routing state mapped for [0]

[2017-01-11T00:27:15,876][DEBUG][o.e.a.a.c.s.TransportClusterStatsAction] [an-uk-gs2-02-esmaster-02] failed to execute on node [bXGYVdQeTgqKxhHuoyN_Rw]
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]

[2017-01-11T00:27:15,911][INFO ][o.e.c.r.a.AllocationService] [an-uk-gs2-02-esmaster-02] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300} transport disconnected]).

[2017-01-11T00:27:15,911][INFO ][o.e.c.s.ClusterService ] [an-uk-gs2-02-esmaster-02] removed {{an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300},}, reason: zen-disco-node-failed({an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300}), reason(transport disconnected)[{an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300} transport disconnected]

Any help is appreciated.
Thank you in advance

This is a known issue: https://github.com/elastic/elasticsearch/issues/22285

The fix is: https://github.com/elastic/elasticsearch/issues/22317

The fix will be released soon. The only workaround is to disable monitoring while performing indexing (I recognize that this is a terrible option, but so is losing your nodes).
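
If you want to confirm that your node drops really match this issue before changing anything, one option is to check whether each "removed {...}" event in the master log has a "Message not fully read" warning logged around the same time. The following is only a rough sketch, assuming the log line format quoted above; the function name `correlate_removals` and the 5-second window are illustrative choices, not anything provided by Elasticsearch.

```python
import re
from datetime import datetime, timedelta

# Leading "[2017-01-11T00:27:15,881]" timestamp of a log line (format assumed
# from the excerpt quoted in this thread).
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]")
# Node name in "removed {{node-name}{id}..." lines.
REMOVED = re.compile(r"removed \{\{([^}]+)\}")
MARKER = "Message not fully read"


def correlate_removals(lines, window_seconds=5):
    """Return (timestamp, node, matched) for each node removal found.

    matched is True when a "Message not fully read" warning was logged
    within window_seconds of the removal. The window is an arbitrary guess.
    """
    window = timedelta(seconds=window_seconds)
    warnings, removals = [], []
    current_ts = None
    for line in lines:
        m = TS.match(line)
        if m:
            current_ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S,%f")
        if current_ts is None:
            continue  # continuation line before any timestamped line
        # Stack trace lines carry no timestamp, so they inherit the
        # timestamp of the log line they belong to.
        if MARKER in line:
            warnings.append(current_ts)
        removed = REMOVED.search(line)
        if removed:
            removals.append((current_ts, removed.group(1)))
    return [
        (ts, node, any(abs(ts - w) <= window for w in warnings))
        for ts, node in removals
    ]
```

You could call it as `correlate_removals(open(path))`, where `path` points at your master node log file. Removals that come back with `matched` set to `False` probably have a different cause and are worth looking at separately.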

Thanks, really appreciated
