ES 5.1.1: Cluster loses a node randomly every few hours. Error: Message not fully read (response) for requestId

Emanuele_Verga · January 11, 2017, 11:24am

Hi all,

I'm wondering if anybody else has seen this, or can help me understand what exactly is going on.
We have a cluster of 5 data nodes, 3 master nodes, and 1 client node, running version 5.1.1.

The data nodes are usually fairly busy, indexing around 5000 documents/s for primary shards.
They seem to be keeping up with load though.

However, every few hours (at irregular intervals and times of day) the cluster loses one of the nodes (not always the same one) and, as expected, the replicas are promoted to primary and take over.
The affected node rejoins almost immediately, the newly unassigned replica shards are restarted on it, and everything is OK until a few hours later, when the cycle repeats.

In the master node log I found the following (an excerpt; please let me know if you need more information):

[2017-01-11T00:27:15,881][WARN ][o.e.t.n.Netty4Transport ] [an-uk-gs2-02-esmaster-02] exception caught on transport layer [[id: 0x09c28985, L:/192.168.172.13:56516 - R:192.168.172.211/192.168.172.211:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [27957359], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@26c93b00], error [false]; resetting

Caused by: java.lang.IllegalStateException: No routing state mapped for [0]

[2017-01-11T00:27:15,876][DEBUG][o.e.a.a.c.s.TransportClusterStatsAction] [an-uk-gs2-02-esmaster-02] failed to execute on node [bXGYVdQeTgqKxhHuoyN_Rw]
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]

[2017-01-11T00:27:15,911][INFO ][o.e.c.r.a.AllocationService] [an-uk-gs2-02-esmaster-02] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300} transport disconnected]).

[2017-01-11T00:27:15,911][INFO ][o.e.c.s.ClusterService ] [an-uk-gs2-02-esmaster-02] removed {{an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300},}, reason: zen-disco-node-failed({an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300}), reason(transport disconnected)[{an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300} transport disconnected]

Any help is appreciated.
Thank you in advance
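
For anyone who wants to check how often nodes drop and which ones, here is a minimal Python sketch that pulls the `removed {...}` events out of a master log. It assumes the ClusterService line format matches the excerpt above exactly; the regex and the `parse_removals` / `gaps_between` helper names are illustrative and not part of Elasticsearch.

```python
import re
from datetime import datetime

# Assumption: lines look like the ClusterService "removed {{node}...}" entry
# quoted in the log excerpt above (ES 5.1.1 default log pattern).
REMOVED_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[INFO\s*\]\[o\.e\.c\.s\.ClusterService\s*\]\s*"
    r"\[(?P<master>[^\]]+)\]\s+removed \{\{(?P<node>[^}]+)\}.*?"
    r"reason\((?P<reason>[^)]*)\)"
)


def parse_removals(lines):
    """Return (timestamp, node_name, reason) for every node-removal line."""
    events = []
    for line in lines:
        m = REMOVED_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
            events.append((ts, m.group("node"), m.group("reason")))
    return events


def gaps_between(events):
    """Return the time deltas between consecutive removal events."""
    ordered = sorted(e[0] for e in events)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]
```

Feeding it the lines of the master log (for example `parse_removals(open(path))`, where `path` is wherever your log lives) gives a list you can group by node name or pass to `gaps_between` to see whether the drops line up with anything periodic.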

jasontedor · January 12, 2017, 2:58am

This is a known issue: https://github.com/elastic/elasticsearch/issues/22285

The fix is: https://github.com/elastic/elasticsearch/issues/22317

The fix will be released soon. The only workaround is to disable monitoring while performing indexing (I recognize that this is a terrible option, but so is losing your nodes).

Emanuele_Verga · January 12, 2017, 10:03am

Thanks, really appreciated

system · February 9, 2017, 10:03am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
