Hi all,
I'm wondering if anybody else has seen this, or can help me understand what exactly is going on.
We have a cluster with 5 data nodes, 3 master nodes and 1 client node, all running version 5.1.1.
The data nodes are usually fairly busy, indexing around 5000 documents/s on primary shards.
They seem to be keeping up with the load, though.
However, every few hours (at irregular intervals and at different times of day) the cluster loses one of the nodes (not always the same one). As expected, the replicas are promoted to primary and take over.
The affected node rejoins almost immediately, the newly unassigned replica shards are restarted on it, and everything is OK until a few hours later, when the cycle repeats itself.
In the master node log I've found the following (this is only part of the log; please let me know if you need more information):
[2017-01-11T00:27:15,881][WARN ][o.e.t.n.Netty4Transport ] [an-uk-gs2-02-esmaster-02] exception caught on transport layer [[id: 0x09c28985, L:/192.168.172.13:56516 - R:192.168.172.211/192.168.172.211:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [27957359], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@26c93b00], error [false]; resetting
Caused by: java.lang.IllegalStateException: No routing state mapped for [0]
[2017-01-11T00:27:15,876][DEBUG][o.e.a.a.c.s.TransportClusterStatsAction] [an-uk-gs2-02-esmaster-02] failed to execute on node [bXGYVdQeTgqKxhHuoyN_Rw]
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]
[2017-01-11T00:27:15,911][INFO ][o.e.c.r.a.AllocationService] [an-uk-gs2-02-esmaster-02] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300} transport disconnected]).
[2017-01-11T00:27:15,911][INFO ][o.e.c.s.ClusterService ] [an-uk-gs2-02-esmaster-02] removed {{an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300},}, reason: zen-disco-node-failed({an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300}), reason(transport disconnected)[{an-uk-gs2-02-esnode-04}{bXGYVdQeTgqKxhHuoyN_Rw}{EBSPoymKRT6SzijVi2wefw}{192.168.172.211}{192.168.172.211:9300} transport disconnected]
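To check whether the drops follow any pattern, here is a minimal sketch for pulling the node removal events out of the master log. It assumes every removal is logged on a single line in the same format as the `ClusterService` line above. The regex and the names `REMOVED_RE`, `TS_FORMAT` and `parse_removals` are my own and not part of Elasticsearch.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, taken from the log excerpt: 2017-01-11T00:27:15,911
TS_FORMAT = "%Y-%m-%dT%H:%M:%S,%f"

# Hypothetical pattern for "removed {{node}...}" lines written by ClusterService.
REMOVED_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"             # timestamp
    r"\[[^\]]*\]\[[^\]]*\]\s*"         # log level and logger name
    r"\[(?P<master>[^\]]+)\]\s+"       # node that wrote the log line
    r"removed \{\{(?P<node>[^}]+)\}"   # name of the first removed node
    r".*?reason\((?P<reason>[^)]*)\)"  # e.g. reason(transport disconnected)
)


def parse_removals(lines):
    """Return a list of (timestamp, node_name, reason) for node removal lines."""
    events = []
    for line in lines:
        match = REMOVED_RE.match(line)
        if match:
            timestamp = datetime.strptime(match.group("ts"), TS_FORMAT)
            events.append((timestamp, match.group("node"), match.group("reason")))
    return events


def removals_per_node(events):
    """Count how often each node was removed from the cluster."""
    return Counter(node for _, node, _ in events)
```

Feeding it the master log (for example `parse_removals(open(path))`, where `path` is wherever your master log lives) gives a list of drop times per node, which makes it easier to see whether the drops line up with anything else, such as the cluster stats requests that fail to deserialize just before the disconnect.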
Any help is appreciated.
Thank you in advance