Hello all,
I need some advice regarding node-to-node communication. Several times per day, individual nodes lose contact with the master; the cluster goes yellow and then recovers. Occasionally several nodes do this simultaneously and the cluster goes red.
I saw the following pattern in the logs:
On the affected node, the master leaves:
[2017-08-11T08:37:16,805][WARN ][o.e.d.z.ZenDiscovery ] [x0231se] master left (reason = failed to ping, tried [15] times, each with maximum [30s] timeout), current nodes: {...}
On the master, I see a bunch of communication errors:
[2017-08-11T08:37:05,715][DEBUG][o.e.a.a.c.s.TransportClusterStatsAction] [x0178se] failed to execute on node [bSd3Zy5tQDqmD86h4xsLqA]
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]
at [...]
Caused by: java.lang.IllegalStateException: No routing state mapped for [0]
at [...]
[2017-08-11T08:37:05,720][WARN ][o.e.t.n.Netty4Transport ] [x0178se] exception caught on transport layer [[id: 0xf22323be, L:/138.106.49.8:43395 - R:/138.106.67.61:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [12358927], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@240d88e6], error [false]; resetting
at [...]
[2017-08-11T08:37:05,835][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [x0178se] failed to execute on node [bSd3Zy5tQDqmD86h4xsLqA]
org.elasticsearch.transport.SendRequestTransportException: [x0231se][138.106.67.61:9300][cluster:monitor/nodes/info[n]]
at [...]
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [x0231se][138.106.67.61:9300] Node not connected
at [...]
... 62 more
[2017-08-11T08:37:05,850][WARN ][o.e.a.a.c.n.i.TransportNodesInfoAction] [x0178se] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [bSd3Zy5tQDqmD86h4xsLqA]
at [...]
Caused by: org.elasticsearch.transport.SendRequestTransportException: [x0231se][138.106.67.61:9300][cluster:monitor/nodes/info[n]]
at [...]
... 1 more
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [x0231se][138.106.67.61:9300] Node not connected
at [...]
... 1 more
[2017-08-11T08:37:05,958][INFO ][o.e.c.r.a.AllocationService] [x0178se] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{x0231se}{bSd3Zy5tQDqmD86h4xsLqA}{YS6vsuiQTjClBtU8uAaMug}{138.106.67.61}{138.106.67.61:9300}{type=warm} transport disconnected]).
[2017-08-11T08:37:05,959][INFO ][o.e.c.s.ClusterService ] [x0178se] removed {{x0231se}{bSd3Zy5tQDqmD86h4xsLqA}{YS6vsuiQTjClBtU8uAaMug}{138.106.67.61}{138.106.67.61:9300}{type=warm},}, reason: zen-disco-node-failed({x0231se}{bSd3Zy5tQDqmD86h4xsLqA}{YS6vsuiQTjClBtU8uAaMug}{138.106.67.61}{138.106.67.61:9300}{type=warm}), reason(transport disconnected)[{x0231se}{bSd3Zy5tQDqmD86h4xsLqA}{YS6vsuiQTjClBtU8uAaMug}{138.106.67.61}{138.106.67.61:9300}{type=warm} transport disconnected]
I know that our network is struggling with spotty performance; could the above be indicative of dropped packets or connections at the network layer? If so, the network problems should be rare and intermittent. Are there any tunables that would make the cluster more resilient to such transient network failures? As the first log line shows, I have already raised the Zen ping_retries setting to 15, without any effect.
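For reference, a sketch of what that change looks like in elasticsearch.yml. This assumes the Elasticsearch 5.x Zen fault-detection setting names (discovery.zen.fd.*); the values are taken from the "tried [15] times, each with maximum [30s] timeout" message in the first log line, not copied from the actual config file:

```yaml
# Sketch only: assumes the Elasticsearch 5.x Zen fault-detection setting names.
# Values mirror the "tried [15] times, each with maximum [30s] timeout" log message.
discovery.zen.fd.ping_retries: 15   # raised from the default of 3
discovery.zen.fd.ping_timeout: 30s  # left at the default of 30s
```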
Thanks for your help!
//Dan