Make cluster more resilient to network failures - how?

Hello all,

I need some advice regarding node-to-node communication. It happens quite often (several times per day) that individual nodes lose contact with the master; the cluster goes to yellow and then recovers. Occasionally several nodes do this simultaneously and the cluster goes to red.

I saw the following pattern in the logs:

On the affected node, the master leaves:

[2017-08-11T08:37:16,805][WARN ][o.e.d.z.ZenDiscovery ] [x0231se] master left (reason = failed to ping, tried [15] times, each with maximum [30s] timeout), current nodes: {...}

On the master, I see a bunch of communication errors:

[2017-08-11T08:37:05,715][DEBUG][o.e.a.a.c.s.TransportClusterStatsAction] [x0178se] failed to execute on node [bSd3Zy5tQDqmD86h4xsLqA]
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]
at [...]
Caused by: java.lang.IllegalStateException: No routing state mapped for [0]
at [...]

[2017-08-11T08:37:05,720][WARN ][o.e.t.n.Netty4Transport ] [x0178se] exception caught on transport layer [[id: 0xf22323be, L:/138.106.49.8:43395 - R:/138.106.67.61:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [12358927], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@240d88e6], error [false]; resetting
at [...]

[2017-08-11T08:37:05,835][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [x0178se] failed to execute on node [bSd3Zy5tQDqmD86h4xsLqA]
org.elasticsearch.transport.SendRequestTransportException: [x0231se][138.106.67.61:9300][cluster:monitor/nodes/info[n]]
at [...]
Caused by: [org.elasticsearch.transport.NodeNotConnectedException: [x0231se][138.106.67.61:9300] Node not connected
at [...]
... 62 more

[2017-08-11T08:37:05,850][WARN ][o.e.a.a.c.n.i.TransportNodesInfoAction] [x0178se] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [bSd3Zy5tQDqmD86h4xsLqA]
at [...]
Caused by: org.elasticsearch.transport.SendRequestTransportException: [x0231se][138.106.67.61:9300][cluster:monitor/nodes/info[n]]
at [...]
... 1 more
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [x0231se][138.106.67.61:9300] Node not connected
at [...]
... 1 more

[2017-08-11T08:37:05,958][INFO ][o.e.c.r.a.AllocationService] [x0178se] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{x0231se}{bSd3Zy5tQDqmD86h4xsLqA}{YS6vsuiQTjClBtU8uAaMug}{138.106.67.61}{138.106.67.61:9300}{type=warm} transport disconnected]).

[2017-08-11T08:37:05,959][INFO ][o.e.c.s.ClusterService ] [x0178se] removed {{x0231se}{bSd3Zy5tQDqmD86h4xsLqA}{YS6vsuiQTjClBtU8uAaMug}{138.106.67.61}{138.106.67.61:9300}{type=warm},}, reason: zen-disco-node-failed({x0231se}{bSd3Zy5tQDqmD86h4xsLqA}{YS6vsuiQTjClBtU8uAaMug}{138.106.67.61}{138.106.67.61:9300}{type=warm}), reason(transport disconnected)[{x0231se}{bSd3Zy5tQDqmD86h4xsLqA}{YS6vsuiQTjClBtU8uAaMug}{138.106.67.61}{138.106.67.61:9300}{type=warm} transport disconnected]

I know that our network is struggling with spotty performance; could the above be indicative of dropped packets/connections at the network layer? If so, the network problems should be rare and intermittent - are there any tunables that could make the cluster more resilient to such transient network failures? As you can see, I have already raised the Zen ping_retries to 15, without any result; the relevant settings are shown below.
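
For reference, this is roughly what the fault-detection block in our elasticsearch.yml looks like (ping_retries is the only value I changed; the interval and timeout lines are just the defaults, listed here for completeness):

discovery.zen.fd.ping_interval: 1s     # default: how often master and nodes ping each other
discovery.zen.fd.ping_timeout: 30s     # default: how long to wait for each ping response
discovery.zen.fd.ping_retries: 15      # raised from the default of 3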

Thanks for your help!

//Dan

Your best bet is to try to solve those known network issues.
High latency or super flaky networks are not a high-performance cluster's best friend.
Are the nodes in your cluster far apart in a networking sense, for example in different data centres?
You want the nodes in your cluster to have as low a latency to each other as possible.

Other factors that might be at play are whether your cluster is currently overloaded, or needs more resources or tuning to manage its current data and workload.

With your current setting of 15 attempts at a 30-second timeout each, the fact that this still fails means that the nodes are taking longer than 30 seconds to successfully ping each other, and even 15 attempts don't manage to get a response within that limit. It also means that a genuinely sick or dead node will make the cluster wait a full 7.5 minutes before ejecting it, which is also not optimal: you could have ejected a dead node sooner and started recovery earlier.
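
Spelling out that worst-case detection time (assuming each of the 15 retries has to run into the full 30-second timeout before the node is declared dead):

\[ 15 \times 30\,\mathrm{s} = 450\,\mathrm{s} = 7.5\ \text{minutes} \]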

Hi Peter, thanks for responding.

Yes, I know, but this really doesn't help me right now. :frowning:

Could you elaborate on this? How could an overloaded cluster result in the OS not replying to ICMP pings?

I believe this is false. ping_retries may be set to 15, but ping_interval is still at the 1-second default. Considering that ping_timeout is 30 seconds (also the default), that would make the cluster eject the sick node after 45 seconds.
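
In other words (this is my reading of how the fault-detection settings combine, so treat it as my interpretation rather than gospel):

\[ 15 \times 1\,\mathrm{s}\ (\text{interval}) + 30\,\mathrm{s}\ (\text{timeout}) = 45\,\mathrm{s} \]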

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.