Hi Team,
ES version: 6.8
No. of nodes: 103
No. of shards: ~10K
We are observing an issue in this cluster where fd (fault detection) pings start failing with timeouts, after 3 retries, on both the master and a data node. As a result, the master kicks the data node out of the cluster. However, _cat/master and _bulk requests were still succeeding on that same data node just a few seconds before it was kicked out. From the logs, these fd pings are not being dropped; their responses are simply delayed, in both directions (master -> data and data -> master). It looks like an issue with either the dedicated ping connection or the transport_worker threads being busy. JVM heap usage was below 80%, there were no warn logs for old GC collections, and CPU was below 40%. When this happens, the data node sometimes does not rejoin the cluster at all until its process is restarted. There is also a ripple effect: it starts with one data node, and then multiple nodes start dropping off.
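For reference, this is roughly what we plan to capture on the affected data node the next time the problem is happening, to check whether the transport threads are actually busy (a diagnostic sketch only; the node id 1QgcRQl is taken from the logs below, and the exact transport thread names in the hot_threads output may differ on 6.8):

# hot threads on the affected data node, to see whether transport threads are pegged
GET /_nodes/1QgcRQl/hot_threads?threads=50&ignore_idle_threads=false

# transport and thread pool stats on the same node
GET /_nodes/1QgcRQl/stats/transport,thread_pool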
Have you seen a similar issue before, or do you have any thoughts on what could be happening here?
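For context, the retry count and timeout in the logs below ("tried [3] times, each with maximum [30s] timeout") match the zen fault-detection defaults; sketched here as elasticsearch.yml settings, assuming we have not overridden them. We have not raised them, since that would presumably only mask the delayed responses rather than explain them.

# zen fault-detection settings (6.8 defaults)
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3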
Logs
master
[2020-05-04T19:15:52,830][INFO ][o.e.c.s.MasterService ] [A6b9h9s] zen-disco-node-failed({1QgcRQl}{1QgcRQlBRCGjBqNdE8MCiw}{wkJVg8pYSuePx-DvpE07yw}{x.x.x.x}{x.x.x.x:9300}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{1QgcRQl}{1QgcRQlBRCGjBqNdE8MCiw}{wkJVg8pYSuePx-DvpE07yw}{x.x.x.x}{x.x.x.x:9300} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{1QgcRQl}{1QgcRQlBRCGjBqNdE8MCiw}{wkJVg8pYSuePx-DvpE07yw}{x.x.x.x}{x.x.x.x:9300}}
data node
[2020-05-04T19:15:51,463][INFO ][c.a.c.e.logger ] [1QgcRQl] GET /_cat/master 200 OK 14 2
[2020-05-04T19:15:53,326][INFO ][c.a.c.e.logger ] [1QgcRQl] POST /_bulk - 200 OK 1978 45
[2020-05-04T19:15:53,880][INFO ][o.e.d.z.ZenDiscovery ] [1QgcRQl] master_left [{A6b9h9s}{A6b9h9sIRj-um3wiMEaw0g}{VsDvuwCySb6YyyTGGG2qFA}{y.y.y.y}{y.y.y.y:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2020-05-04T19:15:53,880][WARN ][o.e.d.z.ZenDiscovery ] [1QgcRQl] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes: ….
…
[2020-05-04T19:16:00,764][WARN ][o.e.d.z.UnicastZenPing ] [1QgcRQl] failed to send ping to [{4EP-6AA}{4EP-6AAlRHiDKxuRi8Q1vA}{a0icmxkVTpmdA9WQ73yFxw}{z.z.z.z}{z.z.z.z:9300}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [4EP-6AA][z.z.z.z:9300][internal:discovery/zen/unicast] request_id [162017265] timed out after [6252ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:690) [elasticsearch-6.8.0.jar:6.8.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
….
[2020-05-04T19:17:31,139][WARN ][o.e.t.TransportService ] [1QgcRQl] Received response for a request that has timed out, sent [90058ms] ago, timed out [60239ms] ago, action [internal:discovery/zen/fd/master_ping], node [{A6b9h9s}{A6b9h9sIRj-um3wiMEaw0g}{VsDvuwCySb6YyyTGGG2qFA}{y.y.y.y}{y.y.y.y:9300}], id [162017569]
…
[2020-05-04T19:17:39,257][WARN ][o.e.t.TransportService ] [1QgcRQl] Received response for a request that has timed out, sent [104716ms] ago, timed out [98464ms] ago, action [internal:discovery/zen/unicast], node [{4EP-6AA}{4EP-6AAlRHiDKxuRi8Q1vA}{a0icmxkVTpmdA9WQ73yFxw}{z.z.z.z}{z.z.z.z:9300}], id [162017265]