Hi Team,
ES version: 6.8
No. of nodes: 103
No. of shards: ~10K
We are observing an issue in this cluster where fd (fault detection) pings start failing with timeouts, after 3 retries, on both the master and a data node. As a result, the master kicks the data node out of the cluster. However, _cat/master and _bulk requests were still succeeding on that same data node just a few seconds before it was kicked out. From the logs, these fd pings are not being dropped; their responses are simply delayed, in both directions (master -> data and data -> master). It looks like an issue with either the dedicated ping connection or the transport_worker threads being busy. JVM heap usage was below 80%, there were no warn logs for old GC collections, and CPU was below 40%. When this happens, the data node sometimes does not rejoin the cluster at all until its process is restarted. There is also a ripple effect: it starts with one data node, and then multiple nodes start dropping off.
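For reference, this is roughly what we plan to capture on the affected data node the next time the problem is happening, to check whether the transport threads are actually busy (a diagnostic sketch only; the node id 1QgcRQl is taken from the logs below, and the exact transport thread names in the hot_threads output may differ on 6.8):

# hot threads on the affected data node, to see whether transport threads are pegged
GET /_nodes/1QgcRQl/hot_threads?threads=50&ignore_idle_threads=false

# transport and thread pool stats on the same node
GET /_nodes/1QgcRQl/stats/transport,thread_pool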
Have you seen a similar issue before, or do you have any thoughts on what could be happening here?
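For context, the retry count and timeout in the logs below ("tried [3] times, each with maximum [30s] timeout") match the zen fault-detection defaults; sketched here as elasticsearch.yml settings, assuming we have not overridden them. We have not raised them, since that would presumably only mask the delayed responses rather than explain them.

# zen fault-detection settings (6.8 defaults)
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3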
Logs
master
[2020-05-04T19:15:52,830][INFO ][o.e.c.s.MasterService ] [A6b9h9s] zen-disco-node-failed({1QgcRQl}{1QgcRQlBRCGjBqNdE8MCiw}{wkJVg8pYSuePx-DvpE07yw}{x.x.x.x}{x.x.x.x:9300}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{1QgcRQl}{1QgcRQlBRCGjBqNdE8MCiw}{wkJVg8pYSuePx-DvpE07yw}{x.x.x.x}{x.x.x.x:9300} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{1QgcRQl}{1QgcRQlBRCGjBqNdE8MCiw}{wkJVg8pYSuePx-DvpE07yw}{x.x.x.x}{x.x.x.x:9300}}
data node
[2020-05-04T19:15:51,463][INFO ][c.a.c.e.logger ] [1QgcRQl] GET /_cat/master 200 OK 14 2
[2020-05-04T19:15:53,326][INFO ][c.a.c.e.logger ] [1QgcRQl] POST /_bulk - 200 OK 1978 45
[2020-05-04T19:15:53,880][INFO ][o.e.d.z.ZenDiscovery ] [1QgcRQl] master_left [{A6b9h9s}{A6b9h9sIRj-um3wiMEaw0g}{VsDvuwCySb6YyyTGGG2qFA}{y.y.y.y}{y.y.y.y:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2020-05-04T19:15:53,880][WARN ][o.e.d.z.ZenDiscovery ] [1QgcRQl] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes: ….
…
[2020-05-04T19:16:00,764][WARN ][o.e.d.z.UnicastZenPing ] [1QgcRQl] failed to send ping to [{4EP-6AA}{4EP-6AAlRHiDKxuRi8Q1vA}{a0icmxkVTpmdA9WQ73yFxw}{z.z.z.z}{z.z.z.z:9300}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [4EP-6AA][z.z.z.z:9300][internal:discovery/zen/unicast] request_id [162017265] timed out after [6252ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.0.jar:6.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:690) [elasticsearch-6.8.0.jar:6.8.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
….
[2020-05-04T19:17:31,139][WARN ][o.e.t.TransportService ] [1QgcRQl] Received response for a request that has timed out, sent [90058ms] ago, timed out [60239ms] ago, action [internal:discovery/zen/fd/master_ping], node [{A6b9h9s}{A6b9h9sIRj-um3wiMEaw0g}{VsDvuwCySb6YyyTGGG2qFA}{y.y.y.y}{y.y.y.y:9300}], id [162017569]
…
[2020-05-04T19:17:39,257][WARN ][o.e.t.TransportService ] [1QgcRQl] Received response for a request that has timed out, sent [104716ms] ago, timed out [98464ms] ago, action [internal:discovery/zen/unicast], node [{4EP-6AA}{4EP-6AAlRHiDKxuRi8Q1vA}{a0icmxkVTpmdA9WQ73yFxw}{z.z.z.z}{z.z.z.z:9300}], id [162017265]