Delay in processing fault detection pings

Bukhtawar_Khan · June 3, 2020, 2:15pm

We wanted to know if there are known cases where the network threads can delay sending back a ping response after receiving the request, or delay processing responses that have already been received. Based on our understanding, pings have a dedicated channel and each channel is bound to a worker thread. We have a large cluster of 70-100 nodes with 100 shards per node, running ES version 6.8. We saw the following trace logs on the master node:

[2020-06-02T14:38:30,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381675299][internal:discovery/zen/fd/ping] sent to [{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}] (timeout: [30s])
[2020-06-02T14:39:00,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381679140][internal:discovery/zen/fd/ping] sent to [{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}] (timeout: [30s])
[2020-06-02T14:39:30,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381682372][internal:discovery/zen/fd/ping] sent to [{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}] (timeout: [30s])
[2020-06-02T14:40:00,151][INFO ][o.e.c.s.MasterService    ] [O2-A3HR] zen-disco-node-failed({axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true},}
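As a sanity check on the timeline above, here is a minimal sketch that compares the removal time with the `tried [3] times, each with maximum [30s] timeout` reason. It assumes each ping is sent as the previous one times out, which is what the 30s spacing in the log suggests; the function name `expected_removal` is made up for this example and is not an Elasticsearch API.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the Elasticsearch log lines above.
FMT = "%Y-%m-%dT%H:%M:%S,%f"


def expected_removal(first_ping_sent, ping_timeout_s=30, ping_retries=3):
    """Earliest time the node can be failed if every retry times out.

    Assumption: retries run back to back, so the window is
    ping_timeout_s * ping_retries after the first ping is sent.
    """
    start = datetime.strptime(first_ping_sent, FMT)
    return start + timedelta(seconds=ping_timeout_s * ping_retries)


# Values copied from the master log above.
first_sent = "2020-06-02T14:38:30,043"
removal_logged = datetime.strptime("2020-06-02T14:40:00,151", FMT)

deadline = expected_removal(first_sent)
slack_s = (removal_logged - deadline).total_seconds()
print("expected removal:", deadline.isoformat(timespec="milliseconds"))
print("logged removal is", slack_s, "seconds after the expected time")
```

The logged removal lands almost exactly on the 90 second window, so the master behaved as configured; the open question is why the data node did not answer within that window.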

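The corresponding entries on the data node indicate that the pings were processed late. All three requests were logged as received and answered at 14:41:29, after the master had already removed the node at 14:40:00: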

[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381675299][internal:discovery/zen/fd/ping] received request
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381675299][internal:discovery/zen/fd/ping] sent response
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381679140][internal:discovery/zen/fd/ping] received request
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381679140][internal:discovery/zen/fd/ping] sent response
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381682372][internal:discovery/zen/fd/ping] received request
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381682372][internal:discovery/zen/fd/ping] sent response
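To quantify the delay, the request IDs in the master and data node logs can be matched up. Below is a minimal sketch of how that might be done in Python. The helper names (`parse_tracer_line`, `delivery_delays`) are made up for this example, and the regex is inferred only from the tracer lines shown in this post. It also assumes the clocks on both nodes are in sync and that request IDs are unique within the sending node's log.

```python
import re
from datetime import datetime

# Pattern inferred from the tracer lines in this post (an assumption,
# not an official log format specification).
TRACER_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[TRACE\]\[[^\]]*\] \[(?P<node>[^\]]+)\] "
    r"\[(?P<req_id>\d+)\]\[(?P<action>[^\]]+)\] "
    r"(?P<event>sent to|received request|sent response|received response)"
)


def parse_tracer_line(line):
    """Return (timestamp, node, request_id, action, event) or None."""
    m = TRACER_LINE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S,%f")
    return ts, m["node"], int(m["req_id"]), m["action"], m["event"]


def delivery_delays(sender_lines, receiver_lines):
    """Seconds between 'sent to' on the sender and 'received request'
    on the receiver, keyed by request ID."""
    sent = {}
    for line in sender_lines:
        parsed = parse_tracer_line(line)
        if parsed and parsed[4] == "sent to":
            sent[parsed[2]] = parsed[0]
    delays = {}
    for line in receiver_lines:
        parsed = parse_tracer_line(line)
        if parsed and parsed[4] == "received request" and parsed[2] in sent:
            delays[parsed[2]] = (parsed[0] - sent[parsed[2]]).total_seconds()
    return delays


# Lines from the logs above; the node descriptor after "sent to" is cut
# short here because the pattern does not need it.
master_log = [
    "[2020-06-02T14:38:30,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381675299][internal:discovery/zen/fd/ping] sent to [{axo1BoC}",
    "[2020-06-02T14:39:00,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381679140][internal:discovery/zen/fd/ping] sent to [{axo1BoC}",
    "[2020-06-02T14:39:30,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381682372][internal:discovery/zen/fd/ping] sent to [{axo1BoC}",
]
data_node_log = [
    "[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381675299][internal:discovery/zen/fd/ping] received request",
    "[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381679140][internal:discovery/zen/fd/ping] received request",
    "[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381682372][internal:discovery/zen/fd/ping] received request",
]

for req_id, seconds in delivery_delays(master_log, data_node_log).items():
    print(req_id, seconds)
```

For these entries the gap between send and receive is roughly two to three minutes per ping, and all three requests surface on the data node in the same millisecond, which looks more like a stalled reader than three independently slow deliveries.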

At other times we noticed delays on the master in responding to fault detection pings after they were received. The delay is significant, more than four minutes in the entries below:

[2020-06-02T15:45:01,939][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] received request
[2020-06-02T15:45:31,939][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] received request
[2020-06-02T15:49:34,925][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] sent response
[2020-06-02T15:49:34,967][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] sent response
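The time each request spent on the master between `received request` and `sent response` can be measured the same way. This is a rough sketch under the same assumptions as before: the regex is inferred from the lines in this post, and `handling_times` is a hypothetical helper name, not part of Elasticsearch.

```python
import re
from datetime import datetime

# Pattern inferred from the tracer lines in this post (an assumption).
EVENT_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[TRACE\]\[[^\]]*\] \[[^\]]+\] "
    r"\[(?P<req_id>\d+)\]\[[^\]]+\] "
    r"(?P<event>received request|sent response)"
)


def handling_times(lines):
    """Seconds between 'received request' and 'sent response' on one
    node, keyed by request ID."""
    received = {}
    durations = {}
    for line in lines:
        m = EVENT_LINE.match(line.strip())
        if m is None:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S,%f")
        req_id = int(m["req_id"])
        if m["event"] == "received request":
            received[req_id] = ts
        elif req_id in received:
            durations[req_id] = (ts - received[req_id]).total_seconds()
    return durations


# The four master_ping lines from the master log above.
master_ping_log = [
    "[2020-06-02T15:45:01,939][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] received request",
    "[2020-06-02T15:45:31,939][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] received request",
    "[2020-06-02T15:49:34,925][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] sent response",
    "[2020-06-02T15:49:34,967][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] sent response",
]

for req_id, seconds in handling_times(master_ping_log).items():
    print(req_id, seconds)
```

Both requests were held for over four minutes and then answered within about 40 ms of each other, so whatever blocked the first response appears to have blocked the second as well.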
