Delay in processing fault detection pings

We would like to know whether there are known cases where the network threads can delay sending a ping response after receiving the request, or delay processing responses that have been received. Our understanding is that pings have a dedicated channel and that each channel is bound to a worker thread. We have a large cluster of 70-100 nodes with 100 shards per node, running ES version 6.8. We saw the following trace logs on the master node:

[2020-06-02T14:38:30,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381675299][internal:discovery/zen/fd/ping] sent to [{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}] (timeout: [30s])
[2020-06-02T14:39:00,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381679140][internal:discovery/zen/fd/ping] sent to [{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}] (timeout: [30s])
[2020-06-02T14:39:30,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381682372][internal:discovery/zen/fd/ping] sent to [{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}] (timeout: [30s])
[2020-06-02T14:40:00,151][INFO ][o.e.c.s.MasterService    ] [O2-A3HR] zen-disco-node-failed({axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true},}

The corresponding entries on the data node, which indicate that the pings were processed late:

[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381675299][internal:discovery/zen/fd/ping] received request
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381675299][internal:discovery/zen/fd/ping] sent response
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381679140][internal:discovery/zen/fd/ping] received request
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381679140][internal:discovery/zen/fd/ping] sent response
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381682372][internal:discovery/zen/fd/ping] received request
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381682372][internal:discovery/zen/fd/ping] sent response
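
For anyone who wants to quantify this from their own trace logs, here is a minimal sketch that matches request ids between the master log and the data node log and reports how long each ping took to show up as "received request". The regex is written against the tracer line format shown above, the function names are hypothetical, the sample lines are abbreviated copies of the entries in this post, and the result is only meaningful if the clocks on both nodes are in sync.

```python
import re
from datetime import datetime

# Matches a transport tracer line as it appears in the logs in this post:
# [timestamp][TRACE][logger] [node] [request id][action] event ...
TRACE_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[TRACE\]\[[^\]]*\]\s*\[[^\]]+\]\s*"
    r"\[(?P<req>\d+)\]\[(?P<action>[^\]]+)\]\s*"
    r"(?P<event>sent to|received request|sent response|received response)"
)


def parse_events(lines):
    """Map (request id, event) to the timestamp of the first matching line."""
    events = {}
    for line in lines:
        m = TRACE_LINE.match(line.strip())
        if m is None:
            continue  # not a tracer line, e.g. the INFO node-failed entry
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
        events.setdefault((m.group("req"), m.group("event")), ts)
    return events


def delivery_lag(master_lines, data_lines):
    """Seconds between 'sent to' on the master and 'received request' on the data node."""
    sent = parse_events(master_lines)
    received = parse_events(data_lines)
    lag = {}
    for (req, event), sent_at in sent.items():
        received_at = received.get((req, "received request"))
        if event == "sent to" and received_at is not None:
            lag[req] = (received_at - sent_at).total_seconds()
    return lag


# Abbreviated copies of the log lines above (target node details elided).
MASTER_LOG = [
    "[2020-06-02T14:38:30,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381675299][internal:discovery/zen/fd/ping] sent to [...] (timeout: [30s])",
    "[2020-06-02T14:39:00,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381679140][internal:discovery/zen/fd/ping] sent to [...] (timeout: [30s])",
    "[2020-06-02T14:39:30,043][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [381682372][internal:discovery/zen/fd/ping] sent to [...] (timeout: [30s])",
]
DATA_LOG = [
    "[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381675299][internal:discovery/zen/fd/ping] received request",
    "[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381679140][internal:discovery/zen/fd/ping] received request",
    "[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer           ] [axo1BoC] [381682372][internal:discovery/zen/fd/ping] received request",
]

if __name__ == "__main__":
    for req, seconds in sorted(delivery_lag(MASTER_LOG, DATA_LOG).items()):
        print(f"request {req}: {seconds:.3f}s between send and receive")
```

On the entries above this gives roughly 180, 150 and 120 seconds for the three pings, i.e. all three requests were seen by the data node at the same instant, well after the 30s timeout of each.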

At other times we noticed that the master delayed responding to fault detection pings after it had received them. The delay is significant, more than four minutes in the entries below:

[2020-06-02T15:45:01,939][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] received request
[2020-06-02T15:45:31,939][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] received request
[2020-06-02T15:49:34,925][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] sent response
[2020-06-02T15:49:34,967][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] sent response
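
The same kind of check can be done on a single node by pairing each "received request" with its "sent response". Below is a minimal sketch under a few assumptions: the function name and the threshold constant are hypothetical, the 30s threshold assumes master fault detection uses the same timeout as the node pings logged earlier, and the request line is assumed to appear before its response in the input.

```python
import re
from datetime import datetime

# Same tracer line format as in the logs in this post.
TRACE_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[TRACE\]\[[^\]]*\]\s*\[[^\]]+\]\s*"
    r"\[(?P<req>\d+)\]\[(?P<action>[^\]]+)\]\s*"
    r"(?P<event>received request|sent response)"
)

# Assumption: 30s, matching "(timeout: [30s])" on the node pings in this post.
PING_TIMEOUT_SECONDS = 30.0


def slow_responses(lines, threshold=PING_TIMEOUT_SECONDS):
    """Return {request id: seconds} for requests answered later than the threshold."""
    received = {}
    slow = {}
    for line in lines:
        m = TRACE_LINE.match(line.strip())
        if m is None:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
        req, event = m.group("req"), m.group("event")
        if event == "received request":
            received.setdefault(req, ts)
        elif req in received:  # "sent response" for a request we saw arrive
            elapsed = (ts - received[req]).total_seconds()
            if elapsed > threshold:
                slow[req] = elapsed
    return slow


# The four master_ping entries from this post.
MASTER_PING_LOG = [
    "[2020-06-02T15:45:01,939][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] received request",
    "[2020-06-02T15:45:31,939][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] received request",
    "[2020-06-02T15:49:34,925][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] sent response",
    "[2020-06-02T15:49:34,967][TRACE][o.e.t.T.tracer           ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] sent response",
]

if __name__ == "__main__":
    for req, seconds in sorted(slow_responses(MASTER_PING_LOG).items()):
        print(f"request {req}: response sent {seconds:.3f}s after the request was received")
```

For the entries above, both requests are flagged: the first was answered about 273 seconds after it was received and the second about 243 seconds after.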
