We would like to know whether there are known cases where the network threads can delay sending a ping response after the request has been received, or delay processing a response that has been received. As we understand it, pings have a dedicated channel and each channel is bound to a worker thread. We have a large cluster (70-100 nodes, 100 shards per node) running ES 6.8. We saw the following trace logs on the master node:
[2020-06-02T14:38:30,043][TRACE][o.e.t.T.tracer ] [O2-A3HR] [381675299][internal:discovery/zen/fd/ping] sent to [{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}] (timeout: [30s])
[2020-06-02T14:39:00,043][TRACE][o.e.t.T.tracer ] [O2-A3HR] [381679140][internal:discovery/zen/fd/ping] sent to [{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}] (timeout: [30s])
[2020-06-02T14:39:30,043][TRACE][o.e.t.T.tracer ] [O2-A3HR] [381682372][internal:discovery/zen/fd/ping] sent to [{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}] (timeout: [30s])
[2020-06-02T14:40:00,151][INFO ][o.e.c.s.MasterService ] [O2-A3HR] zen-disco-node-failed({axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{axo1BoC}{axo1BoCxRqWJRA4BrsY9Ow}{mHMVeQaSSRCD8wxRlLV29w}{1xx.xx.xx.xxx}{1xx.xx.xx.xxx:9300}{ zone=us-east-1a, distributed_snapshot_deletion_enabled=true},}
The corresponding entries on the data node show that all three pings were processed late, at 14:41:29, after the master had already removed the node:
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer ] [axo1BoC] [381675299][internal:discovery/zen/fd/ping] received request
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer ] [axo1BoC] [381675299][internal:discovery/zen/fd/ping] sent response
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer ] [axo1BoC] [381679140][internal:discovery/zen/fd/ping] received request
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer ] [axo1BoC] [381679140][internal:discovery/zen/fd/ping] sent response
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer ] [axo1BoC] [381682372][internal:discovery/zen/fd/ping] received request
[2020-06-02T14:41:29,898][TRACE][o.e.t.T.tracer ] [axo1BoC] [381682372][internal:discovery/zen/fd/ping] sent response
At other times we noticed delays in responding to fault detection pings on the master after they were received. The delay is significant, more than four minutes between receiving the request and sending the response:
[2020-06-02T15:45:01,939][TRACE][o.e.t.T.tracer ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] received request
[2020-06-02T15:45:31,939][TRACE][o.e.t.T.tracer ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] received request
[2020-06-02T15:49:34,925][TRACE][o.e.t.T.tracer ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] sent response
[2020-06-02T15:49:34,967][TRACE][o.e.t.T.tracer ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] sent response
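To put numbers on these gaps, the tracer lines can be paired by request ID. Below is a minimal sketch in Python, assuming the log line layout shown above. The names `parse_events` and `event_delays` are made up for illustration and are not part of Elasticsearch, and the sketch assumes request IDs do not collide across the logs being compared.

```python
import re
from collections import defaultdict
from datetime import datetime

# Timestamp layout used in the log lines above, e.g. 2020-06-02T14:38:30,043
TS_FORMAT = "%Y-%m-%dT%H:%M:%S,%f"

# Assumed layout: [timestamp][TRACE][logger] [node] [request id][action] event ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[TRACE\]\[[^\]]*\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*"
    r"\[(?P<req_id>\d+)\]\[(?P<action>[^\]]+)\]\s+"
    r"(?P<event>sent to|received request|sent response|received response)"
)


def parse_events(lines):
    """Map (request id, action) to {event name: first timestamp seen}."""
    events = defaultdict(dict)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a tracer line in the assumed layout
        key = (match["req_id"], match["action"])
        stamp = datetime.strptime(match["ts"], TS_FORMAT)
        events[key].setdefault(match["event"], stamp)
    return events


def event_delays(events, start_event, end_event):
    """Seconds between two events for every request that logged both."""
    delays = {}
    for key, seen in events.items():
        if start_event in seen and end_event in seen:
            delays[key] = (seen[end_event] - seen[start_event]).total_seconds()
    return delays


if __name__ == "__main__":
    # The four master_ping lines from the master log above, copied verbatim.
    sample = [
        "[2020-06-02T15:45:01,939][TRACE][o.e.t.T.tracer ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] received request",
        "[2020-06-02T15:45:31,939][TRACE][o.e.t.T.tracer ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] received request",
        "[2020-06-02T15:49:34,925][TRACE][o.e.t.T.tracer ] [O2-A3HR] [997979195][internal:discovery/zen/fd/master_ping] sent response",
        "[2020-06-02T15:49:34,967][TRACE][o.e.t.T.tracer ] [O2-A3HR] [997979244][internal:discovery/zen/fd/master_ping] sent response",
    ]
    events = parse_events(sample)
    for (req_id, action), secs in event_delays(
        events, "received request", "sent response"
    ).items():
        print(f"{req_id} {action}: {secs:.3f}s")
```

Feeding it the master and data node logs together and calling `event_delays(events, "sent to", "received request")` gives the gap between the master sending a ping and the data node logging it. That comparison is only meaningful if the clocks on the two nodes are reasonably in sync.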