We have a fairly large production cluster that has been experiencing a high number of node timeouts on our hot indexing nodes since we upgraded to 7.10.1 (from 7.7.0). Prior to the upgrade, we did not see these timeouts. Our hot data nodes sit around 50% CPU idle and our master nodes around 93% idle throughout the day. We are not seeing any GC pauses, and heap usage looks healthy.
Reading the logs before and after a timeout event, everything looks normal on the indexing and searching front, then suddenly a node times out. It also appears that the data node and the master node each think the other has gone down. The node quickly rejoins after timing out, but the leave-then-rejoin cycle causes a small pause in indexing that creates noticeable indexing lag, which we would like to avoid.
Any ideas? Since the cluster is otherwise healthy, we are thinking of increasing the follower and leader check timeouts (roughly as sketched below) to work around the problem until the underlying issue is found and fixed.
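For reference, this is roughly the change we have in mind, using the documented cluster.fault_detection.* settings. These are static settings, so they would go in elasticsearch.yml on every node and require a restart. The 30s values below are just an illustration of raising the timeouts from the 10s default, not something we have tested:

# elasticsearch.yml (static settings, applied on every node and picked up on restart)
cluster.fault_detection.follower_check.timeout: 30s    # default 10s
cluster.fault_detection.follower_check.retry_count: 3  # default 3
cluster.fault_detection.leader_check.timeout: 30s      # default 10s
cluster.fault_detection.leader_check.retry_count: 3    # default 3

With the defaults (10s timeout, 3 retries), a node is removed after roughly 30s of unanswered checks, which matches the "failed [3] consecutive checks" and the 30-50s timings in the logs below.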
Master node logs of a timeout event:
[2020-12-16T14:34:18,189][ERROR][o.e.x.m.c.m.JobStatsCollector] [prod.es5.master.3] collector [job_stats] timed out when collecting data
[2020-12-16T14:34:29,207][WARN ][o.e.t.TransportService ] [prod.es5.master.3] Received response for a request that has timed out, sent [45817ms] ago, timed out [30811ms] ago, action [cluster:monitor/nodes/stats[n]], node [{prod.es5.data.ssd.21}{4noROdVGRx6zHhuhqVeAEw}{gO9TVOKmRcezxFd8c4EvHA}{10.10.1.145}{10.10.1.145:9300}{cdhilrstw}{ml.machine_memory=128835817472, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6659467]
[2020-12-16T14:34:30,469][WARN ][o.e.t.TransportService ] [prod.es5.master.3] Received response for a request that has timed out, sent [52820ms] ago, timed out [42816ms] ago, action [internal:coordination/fault_detection/follower_check], node [{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6658931]
[2020-12-16T14:34:30,469][WARN ][o.e.t.TransportService ] [prod.es5.master.3] Received response for a request that has timed out, sent [41816ms] ago, timed out [31812ms] ago, action [internal:coordination/fault_detection/follower_check], node [{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6660049]
[2020-12-16T14:34:30,469][WARN ][o.e.t.TransportService ] [prod.es5.master.3] Received response for a request that has timed out, sent [30812ms] ago, timed out [20808ms] ago, action [internal:coordination/fault_detection/follower_check], node [{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6660877]
[2020-12-16T14:34:31,509][INFO ][o.e.c.r.a.AllocationService] [prod.es5.master.3] updating number_of_replicas to [73] for indices [.security-6]
[2020-12-16T14:34:32,413][INFO ][o.e.c.s.MasterService ] [prod.es5.master.3] node-left[{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true} reason: followers check retry count exceeded], term: 31, version: 1122578, delta: removed {{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}}
[2020-12-16T14:34:34,519][INFO ][o.e.c.s.ClusterApplierService] [prod.es5.master.3] removed {{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}}, term: 31, version: 1122578, reason: Publication{term=31, version=1122578}
[2020-12-16T14:34:34,689][INFO ][o.e.c.r.DelayedAllocationService] [prod.es5.master.3] scheduling reroute for delayed shards in [56.6s] (472 delayed shards)
Hot data node log of the same timeout event:
[2020-12-16T14:34:34,892][INFO ][o.e.c.c.Coordinator ] [prod.es5.data.ssd.17] master node [{prod.es5.master.3}{b8y_KSE-Q667ImQ8-d-UWQ}{niqjOGRLTZ-AsPUuKr5QTg}{10.10.1.48}{10.10.1.48:9300}{lmr}{ml.machine_memory=32893591552, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{prod.es5.master.3}{b8y_KSE-Q667ImQ8-d-UWQ}{niqjOGRLTZ-AsPUuKr5QTg}{10.10.1.48}{10.10.1.48:9300}{lmr}{ml.machine_memory=32893591552, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] failed [3] consecutive checks
Thanks for your help!