Increased node timeouts after 7.7.0 to 7.10.1 upgrade

We have a fairly large production cluster that has been seeing a high number of node timeouts (on our hot indexing nodes) since we upgraded from 7.7.0 to 7.10.1. Prior to the upgrade we did not experience these timeouts. Our hot data nodes sit at around 50% CPU idle and our master nodes at around 93% idle throughout the day. We are not seeing any GC pauses and heap usage looks healthy.

From reading the logs before and after a timeout event, everything appears normal on the indexing and searching front, then suddenly a node times out. It also seems that the data node and the master node each think the other has gone down. The node then quickly rejoins after timing out. Having a node leave and then rejoin causes a short pause in indexing, which creates noticeable indexing lag that we want to avoid.
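
As the DelayedAllocationService line in the master log below shows, the departing node's shards are held back for a while before being reallocated; that window comes from index.unassigned.node_left.delayed_timeout (default 1m). Extending it is one way to soften the recovery churn when a node drops out briefly and rejoins. A minimal sketch, with an illustrative value rather than a recommendation:

    PUT _all/_settings
    {
      "settings": {
        "index.unassigned.node_left.delayed_timeout": "5m"
      }
    }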

Any ideas? Since the cluster is otherwise healthy, we are thinking of increasing the follower and leader check timeouts to work around the problem until the root cause is found and fixed.
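
For reference, these are the fault-detection settings we are looking at raising. They are static settings that go in elasticsearch.yml and require a restart; the names match the 7.x docs, but the values below are purely illustrative, not a recommendation:

    # elasticsearch.yml -- illustrative values (defaults are 10s timeouts and 3 retries)
    cluster.fault_detection.follower_check.timeout: 30s
    cluster.fault_detection.follower_check.retry_count: 5
    cluster.fault_detection.leader_check.timeout: 30s
    cluster.fault_detection.leader_check.retry_count: 5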

Master node logs of a timeout event:

    [2020-12-16T14:34:18,189][ERROR][o.e.x.m.c.m.JobStatsCollector] [prod.es5.master.3] collector [job_stats] timed out when collecting data
    [2020-12-16T14:34:29,207][WARN ][o.e.t.TransportService   ] [prod.es5.master.3] Received response for a request that has timed out, sent [45817ms] ago, timed out [30811ms] ago, action [cluster:monitor/nodes/stats[n]], node [{prod.es5.data.ssd.21}{4noROdVGRx6zHhuhqVeAEw}{gO9TVOKmRcezxFd8c4EvHA}{10.10.1.145}{10.10.1.145:9300}{cdhilrstw}{ml.machine_memory=128835817472, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6659467]
    [2020-12-16T14:34:30,469][WARN ][o.e.t.TransportService   ] [prod.es5.master.3] Received response for a request that has timed out, sent [52820ms] ago, timed out [42816ms] ago, action [internal:coordination/fault_detection/follower_check], node [{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6658931]
    [2020-12-16T14:34:30,469][WARN ][o.e.t.TransportService   ] [prod.es5.master.3] Received response for a request that has timed out, sent [41816ms] ago, timed out [31812ms] ago, action [internal:coordination/fault_detection/follower_check], node [{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6660049]
    [2020-12-16T14:34:30,469][WARN ][o.e.t.TransportService   ] [prod.es5.master.3] Received response for a request that has timed out, sent [30812ms] ago, timed out [20808ms] ago, action [internal:coordination/fault_detection/follower_check], node [{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6660877]
    [2020-12-16T14:34:31,509][INFO ][o.e.c.r.a.AllocationService] [prod.es5.master.3] updating number_of_replicas to [73] for indices [.security-6]
    [2020-12-16T14:34:32,413][INFO ][o.e.c.s.MasterService    ] [prod.es5.master.3] node-left[{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true} reason: followers check retry count exceeded], term: 31, version: 1122578, delta: removed {{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}}
    [2020-12-16T14:34:34,519][INFO ][o.e.c.s.ClusterApplierService] [prod.es5.master.3] removed {{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}}, term: 31, version: 1122578, reason: Publication{term=31, version=1122578}
    [2020-12-16T14:34:34,689][INFO ][o.e.c.r.DelayedAllocationService] [prod.es5.master.3] scheduling reroute for delayed shards in [56.6s] (472 delayed shards)

Hot data node log of timeout event:

    [2020-12-16T14:34:34,892][INFO ][o.e.c.c.Coordinator      ] [prod.es5.data.ssd.17] master node [{prod.es5.master.3}{b8y_KSE-Q667ImQ8-d-UWQ}{niqjOGRLTZ-AsPUuKr5QTg}{10.10.1.48}{10.10.1.48:9300}{lmr}{ml.machine_memory=32893591552, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] failed, restarting discovery
    org.elasticsearch.ElasticsearchException: node [{prod.es5.master.3}{b8y_KSE-Q667ImQ8-d-UWQ}{niqjOGRLTZ-AsPUuKr5QTg}{10.10.1.48}{10.10.1.48:9300}{lmr}{ml.machine_memory=32893591552, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] failed [3] consecutive checks

Thanks for your help!

Is there anything in hot threads on that node?
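
For example, something along these lines against the node named in the logs above:

    GET /_nodes/prod.es5.data.ssd.17/hot_threads?threads=5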

We fixed the issue. After trying just about everything we could think of, we stopped using our hot nodes for coordination and added a dozen coordinating-only nodes. We were amazed to see that coordination work burns about 20-40% CPU on the new coordinating-only nodes, and the hot nodes have stopped timing out.
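
For anyone hitting the same thing, a coordinating-only node in 7.x is just a node with an empty role list. A minimal sketch of what the new nodes look like:

    # elasticsearch.yml on each new coordinating-only node
    node.roles: [ ]

The idea is that client search and bulk traffic hits these nodes instead of the hot data nodes, so the coordination fan-out work no longer competes with indexing.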


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.