We have a fairly large production cluster that has been experiencing a high number of node timeouts on our hot indexing nodes since we upgraded to 7.10.1 (from 7.7.0). Prior to the upgrade, we did not see these timeouts. Our hot data nodes sit around 50% CPU idle and our master nodes around 93% idle throughout the day. We are not seeing any GC pauses, and heap usage looks healthy.
Reading the logs before and after a timeout event, everything looks normal on the indexing and searching front, then suddenly a node times out. It also appears that the data node and the master node each think the other has gone down. The node quickly rejoins after timing out, but the leave-then-rejoin cycle causes a small pause in indexing that creates noticeable indexing lag, which we would like to avoid.
Any ideas? Since the cluster is otherwise healthy, we are thinking of increasing the follower and leader check timeouts (roughly as sketched below) to work around the problem until the underlying issue is found and fixed.
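For reference, this is roughly the change we have in mind, using the documented cluster.fault_detection.* settings. These are static settings, so they would go in elasticsearch.yml on every node and require a restart. The 30s values below are just an illustration of raising the timeouts from the 10s default, not something we have tested:

# elasticsearch.yml (static settings, applied on every node and picked up on restart)
cluster.fault_detection.follower_check.timeout: 30s    # default 10s
cluster.fault_detection.follower_check.retry_count: 3  # default 3
cluster.fault_detection.leader_check.timeout: 30s      # default 10s
cluster.fault_detection.leader_check.retry_count: 3    # default 3

With the defaults (10s timeout, 3 retries), a node is removed after roughly 30s of unanswered checks, which matches the "failed [3] consecutive checks" and the 30-50s timings in the logs below.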
Master node logs of a timeout event:
[2020-12-16T14:34:18,189][ERROR][o.e.x.m.c.m.JobStatsCollector] [prod.es5.master.3] collector [job_stats] timed out when collecting data
[2020-12-16T14:34:29,207][WARN ][o.e.t.TransportService ] [prod.es5.master.3] Received response for a request that has timed out, sent [45817ms] ago, timed out [30811ms] ago, action [cluster:monitor/nodes/stats[n]], node [{prod.es5.data.ssd.21}{4noROdVGRx6zHhuhqVeAEw}{gO9TVOKmRcezxFd8c4EvHA}{10.10.1.145}{10.10.1.145:9300}{cdhilrstw}{ml.machine_memory=128835817472, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6659467]
[2020-12-16T14:34:30,469][WARN ][o.e.t.TransportService ] [prod.es5.master.3] Received response for a request that has timed out, sent [52820ms] ago, timed out [42816ms] ago, action [internal:coordination/fault_detection/follower_check], node [{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6658931]
[2020-12-16T14:34:30,469][WARN ][o.e.t.TransportService ] [prod.es5.master.3] Received response for a request that has timed out, sent [41816ms] ago, timed out [31812ms] ago, action [internal:coordination/fault_detection/follower_check], node [{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6660049]
[2020-12-16T14:34:30,469][WARN ][o.e.t.TransportService ] [prod.es5.master.3] Received response for a request that has timed out, sent [30812ms] ago, timed out [20808ms] ago, action [internal:coordination/fault_detection/follower_check], node [{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}], id [6660877]
[2020-12-16T14:34:31,509][INFO ][o.e.c.r.a.AllocationService] [prod.es5.master.3] updating number_of_replicas to [73] for indices [.security-6]
[2020-12-16T14:34:32,413][INFO ][o.e.c.s.MasterService ] [prod.es5.master.3] node-left[{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true} reason: followers check retry count exceeded], term: 31, version: 1122578, delta: removed {{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}}
[2020-12-16T14:34:34,519][INFO ][o.e.c.s.ClusterApplierService] [prod.es5.master.3] removed {{prod.es5.data.ssd.17}{IELvgNxfSOyogs878UUz7Q}{Bg01_BCmQYSmNxfim3xz-Q}{10.10.1.227}{10.10.1.227:9300}{cdhilrstw}{ml.machine_memory=128728207360, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, transform.node=true}}, term: 31, version: 1122578, reason: Publication{term=31, version=1122578}
[2020-12-16T14:34:34,689][INFO ][o.e.c.r.DelayedAllocationService] [prod.es5.master.3] scheduling reroute for delayed shards in [56.6s] (472 delayed shards)
Hot data node log of the same timeout event:
[2020-12-16T14:34:34,892][INFO ][o.e.c.c.Coordinator ] [prod.es5.data.ssd.17] master node [{prod.es5.master.3}{b8y_KSE-Q667ImQ8-d-UWQ}{niqjOGRLTZ-AsPUuKr5QTg}{10.10.1.48}{10.10.1.48:9300}{lmr}{ml.machine_memory=32893591552, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{prod.es5.master.3}{b8y_KSE-Q667ImQ8-d-UWQ}{niqjOGRLTZ-AsPUuKr5QTg}{10.10.1.48}{10.10.1.48:9300}{lmr}{ml.machine_memory=32893591552, rack=us-east-1a, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] failed [3] consecutive checks
Thanks for your help!