Lots of "cluster state update task [zen-disco-receive(from master " above the warn threshold of 30s"

Hi,
We use Elasticsearch for log management, and the cluster is built on a hot-cold architecture. Each physical host runs one hot node and one cold node; the two nodes share CPU and memory but use different storage (SSD for the hot node, SATA for the cold node).
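For context, the hot/cold split is the usual attribute-based setup, roughly the following in each node's elasticsearch.yml (the attribute name box_type and its values are illustrative, not copied verbatim from our config):

# hot node (SSD-backed) elasticsearch.yml -- attribute name is illustrative
node.attr.box_type: hot

# cold node (SATA-backed) elasticsearch.yml
node.attr.box_type: cold

Indices are then pinned to a tier with the index setting index.routing.allocation.require.box_type.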

Recently, one cold node in one of our clusters has been producing a lot of warnings like these:

[2019-04-09T12:46:21,482][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458408]])] took [40.5s] above the warn threshold of 30s
[2019-04-09T12:55:45,538][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458422]])] took [43.9s] above the warn threshold of 30s
[2019-04-09T12:57:49,492][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458423]])] took [31.1s] above the warn threshold of 30s
[2019-04-09T12:59:22,582][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458425]])] took [55.1s] above the warn threshold of 30s
[2019-04-09T13:00:00,721][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458426]])] took [38.1s] above the warn threshold of 30s
[2019-04-09T13:35:41,830][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458624]])] took [32.4s] above the warn threshold of 30s
[2019-04-09T14:08:07,766][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458960]])] took [30.1s] above the warn threshold of 30s

The other nodes are all normal; only this node is affected. I have stopped sending indexing and search requests to it, but these log messages still appear.

How can I debug this problem? Any suggestions are welcome.

I would start by setting logger.org.elasticsearch.cluster.service: TRACE on this node, i.e. add that line to elasticsearch.yml and restart the node. This will give us a lot more information about the steps that this node is going through when receiving cluster state updates.
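For clarity, that is a single extra line in the node's config file, something like this:

# elasticsearch.yml on jssz-billions-es-05-datanode_stale; a node restart is needed for it to take effect
logger.org.elasticsearch.cluster.service: TRACE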

Those took times include the time it takes to write the cluster state to disk; if this node's disk is failing then those writes might be running much more slowly, which would explain these warnings. That's just a hunch; we will be able to see more once we can look at the detailed logs.

Here are the logs related to one event:

Anything unusual? :joy:

Yes, the time is all spent here:

[2019-04-09T17:58:55,177][TRACE][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] calling [org.elasticsearch.gateway.DanglingIndicesState@6902aca7] with change to version [1461362]
[2019-04-09T17:59:31,260]

What I find puzzling is that the offending class, org.elasticsearch.gateway.DanglingIndicesState, usually logs a lot more detail about what it's doing by default:

You can remove logger.org.elasticsearch.cluster.service: TRACE from your config file, but can you investigate why there are no log messages coming from org.elasticsearch.gateway.DanglingIndicesState?
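As a sketch of one way to dig into that (my suggestion only; the logger name just follows the class name above, and TRACE is simply the most verbose level), you could temporarily turn up logging for that class on the same node:

# elasticsearch.yml -- temporary, remove again once we've seen what DanglingIndicesState is doing
logger.org.elasticsearch.gateway.DanglingIndicesState: TRACE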
