Lots of "cluster state update task [zen-disco-receive(from master " above the warn threshold of 30s"

shjdwxy · April 9, 2019, 6:40am

Hi,
We use ES for log management, and our ES cluster uses a hot-cold architecture. Each physical host runs one hot node and one cold node. The hot node and the cold node share CPU and memory but use different storage (SSD for the hot node, SATA for the cold node).

Recently, one cold node in one of our clusters has been logging many warnings like these:

[2019-04-09T12:46:21,482][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458408]])] took [40.5s] above the warn threshold of 30s
[2019-04-09T12:55:45,538][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458422]])] took [43.9s] above the warn threshold of 30s
[2019-04-09T12:57:49,492][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458423]])] took [31.1s] above the warn threshold of 30s
[2019-04-09T12:59:22,582][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458425]])] took [55.1s] above the warn threshold of 30s
[2019-04-09T13:00:00,721][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458426]])] took [38.1s] above the warn threshold of 30s
[2019-04-09T13:35:41,830][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458624]])] took [32.4s] above the warn threshold of 30s
[2019-04-09T14:08:07,766][WARN ][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state update task [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed version [1458960]])] took [30.1s] above the warn threshold of 30s

All the other nodes are fine; only this node is affected. I have stopped sending indexing and search requests to it, but these messages still appear.

How can I debug this problem? Any suggestions are welcome.

DavidTurner · April 9, 2019, 7:45am

I would start by setting logger.org.elasticsearch.cluster.service: TRACE on this node, i.e. add that line to elasticsearch.yml and restart the node. This will give us a lot more information about the steps that this node is going through when receiving cluster state updates.

Those took times include the time it takes to write the cluster state to disk; if this node's disk is failing then it might be running much more slowly and causing these timeouts. That's just a hunch, we will be able to see more once we can look at the detailed logs.

shjdwxy · April 9, 2019, 10:27am

Here are the logs related to one such event:

https://gist.github.com/wangxiangyu/548fedec87560a5cf5fc6cf80c75d285

gistfile1.txt

[2019-04-09T17:58:54,867][TRACE][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] will process [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed ve362]])]
[2019-04-09T17:58:54,867][DEBUG][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] processing [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9310} committed vers2]])]: execute
[2019-04-09T17:58:55,054][TRACE][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] cluster state updated, source [zen-disco-receive(from master [master {jssz-billions-es-01-masternode}{IfiUfj6nRKSRpQqSL2tmkQ}{D22n7s9YRwieNT67hyPbUg}{10.69.23.23}{10.69.23.23:9ted version [1461362]])]
cluster uuid: lf3z-gwATBiM-tTTtdTtAg
version: 1461362
state uuid: BtuRtWbNS-O2H6YsBpa78g
from_diff: false
meta data version: 1460066
   [billions-link.im.apply-dao-@2019.04.05-jssz01-0/ag5wKDu8QKGVzk4XnTP2pQ]: v[36]
      0: p_term [1], isa_ids [wFPjBFKxRK64lZ01FJke3Q, QgVhSygUSI6uL55WK0gFXg]

This file has been truncated; the full log is in the gist linked above.

Do you see anything unusual in them?

DavidTurner · April 9, 2019, 10:36am

Yes, the time is all spent here:

[2019-04-09T17:58:55,177][TRACE][o.e.c.s.ClusterService   ] [jssz-billions-es-05-datanode_stale] calling [org.elasticsearch.gateway.DanglingIndicesState@6902aca7] with change to version [1461362]
[2019-04-09T17:59:31,260]
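
To put a number on it, the gap between those two timestamps is roughly 36 seconds, which accounts for almost all of the reported time. Here is a minimal Java sketch of that arithmetic, assuming every log line starts with a bracketed timestamp in the format shown above. The class and method names (TraceGap, timestampOf, gapMillis) are hypothetical helpers for illustration, not part of Elasticsearch.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for reading TRACE output; not an Elasticsearch class.
final class TraceGap {
    // Assumed timestamp layout at the start of each log line,
    // e.g. 2019-04-09T17:58:55,177 (comma before the milliseconds).
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss,SSS");

    // Extracts the timestamp between the leading '[' and the first ']'.
    static LocalDateTime timestampOf(String logLine) {
        int close = logLine.indexOf(']');
        if (!logLine.startsWith("[") || close < 0) {
            throw new IllegalArgumentException("no leading [timestamp]: " + logLine);
        }
        return LocalDateTime.parse(logLine.substring(1, close), LOG_TIME);
    }

    // Milliseconds elapsed between two log lines.
    static long gapMillis(String earlier, String later) {
        return Duration.between(timestampOf(earlier), timestampOf(later)).toMillis();
    }

    public static void main(String[] args) {
        // Only the timestamp prefixes of the two lines quoted above are used here.
        String before = "[2019-04-09T17:58:55,177][TRACE][o.e.c.s.ClusterService   ]";
        String after = "[2019-04-09T17:59:31,260]";
        System.out.println(gapMillis(before, after) + " ms"); // prints "36083 ms"
    }
}
```

Running the same calculation over each pair of consecutive TRACE lines is one way to spot which listener a slow cluster state update is stuck in.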

What I find puzzling is that the offending class, org.elasticsearch.gateway.DanglingIndicesState, usually logs a lot more detail about what it's doing by default:

https://github.com/elastic/elasticsearch/blob/v5.6.16/core/src/main/java/org/elasticsearch/gateway/DanglingIndicesState.java#L135-L146

if (metaData.hasIndex(indexMetaData.getIndex().getName())) {
    logger.warn("[{}] can not be imported as a dangling index, as index with same name already exists in cluster metadata",
        indexMetaData.getIndex());
} else if (graveyard.containsIndex(indexMetaData.getIndex())) {
    logger.warn("[{}] can not be imported as a dangling index, as an index with the same name and UUID exist in the " +
                "index tombstones.  This situation is likely caused by copying over the data directory for an index " +
                "that was previously deleted.", indexMetaData.getIndex());
} else {
    logger.info("[{}] dangling index exists on local file system, but not in cluster metadata, " +
                "auto import to cluster state", indexMetaData.getIndex());
    newIndices.put(indexMetaData.getIndex(), indexMetaData);
}

You can remove logger.org.elasticsearch.cluster.service: TRACE from your config file, but can you investigate why there are no log messages coming from org.elasticsearch.gateway.DanglingIndicesState?

system · May 7, 2019, 10:45am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
