Discovery.zen.publish reports pending node, cluster gets red at index create

Suny · June 30, 2015, 9:19am

Hi. Our cluster went red a week ago. Sorry in advance, I wasn't able to investigate any better.
In the evening, the cluster got a pending node:
[2015-06-23 21:12:39,839][WARN ][discovery.zen.publish ] [xx104] timed out waiting for all nodes to process published state [649] (timeout [30s], pending nodes: [[xx114-y2][9YHYIAl6T82ZuMfg5a7oaA][xx114-y2][inet[/x.x.x.x:9300]]{disk_type=ssd, machine_id=es114, master=false}])
This message was repeated every couple of minutes until [2015-06-24 04:26:03,513], when the node was restarted.
Some hours later, logstash tried to create a new index. At this point, the cluster went red.
[2015-06-24 02:01:01,193][DEBUG][action.admin.indices.create] [xx104] [logstash-2015.06.24] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (acquire index lock) within 1m
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.doRun(MetaDataCreateIndexService.java:150)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Then the pending node was restarted and the cluster went green again.
I couldn't find anything else in the log: nothing about how that node became half lost, and nothing about why the cluster went red. I wasn't able to reproduce the pending situation.
It can't have been split brain, since we have minimum_master_nodes = 2.
Is it possible that the cluster didn't realize that it lost that node, tried to assign it a primary shard and went red because of this?

warkolm · July 1, 2015, 1:36am

It could be that the node was overloaded. Is there anything regarding GC in the logs before this happened?

Suny · July 1, 2015, 1:08pm

It's a bit spooky, but there's no garbage collection log entry in the master's logs. These pending nodes occurred again some days later, but not for as long.
For the unresponsive non-master nodes, their log files simply end at the time of their first timeout. They do log garbage collections, but at other times.
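
To line up the publish timeouts with the points where the other nodes' logs stop, the warnings can be pulled out of the master log with a small script. This is only a minimal sketch, assuming the line format quoted in the first post. The helper names (`pending_nodes`, `count_by_node`) are made up for illustration, and the handling of several pending nodes in one line (comma-separated inside the outer brackets) is an assumption, since only a single-node example appears above.

```python
import re
import sys
from collections import Counter

# Matches the discovery.zen.publish warning as quoted in the first post.
PUBLISH_TIMEOUT = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[WARN\s*\]\[discovery\.zen\.publish\s*\]\s*"
    r"\[(?P<master>[^\]]+)\] timed out waiting for all nodes to process "
    r"published state \[(?P<version>\d+)\].*"
    r"pending nodes: (?P<pending>\[.*\])\)\s*$"
)

# A node is printed as [name][id][host][inet[...]]{attrs}; the name is the
# first bracketed field. Assumption: further nodes follow after ", ".
NODE_NAME = re.compile(r"(?:^\[|, )\[([^\]\[]+)\]\[")


def pending_nodes(lines):
    """Yield (timestamp, state_version, [node names]) per publish timeout."""
    for line in lines:
        m = PUBLISH_TIMEOUT.match(line)
        if m:
            names = NODE_NAME.findall(m.group("pending"))
            yield m.group("ts"), int(m.group("version")), names


def count_by_node(lines):
    """Count how often each node shows up as pending."""
    counts = Counter()
    for _ts, _version, names in pending_nodes(lines):
        counts.update(names)
    return counts


if __name__ == "__main__":
    # Usage: python pending_nodes.py <path to the master's log file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for node, n in count_by_node(fh).most_common():
            print(node, n)
```

The timestamps of the first timeout per node can then be compared with the last line in that node's own log and with its garbage collection entries.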
