Discovery.zen.publish reports pending node, cluster gets red at index create


(Suny Kim) #1

Hi. Our cluster went red a week ago. Sorry in advance, I wasn't able to investigate any better.
In the evening, the master started reporting a pending node:
[2015-06-23 21:12:39,839][WARN ][discovery.zen.publish ] [xx104] timed out waiting for all nodes to process published state [649] (timeout [30s], pending nodes: [[xx114-y2][9YHYIAl6T82ZuMfg5a7oaA][xx114-y2][inet[/x.x.x.x:9300]]{disk_type=ssd, machine_id=es114, master=false}])
This message was repeated every couple of minutes until [2015-06-24 04:26:03,513], when the node was restarted.
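To see how often the warning repeats and which nodes are affected, the master's log can be scanned with a small parser. This is only a sketch: the regular expressions are derived from the single log line quoted above, and the function name `parse_publish_timeout` is made up for illustration. Other Elasticsearch versions may format the line differently.

```python
import re
from datetime import datetime

# Pattern built from the one warning quoted in this post (an assumption,
# not an official log format specification).
PUBLISH_TIMEOUT = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[WARN\s*\]\[discovery\.zen\.publish\s*\] "
    r"\[(?P<master>[^\]]+)\] timed out waiting for all nodes to process "
    r"published state \[(?P<version>\d+)\] "
    r"\(timeout \[(?P<timeout>[^\]]+)\], pending nodes: \[(?P<pending>.*)\]\)$"
)

# Each pending node is printed as [name][id][host][inet[...]]{attributes}.
PENDING_NODE = re.compile(r"\[([^\[\]]+)\]\[([^\[\]]+)\]\[[^\[\]]+\]\[inet\[")


def parse_publish_timeout(line):
    """Return details of a publish-timeout warning, or None for other lines."""
    match = PUBLISH_TIMEOUT.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),
        "master": match.group("master"),
        "version": int(match.group("version")),
        "timeout": match.group("timeout"),
        "pending_nodes": [
            name for name, _node_id in PENDING_NODE.findall(match.group("pending"))
        ],
    }
```

Applying `parse_publish_timeout` to every line of the master's log file and keeping the non-`None` results gives a timeline of cluster state versions and the nodes that did not acknowledge them.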
A few hours after the first warning, while the node was still pending, logstash tried to create a new index. At this point, the cluster went red.
[2015-06-24 02:01:01,193][DEBUG][action.admin.indices.create] [xx104] [logstash-2015.06.24] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (acquire index lock) within 1m
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.doRun(MetaDataCreateIndexService.java:150)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Then the pending node was restarted and the cluster went green again.
I couldn't find anything else in the logs, neither about how that node got half lost nor about why the cluster went red. I wasn't able to reproduce the pending situation.
It can't have been split brain, because we have minimum_master_nodes = 2.
Is it possible that the cluster didn't realize that it lost that node, tried to assign it a primary shard and went red because of this?
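If this happens again, one way to check the theory while the cluster is still red would be to look at the cluster health and the master's pending task queue. The sketch below uses the `_cluster/health` and `_cluster/pending_tasks` REST endpoints; the helper names, the base URL and the 30 second threshold are assumptions for illustration, not values from our setup.

```python
import json
from urllib.request import urlopen


def fetch_json(base_url, path):
    """GET a JSON document from the cluster (base_url is an assumption)."""
    with urlopen(base_url.rstrip("/") + path, timeout=10) as resp:
        return json.load(resp)


def summarize(health, pending, slow_ms=30000):
    """Condense the two API responses into the fields relevant here.

    health  -- parsed response of GET /_cluster/health
    pending -- parsed response of GET /_cluster/pending_tasks
    slow_ms -- hypothetical threshold for calling a queued task "stuck"
    """
    stuck = [
        task["source"]
        for task in pending.get("tasks", [])
        if task.get("time_in_queue_millis", 0) >= slow_ms
    ]
    return {
        "status": health.get("status"),
        "nodes": health.get("number_of_nodes"),
        "unassigned_shards": health.get("unassigned_shards"),
        "stuck_tasks": stuck,
    }


# Example call against a node on the default HTTP port (assumed address):
# base = "http://localhost:9200"
# print(summarize(fetch_json(base, "/_cluster/health"),
#                 fetch_json(base, "/_cluster/pending_tasks")))
```

A red status together with unassigned shards and a long-queued index creation task would at least be consistent with the new index's primary shards never being allocated.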


(Mark Walkom) #2

The node could have been overloaded. Is there anything about GC in the logs before this happened?
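A quick way to check is to filter each node's log for GC entries in the window before the first timeout. The sketch below assumes that GC warnings are written by the `monitor.jvm` logger and that every entry starts with the `[timestamp][LEVEL][logger]` prefix seen in the lines quoted above; the function name and the 30 minute window are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Prefix of a log entry, e.g. "[2015-06-23 21:12:39,839][WARN ][discovery.zen.publish ]".
LOG_PREFIX = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[(?P<level>\w+)\s*\]\[(?P<logger>[\w.]+)\s*\]"
)


def gc_lines_before(lines, incident, window=timedelta(minutes=30),
                    logger="monitor.jvm"):
    """Return entries from the given logger within `window` before `incident`.

    Lines without a timestamp prefix (stack trace frames, for example)
    are skipped.
    """
    start = incident - window
    hits = []
    for line in lines:
        match = LOG_PREFIX.match(line)
        if match is None or match.group("logger") != logger:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        if start <= ts <= incident:
            hits.append(line.rstrip("\n"))
    return hits
```

Passing the open log file as `lines` and `datetime(2015, 6, 23, 21, 12, 39)` as `incident` would list any GC entries from the half hour before the first publish timeout.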


(Suny Kim) #3

It's a bit spooky, but there are no garbage collection entries in the master's logs. The pending nodes occurred again some days later, but not for as long.
For the unresponsive non-master nodes, the log files simply end at the time of their first timeout. They do log garbage collections, but at other times.

