Hi. Our cluster went red a week ago. Apologies in advance, I wasn't able to investigate this more thoroughly at the time.
In the evening, the master started reporting a pending node:
[2015-06-23 21:12:39,839][WARN ][discovery.zen.publish ] [xx104] timed out waiting for all nodes to process published state [649] (timeout [30s], pending nodes: [[xx114-y2][9YHYIAl6T82ZuMfg5a7oaA][xx114-y2][inet[/x.x.x.x:9300]]{disk_type=ssd, machine_id=es114, master=false}])
This message was repeated every couple of minutes until [2015-06-24 04:26:03,513], when the node was restarted.
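For the record, next time this happens I plan to look at the master's pending cluster-state tasks while the timeouts are occurring. This is only a minimal sketch against the HTTP API (Python with requests; the host/port is a placeholder for one of our nodes, and I'm assuming the 1.x /_cluster/pending_tasks and /_cluster/health endpoints):

import json
import requests

ES = "http://localhost:9200"  # placeholder: any node of the cluster

# Cluster-state updates the master has queued but not yet applied.
# A growing queue usually means some node is not acking published states.
pending = requests.get(ES + "/_cluster/pending_tasks").json()
for task in pending.get("tasks", []):
    print(task["time_in_queue"], task["priority"], task["source"])

# Overall health while the publish timeouts are happening.
print(json.dumps(requests.get(ES + "/_cluster/health").json(), indent=2))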
Some hours later, while the publish timeouts were still going on, logstash tried to create a new index. At this point, the cluster went red.
[2015-06-24 02:01:01,193][DEBUG][action.admin.indices.create] [xx104] [logstash-2015.06.24] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (acquire index lock) within 1m
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.doRun(MetaDataCreateIndexService.java:150)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
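For completeness, this is the kind of check I would use to find out which index actually made the cluster red (again only a sketch over the HTTP API; the host is a placeholder):

import requests

ES = "http://localhost:9200"  # placeholder

# Per-index health; the red index (and its unassigned shards) should show up here.
health = requests.get(ES + "/_cluster/health", params={"level": "indices"}).json()
for name, idx in health.get("indices", {}).items():
    if idx["status"] != "green":
        print(name, idx["status"], "unassigned shards:", idx["unassigned_shards"])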
Once the pending node was restarted (at 04:26), the cluster went green again.
I couldn't find anything else in the logs, neither about how that node got into this half-lost state nor about why the cluster went red. I also wasn't able to reproduce the pending-node situation.
It can't have been a split brain, since we have discovery.zen.minimum_master_nodes set to 2.
Is it possible that the cluster didn't realize it had lost that node, tried to assign a primary shard to it, and went red because of that?
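To rule out a configuration problem on our side, here is how I would double-check the setting on every node and look for unassigned primaries (a sketch with a placeholder host; I'm assuming the 1.x node-info and _cat APIs with flat_settings):

import requests

ES = "http://localhost:9200"  # placeholder

# Confirm every node really runs with discovery.zen.minimum_master_nodes = 2.
nodes = requests.get(ES + "/_nodes/settings", params={"flat_settings": "true"}).json()
for node_id, info in nodes["nodes"].items():
    settings = info.get("settings", {})
    print(info["name"], settings.get("discovery.zen.minimum_master_nodes"))

# Any shard that is not STARTED; an UNASSIGNED primary is what turns the cluster red.
shards = requests.get(ES + "/_cat/shards", params={"h": "index,shard,prirep,state,node"})
for line in shards.text.splitlines():
    if line and "STARTED" not in line:
        print(line)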