Cluster stuck in Initializing

Hi,

EDIT: running ES 1.7
EDIT2: I launched a new node with the same settings as the nodes not receiving any data. The new node also gets no shards, and the cluster is still stuck. One node keeps producing the OutOfMemoryErrors, and I'm not sure what will happen if I restart it.

We run a large logging cluster with a few billion documents.
Last night our cluster had some issues. I'm still looking for the cause,
but that's not my main concern at the moment.

The cluster was still yellow this morning, with a few unassigned shards left after doing its best to recover.
I tried everything I could think of to allocate them (roughly the commands sketched below):
- turning allocation off and on at the index and cluster level
- setting replicas to 0 and back to 1
- restarting the node
- forcing allocation with a script

But nothing seems to work.
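For reference, the allocation toggle and the forced allocation looked roughly like this (index name, shard number, and node name are placeholders, not the real values):

# re-enable allocation cluster-wide
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : { "cluster.routing.allocation.enable" : "all" }
}'

# force one unassigned replica onto a specific node (placeholder index/shard/node)
curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
  "commands" : [ {
    "allocate" : {
      "index" : "logstash-2015.09.01",
      "shard" : 3,
      "node" : "data-node-01",
      "allow_primary" : false
    }
  } ]
}'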

After restarting one node with stuck shards, the unassigned count went from 35 to 485 (all on this node). Now no replica shard can be assigned.
Disk usage is fine.
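For what it's worth, disk usage per node was checked with something like:

curl 'http://localhost:9200/_cat/allocation?v'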

{ "cluster_name" : "prod", "status" : "yellow", "timed_out" : false, "number_of_nodes" : 12, "number_of_data_nodes" : 6, "active_primary_shards" : 4947, "active_shards" : 9380, "relocating_shards" : 4, "initializing_shards" : 38, "unassigned_shards" : 447, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0 }

The 38 initializing shards are the ones I forced; they are stuck now.
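I'm watching them with something like the cat recovery endpoint, but they never progress:

curl 'http://localhost:9200/_cat/recovery?v'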

I'm also seeing this exception, which we've never had before:

Failed to send error message back to client for action [internal:index/shard/recovery/start_recovery]
java.lang.OutOfMemoryError: Java heap space

There are no logs about long GC times, and the cluster is responsive and still working perfectly fine.
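To keep an eye on heap across the nodes I'm polling JVM stats roughly like this:

curl 'http://localhost:9200/_nodes/stats/jvm?pretty'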

And also this one:

Actual Exception
org.elasticsearch.indices.recovery.DelayRecoveryException: source node does not have the shard listed in its state as allocated on the node
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:108)
    at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

It would be OK to do a full cluster restart, IF I knew that it wouldn't get stuck for all shards (because after restarting one node, all shards of that node are stuck).

Help would be appreciated.