Unresponsive cluster: weird fluctuating behavior

Hello
For the past few days my ES cluster has been unresponsive, and I have noticed some weird fluctuating behavior.

If I periodically check the status, I see number_of_pending_tasks increasing (3 million and more) and, at the same time, unassigned_shards decreasing (down to around 5k). That part is what I expected, but the weird thing is that at a certain point everything falls back: unassigned_shards jumps back up to 10k and number_of_pending_tasks drops back to around 4k... and then it starts all over again...
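For what it's worth, the periodic check is just something along these lines (the host and the interval are placeholders):

# poll cluster health once a minute; localhost:9200 stands for one of my nodes
while true; do
  curl -s 'http://localhost:9200/_cluster/health?pretty'
  sleep 60
done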

15 minutes ago my cluster status was

{
  "cluster_name" : "sods",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 17969,
  "active_shards" : 30613,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 5333,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 3018887,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 19152497,
  "active_shards_percent_as_number" : 85.15911872705018
}

10 minutes ago

{
  "cluster_name" : "sods",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 17969,
  "active_shards" : 30857,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 5089,               <==== decreasing, ok
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 3385989,               <==== increasing, ok 
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 20548203,
  "active_shards_percent_as_number" : 85.83787693334817
}

And now

{
  "cluster_name" : "sods",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 17969,
  "active_shards" : 24204,
  "relocating_shards" : 0,
  "initializing_shards" : 8,
  "unassigned_shards" : 11736,                     <======== up again!
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 4660,                     <======== fallen down!
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 96240,
  "active_shards_percent_as_number" : 67.33058862801825
}

The same thing has happened several times over the last few days. Is this expected behavior?

I don't know what is going on. In the log files I see lots of ProcessClusterEventTimeoutException errors. I'm using Elasticsearch 2.0.
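The pending tasks queue itself can be inspected directly; something like this (host is a placeholder) shows a sample of what is sitting in it:

# dump the first part of the pending cluster tasks queue
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty' | head -n 40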

Thanks for any advice

You have way too many shards for a cluster that size, which is most likely contributing a lot to your cluster issues. Each shard is an instance of a Lucene index and carries with it a certain amount of overhead. Having a very large number of shards can therefore be very inefficient as it ties up a lot of cluster resources.
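As a quick way to see how those shards are spread across the nodes, something like the _cat/allocation API (available in 2.0; host is a placeholder) shows the shard count per data node:

# shards-per-node overview; ?v adds the column headers
curl -s 'http://localhost:9200/_cat/allocation?v'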

I would recommend reducing the number of shards so that you have hundreds per node rather than thousands and see how that affects the stability and behaviour of the cluster. There is unfortunately no exact limit on how many shards a node can handle; the right number depends on your use case.
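For new indices, one way to do this, assuming you use time-based indices, is an index template that lowers the shard count; the template name and index pattern below are just examples:

# new indices matching the pattern get a single primary shard
curl -XPUT 'http://localhost:9200/_template/fewer_shards' -d '
{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'

For existing indices there is no in-place way to reduce the shard count in 2.0, so you would have to reindex into new indices or delete/consolidate old ones.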

That's an understatement!