Unresponsive cluster: weird fluctuating behavior

Hello
For the past few days my ES cluster has been unresponsive, and I have noticed some weird fluctuating behavior.

If I periodically check the status, I see number_of_pending_tasks increasing (3 million and more) and, at the same time, unassigned_shards decreasing (down to around 5k). That part is what I expected, but the weird thing is that at a certain point everything falls back: unassigned_shards jumps back up to 10k and number_of_pending_tasks drops back to around 4k... and then it starts all over again...
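For what it's worth, the periodic check is just something along these lines (the host and the interval are placeholders):

# poll cluster health once a minute; localhost:9200 stands for one of my nodes
while true; do
  curl -s 'http://localhost:9200/_cluster/health?pretty'
  sleep 60
done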

15 minutes ago my cluster status was

{
  "cluster_name" : "sods",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 17969,
  "active_shards" : 30613,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 5333,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 3018887,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 19152497,
  "active_shards_percent_as_number" : 85.15911872705018
}

10 minutes ago

{
  "cluster_name" : "sods",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 17969,
  "active_shards" : 30857,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 5089,               <==== decreasing, ok
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 3385989,               <==== increasing, ok 
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 20548203,
  "active_shards_percent_as_number" : 85.83787693334817
}

And now

{
  "cluster_name" : "sods",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 17969,
  "active_shards" : 24204,
  "relocating_shards" : 0,
  "initializing_shards" : 8,
  "unassigned_shards" : 11736,                     <======== up again!
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 4660,                     <======== fallen down!
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 96240,
  "active_shards_percent_as_number" : 67.33058862801825
}

The same thing has happened several times over the last few days. Is this expected behavior?

I don't know what is going on. In the log files I see lots of ProcessClusterEventTimeoutException errors. I'm using Elasticsearch 2.0.
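The pending tasks queue itself can be inspected directly; something like this (host is a placeholder) shows a sample of what is sitting in it:

# dump the first part of the pending cluster tasks queue
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty' | head -n 40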

Thanks for any advice

You have way too many shards for a cluster that size, which is most likely contributing a lot to your cluster issues. Each shard is an instance of a Lucene index and carries with it a certain amount of overhead. Having a very large number of shards can therefore be very inefficient as it ties up a lot of cluster resources.
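As a quick way to see how those shards are spread across the nodes, something like the _cat/allocation API (available in 2.0; host is a placeholder) shows the shard count per data node:

# shards-per-node overview; ?v adds the column headers
curl -s 'http://localhost:9200/_cat/allocation?v'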

I would recommend reducing the number of shards so that you have hundreds per node rather than thousands and see how that affects the stability and behaviour of the cluster. There is unfortunately no exact limit on how many shards a node can handle; the right number depends on your use case.
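For new indices, one way to do this, assuming you use time-based indices, is an index template that lowers the shard count; the template name and index pattern below are just examples:

# new indices matching the pattern get a single primary shard
curl -XPUT 'http://localhost:9200/_template/fewer_shards' -d '
{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'

For existing indices there is no in-place way to reduce the shard count in 2.0, so you would have to reindex into new indices or delete/consolidate old ones.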

That's an understatement!