Hello
For the past few days my ES cluster has been unresponsive, and I have noticed a strange fluctuating behavior.
If I periodically check the cluster health, I see number_of_pending_tasks increasing (to 3 million and more) while unassigned_shards decreases (to about 5k). That is what I expected. The strange thing is that at a certain point everything resets: unassigned_shards goes back up to about 10k and number_of_pending_tasks drops back to about 4k... and so on, again and again.
15 minutes ago my cluster status was
{
"cluster_name" : "sods",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 5,
"active_primary_shards" : 17969,
"active_shards" : 30613,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 5333,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 3018887,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 19152497,
"active_shards_percent_as_number" : 85.15911872705018
}
10 minutes ago
{
"cluster_name" : "sods",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 5,
"active_primary_shards" : 17969,
"active_shards" : 30857,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 5089, <==== decreasing, ok
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 3385989, <==== increasing, ok
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 20548203,
"active_shards_percent_as_number" : 85.83787693334817
}
And now
{
"cluster_name" : "sods",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 4,
"number_of_data_nodes" : 4,
"active_primary_shards" : 17969,
"active_shards" : 24204,
"relocating_shards" : 0,
"initializing_shards" : 8,
"unassigned_shards" : 11736, <======== up again!
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 4660, <======== dropped!
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 96240,
"active_shards_percent_as_number" : 67.33058862801825
}
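To make the change between two checks easier to read, here is a minimal Python sketch that diffs two health snapshots. The helper name `diff_health` and the choice of watched fields are my own assumptions for illustration; the numbers are copied from the last two outputs above.

```python
# Fields to compare between two cluster health responses.
# This selection is an assumption; add or remove fields as needed.
WATCHED = (
    "number_of_nodes",
    "unassigned_shards",
    "number_of_pending_tasks",
)


def diff_health(before, after, fields=WATCHED):
    """Return {field: (old, new, delta)} for every watched field that changed."""
    changes = {}
    for name in fields:
        old, new = before[name], after[name]
        if old != new:
            changes[name] = (old, new, new - old)
    return changes


# Values taken from the "10 minutes ago" and "now" outputs above.
ten_minutes_ago = {
    "number_of_nodes": 5,
    "unassigned_shards": 5089,
    "number_of_pending_tasks": 3385989,
}
now = {
    "number_of_nodes": 4,
    "unassigned_shards": 11736,
    "number_of_pending_tasks": 4660,
}

for name, (old, new, delta) in diff_health(ten_minutes_ago, now).items():
    print(f"{name}: {old} -> {new} ({delta:+d})")
```

With these numbers the diff also shows number_of_nodes going from 5 to 4 between the last two checks. I can't say whether that is related, but it happens at the same moment the counters reset.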
The same thing has happened several times over the last few days. Is this expected behavior?
I don't know what is going on. In the log files I see lots of ProcessClusterEventTimeoutException entries. I'm using Elasticsearch 2.0.
Thanks for any advice