Hi all,
My Dev environment has two data nodes, with primaries = 1 and replicas = 1. When I shut down one of the data nodes, all remaining shards are promoted to primaries (as expected, I believe), the cluster state turns yellow (again as expected), and indexing and searches continue.
What is interesting here is that the replicas remain unassigned. They are not re-created on the one remaining active data node. Is this expected behavior?
What is more interesting is my Prod environment of 6 data nodes, with primaries = 6 and replicas = 1. When I stop one data node there, the cluster immediately goes red, not yellow, indexing stops (bad), and shards are VERY slowly initialized. After 5 minutes, the output of the cluster health API had not changed from this:
{
  "cluster_name": "elasticsearch-prod",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 12,
  "number_of_data_nodes": 5,
  "active_primary_shards": 2033,
  "active_shards": 3371,
  "relocating_shards": 0,
  "initializing_shards": 10,
  "unassigned_shards": 611,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 8
}
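For reference, output of this shape is what the cluster health endpoint (GET /_cluster/health) returns. Below is a minimal Python sketch for fetching and summarizing it while recovery is in progress. The base URL http://localhost:9200 and the helper names fetch_health and summarize are assumptions for illustration, not details of the setup described here.

```python
import json
import urllib.request


def fetch_health(base_url="http://localhost:9200"):
    """GET /_cluster/health and return the parsed JSON body as a dict.

    base_url is an assumed address; point it at any node in the cluster.
    """
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(health):
    """Build a one-line summary of the fields that matter during recovery."""
    return (
        "{status}: {active_primary_shards} primaries active, "
        "{initializing_shards} initializing, "
        "{unassigned_shards} unassigned"
    ).format(**health)
```

Calling summarize(fetch_health()) every minute or so would show whether the initializing and unassigned counts are moving at all.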
My most urgent question: why did the active replicas on the remaining 5 data nodes not instantly become primaries, which would have kept the cluster yellow instead of red and allowed indexing to continue uninterrupted?
Secondly, why would re-initializing the shards be so slow? Even after I restarted the stopped data node, initializing was extremely slow for what should be (I think) a local operation.
I very much appreciate your thoughts on this. As it stands, the cluster is very fragile: a single node going offline can stop all indexing!
After about 10 minutes of initializing, the cluster finally got to a yellow state again, but there were still 515 unassigned shards:
{
  "cluster_name": "elasticsearch-prod",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 13,
  "number_of_data_nodes": 6,
  "active_primary_shards": 2047,
  "active_shards": 3465,
  "relocating_shards": 0,
  "initializing_shards": 12,
  "unassigned_shards": 515,
  "number_of_pending_tasks": 2,
  "number_of_in_flight_fetch": 8
}
Roughly 100 shards in 10 minutes (unassigned dropped from 611 to 515) seems quite slow to me, and the fact that the cluster went red at all is very concerning. I must be doing something wrong.
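To put numbers on this, here is a small Python sketch that compares the two health outputs above. The shard counts are copied from those outputs. The 10-minute interval is only the approximate figure mentioned above, and the linear extrapolation at the end is a naive assumption rather than a prediction of how recovery actually proceeds.

```python
# Figures copied from the red and yellow health outputs above.
red = {"active_primary_shards": 2033, "active_shards": 3371,
       "initializing_shards": 10, "unassigned_shards": 611}
yellow = {"active_primary_shards": 2047, "active_shards": 3465,
          "initializing_shards": 12, "unassigned_shards": 515}

elapsed_minutes = 10  # "about 10 minutes"; an approximation, not a measurement

# Primaries that were not active while the cluster was red.
primaries_recovered = yellow["active_primary_shards"] - red["active_primary_shards"]

# Shard copies (primaries and replicas) that became active in the interval.
shards_activated = yellow["active_shards"] - red["active_shards"]

# Shards that left the unassigned state (some are still initializing).
unassigned_drop = red["unassigned_shards"] - yellow["unassigned_shards"]

# Observed activation rate, and a naive linear estimate for the remainder.
rate_per_minute = shards_activated / elapsed_minutes
minutes_remaining = yellow["unassigned_shards"] / rate_per_minute

print("primaries recovered:", primaries_recovered)
print("shards activated:", shards_activated)
print("unassigned drop:", unassigned_drop)
print("rate per minute:", round(rate_per_minute, 1))
print("naive minutes remaining:", round(minutes_remaining))
```

By these figures, 14 primaries were inactive while the cluster was red, and at the observed rate the remaining 515 unassigned shards would need the better part of an hour.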
Thank you again for your time and thoughts.
Chris