Our cluster consists of 17 nodes: 16 data nodes, 2 of which are also master-eligible, and one dedicated master. Due to human error, the cluster suffered a nearly simultaneous restart of 2 of the 3 master nodes. Since we have discovery.zen.minimum_master_nodes set to 3, the cluster went down. When the restarted masters re-joined the cluster, it marked 3000 shards (out of ~50k) as unassigned and started the recovery process. The pending task queue began to grow, reaching 1.5k tasks and 2.5 hours of lag over the next five hours.
After those five hours, the third master node was restarted. All shards were then marked as unassigned and recovery started from the beginning. The pending task queue shrank to 0 and then began to grow steadily again, this time reaching 20k tasks and more than 3 hours of lag within six hours. Most of the tasks looked like this:
{
  "insert_order": 3374,
  "priority": "URGENT",
  "source": "shard-started ([foo-2016-11-10-19][0], node[nefPZKFxS9SCIO2Q9KsmzA], [R], v[10], s[INITIALIZING], a[id=6PcxFdOlQlqVytZiSow8dg], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-11-30T01:41:15.481Z]]), reason [after recovery (replica) from node [{elastic-node-dc1-t3}{TJAsGT59RPiq9dCSztAuUg}{x.x.x.178}{x.x.x.178:9300}{rack_id=c1_a, master=false}]]",
  "executing": false,
  "time_in_queue_millis": 771072,
  "time_in_queue": "12.8m"
}
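For reference, the entry above is a single element of the tasks array returned by the pending tasks API; this is roughly how we've been sampling the queue (localhost:9200 is just illustrative of whichever node we hit):

# the JSON above is one element of the "tasks" array in this response;
# we watch the number of tasks and the oldest time_in_queue
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'

# number_of_pending_tasks in the health response gives the queue length at a glance
curl -s 'localhost:9200/_cluster/health?pretty'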
Due to the task queue saturation, new indices weren't being created, which was a big issue, since we use per-hour indices for our data.
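As a quick sanity check we look for the current hour's index by name; the foo-YYYY-MM-DD-HH pattern comes from the task source above, with "foo" standing in for our real index prefix:

# an empty result for the current hour means the create-index task is still stuck in the queue
curl -s 'localhost:9200/_cat/indices/foo-2016-11-30-*?v'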
As a result we decided to move 2 master nodes to dedicated hardware. We did it gradually, so this time the cluster suffered no downtime. The topology thus became: 16 dedicated data nodes and 3 dedicated master nodes. However, this didn't fix the issue: the task queue shrank to 0 again, but immediately began to grow at the same pace as before.
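For completeness, this is roughly how we verify the resulting topology, i.e. which nodes are data-only and which are the dedicated masters (assuming the standard 2.x cat columns):

# node.role shows "d" for data nodes and "-" for the dedicated masters;
# master shows "*" for the elected master and "m" for the other master-eligible nodes
curl -s 'localhost:9200/_cat/nodes?v&h=host,node.role,master,name'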
Given that, we set cluster.routing.allocation.enable to primaries (it took some time for the cluster to apply this setting, since it looks like it is also processed via the tasks/events mechanism). After all the shards were recovered, we finally saw new indices being created. Since then we've tried setting cluster.routing.allocation.enable back to all, but the task queue began to grow again, some odd relocations started (we have rack awareness enabled; the cluster was rebalancing shards between racks back and forth), and new indices could not be created again.
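For reference, we toggle that setting through the cluster settings API, roughly like this (transient, so it would be reset by a full cluster restart):

# switch allocation to primaries only; we use "all" to go back to normal
# (the other accepted values are "new_primaries" and "none")
curl -s -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "primaries" }
}'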
So right now we have cluster.routing.allocation.enable set to primaries, several hundred shards under-replicated, and no idea what to do next. Is waiting for the cluster to rebalance the only option? Why does the rebalance process stuff the task queue so heavily that hardly anything else can get done?
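This is roughly how we count the under-replicated shards at the moment (just grepping the cat shards output; the exact number fluctuates):

# replica shards that have no node assigned show up with state UNASSIGNED
curl -s 'localhost:9200/_cat/shards' | grep -c UNASSIGNED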
Some useful info:
- Elasticsearch: Version 2.3.2, Build b9e4a6a/2016-04-21T16:03:47Z, JVM 1.7.0_79
- Approximate number of indices: 6200
- Approximate number of shards: 50000