Number of pending tasks grows infinitely after cluster crash

Our cluster consists of 17 nodes: 16 data nodes, 2 of which are also master-eligible, plus one dedicated master. Due to human error, the cluster suffered an almost simultaneous restart of 2 of the 3 master-eligible nodes. Since we have discovery.zen.minimum_master_nodes set to 3, the cluster went down. When the restarted masters re-joined the cluster, it marked 3,000 shards (of 50k) as unassigned and started the recovery process. The pending task queue began to grow, reaching 1.5k tasks and 2.5 hours of lag over the next five hours.
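
For context, the only relevant line in our elasticsearch.yml is the quorum setting below (everything else omitted); with 3 master-eligible nodes, this value means no master can be elected once 2 of them are gone:

# elasticsearch.yml on the master-eligible nodes (trimmed to the quorum setting)
discovery.zen.minimum_master_nodes: 3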

Five hours later, the third master node was restarted. After that, all shards were marked as unassigned and recovery started from the beginning. The pending task queue shrank to 0 and then began to grow steadily, this time reaching 20k tasks and more than 3 hours of lag within six hours. Most of the tasks looked like this:

{
    "insert_order":3374,
    "priority":"URGENT",
    "source":"shard-started ([foo-2016-11-10-19][0], node[nefPZKFxS9SCIO2Q9KsmzA], [R], v[10], s[INITIALIZING], a[id=6PcxFdOlQlqVytZiSow8dg], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-11-30T01:41:15.481Z]]), reason [after recovery (replica) from node [{elastic-node-dc1-t3}{TJAsGT59RPiq9dCSztAuUg}{x.x.x.178}{x.x.x.178:9300}{rack_id=c1_a, master=false}]]",
    "executing":false,
    "time_in_queue_millis":771072,
    "time_in_queue":"12.8m"
}
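
For reference, the dump above comes from the pending tasks API; this is roughly how we pull it (host and port are placeholders for one of our nodes):

# returns {"tasks": [...]} with insert_order, priority, source and
# time_in_queue for every queued cluster-state update
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'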

Because the task queue was saturated, new indices couldn't be created, which was a big issue for us, since we use per-hour indices for our data.

As a result, we decided to move the 2 master roles to dedicated hardware. We did this gradually, so this time the cluster suffered no downtime. The topology became the following: 16 dedicated data nodes and 3 dedicated master nodes. However, this didn't fix the issue: the task queue shrank to 0 again, but immediately began to grow at the same pace as before.
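
For the record, the role split is just the usual node.master / node.data flags in elasticsearch.yml; a rough sketch of what we ended up with:

# elasticsearch.yml on the 3 dedicated master nodes
node.master: true
node.data: false

# elasticsearch.yml on the 16 dedicated data nodes
node.master: false
node.data: true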

Given that, we set cluster.routing.allocation.enable to primaries (it took some time for the cluster to apply this setting, since it looks like it's also done via the tasks/events mechanism). After all primary shards were recovered, we finally saw new indices being created. Since then we've tried setting cluster.routing.allocation.enable back to all, but the task queue began to grow again, some weird relocations started (we have rack awareness enabled, and the cluster was re-balancing shards between racks back and forth), and new indices could not be created again.
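
In case it matters, we toggled the setting through the cluster settings API, roughly like this (transient vs. persistent is our choice, not a requirement):

# allocate primaries only
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.enable": "primaries" }
}'

# later, re-enable full allocation
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.enable": "all" }
}'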

So right now we have cluster.routing.allocation.enable set to primaries, several hundred shards under-replicated, and no idea what to do next. Is waiting for the cluster to re-balance the only option? Why does the re-balance process fill the task queue so heavily that hardly anything else can get done?
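
(We're currently watching the recovery with nothing more than the standard health and cat APIs, roughly like this; host is a placeholder:)

# overall counters: unassigned / initializing / relocating shards, pending tasks
curl -s 'http://localhost:9200/_cluster/health?pretty'

# count shards that are still unassigned
curl -s 'http://localhost:9200/_cat/shards' | grep -c UNASSIGNED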

Some useful info:

  • Elasticsearch: Version: 2.3.2, Build: b9e4a6a/2016-04-21T16:03:47Z, JVM: 1.7.0_79
  • Approximate number of indices: 6200
  • Approximate number of shards: 50000

With only 16 data nodes, 50k shards is way too many (by at least a hundredfold). Each shard has to be managed by the master in terms of allocation/balancing, and each one requires per-shard resources on the data nodes (an explosion of file handles). The fix here is to dramatically reduce the number of shards.
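
(A quick way to quantify what each data node is carrying, if that helps; host is a placeholder:)

# shard count and disk usage per node (see the "shards" column)
curl -s 'http://localhost:9200/_cat/allocation?v'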

Thanks for your reply! What are the general recommendations here? Should we keep the number of shards under 500 per node? Or even less?

BTW, the number above is the total number of shards across all indices, if that makes any difference.

How large is the average shard?

The largest shards have around 100M docs and take around 60GB of storage each. On average, a shard holds about 1M docs and 750MB of storage.
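
(These numbers come from the cat APIs, roughly like this; host is a placeholder:)

# doc count and on-disk size in bytes for every shard, biggest first
curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,docs,store&bytes=b' | sort -rn -k5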

We recommend having a max of 50GB per shard, so it sounds like you need to consolidate some things.

Think about weekly/monthly indices for low volume data.
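
(For example, switching the low-volume data to weekly indices with a single shard could be done with an index template along these lines; the template name and pattern are just illustrative:)

# hypothetical template for low-volume weekly indices (ES 2.x template syntax)
curl -XPUT 'http://localhost:9200/_template/foo_weekly' -d '{
    "template": "foo-weekly-*",
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
    }
}'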
