Shard re-allocation taking a very long time


(Steve) #1

We had a situation where an engineer on my team recycled the ES service on a cluster node without disabling shard allocation or doing a sync/flush, and the node failed to come back online within the delay period. Once it did come back, the cluster started reassigning shards to that node. However, 5 days later it is still going (a little over half done), and it seems to be putting significant stress on the cluster, so much so that we can't index due to timeouts. We are also seeing a high number of queued tasks and a very high task wait time. Is there anything I can do to speed up the shard recovery/reallocation?


(David Turner) #2

Can you tell us a bit more about your cluster? What version are you using? How many nodes do you have? How much data? How many indices and shards do you have?

Have you adjusted any settings like indices.recovery.max_bytes_per_sec or cluster.routing.allocation.node_concurrent_recoveries?

Generally, if the cluster is struggling to handle both its usual workload and the recoveries, then slowing down the recoveries is a better plan than speeding them up.
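As a rough illustration of what slowing recoveries down could look like, the two settings mentioned above can be lowered through the cluster settings API. The sketch below only builds the request body. The values `20mb` and `1` are illustrative assumptions rather than recommendations from this thread, and `build_recovery_throttle` is a hypothetical helper name.

```python
import json


def build_recovery_throttle(max_bytes_per_sec="20mb", concurrent_recoveries=1):
    """Return a body for PUT /_cluster/settings that throttles shard recoveries.

    Both default values are illustrative assumptions, not tuned numbers.
    """
    if concurrent_recoveries < 1:
        raise ValueError("concurrent_recoveries must be at least 1")
    return {
        # "transient" settings are lost on a full cluster restart, which
        # makes them a cautious choice for a temporary throttle (assumption).
        "transient": {
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
            "cluster.routing.allocation.node_concurrent_recoveries": concurrent_recoveries,
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_recovery_throttle(), indent=2))
```

The resulting body can be sent with any HTTP client as `PUT /_cluster/settings`.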

Where are you seeing this information? Are you seeing anything in the server logs about things being slow or otherwise failing?


(Steve) #3

Hi David,

Thanks for the reply. It is a 3-node cluster running ES 6.2.4 on AWS EC2 m5.2xlarge instances (Kibana and Logstash run on separate instances from the 3 ES nodes). The store size is 219.7 GB, with 8362 indices and 48459 shards (20667 primaries). The only error I am seeing is:

    "type" : "failed_node_exception",
    "reason" : "Failed node [SViAF3eISjKdDKjq8JFiGA]",
    "caused_by" : {
      "type" : "exception",
      "reason" : "failed to refresh store stats",
      "caused_by" : {
        "type" : "file_system_exception",
        "reason" : "/var/lib/elasticsearch/nodes/0/indices/4VA3uADcRzOqLkWBRlm-Mw/1/index: Too many open files"

All 3 nodes have ulimit set to unlimited. I viewed the queued tasks and wait time numbers via the ES API (/_cat/health).
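A shell `ulimit` may not be the limit that the Elasticsearch service process actually runs with, so it can help to read the limit the node itself reports. The sketch below assumes a response shaped like the output of `GET /_nodes/stats/process`, which reports `open_file_descriptors` and `max_file_descriptors` per node. The sample data and the `fd_usage` helper name are made up for illustration.

```python
def fd_usage(nodes_stats):
    """Map node name -> (open, max, fraction used) from a nodes-stats response.

    Nodes that do not report a positive max_file_descriptors are skipped.
    """
    report = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        proc = node.get("process", {})
        open_fds = proc.get("open_file_descriptors")
        max_fds = proc.get("max_file_descriptors")
        if open_fds is None or max_fds is None or max_fds <= 0:
            continue
        report[node.get("name", node_id)] = (open_fds, max_fds, open_fds / max_fds)
    return report


if __name__ == "__main__":
    # Hypothetical sample; real values come from GET /_nodes/stats/process.
    sample = {
        "nodes": {
            "node-id-1": {
                "name": "es-node-1",
                "process": {
                    "open_file_descriptors": 65000,
                    "max_file_descriptors": 65536,
                },
            }
        }
    }
    for name, (used, limit, frac) in fd_usage(sample).items():
        print("{}: {}/{} descriptors ({:.0%})".format(name, used, limit, frac))
```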


(Christian Dahlqvist) #4

Wow. That is among the worst cases of oversharding I have seen. You should look to dramatically reduce your shard count, which will require reindexing. I do not understand how you have managed to get it this far without having it fall over earlier.

I would recommend the following blog posts:


(Steve) #5

Yeah, I didn't set it up, I just inherited it. Thanks for the links.


(David Turner) #6

Woah, yes, 50k shards is a lot. To put it mildly.

> I viewed the queued tasks and wait time number via the ES api (/_cat/health)

Ok, those numbers represent the tasks that the master node is trying to perform. Each time a shard changes state (e.g. unassigned -> recovering -> started) it enqueues a task to recompute the shard assignment. This is normally fine, but with 50k shards and a lot of recoveries ongoing I wouldn't be surprised if this is taking rather too long and the tasks are piling up.

How many tasks are there?

I think it'd be worth trying to disable rebalancing, because I think rebalancing is the slow bit of that computation. Set cluster.routing.rebalance.enable: none and see if the master starts to work its way through that backlog.
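For reference, a minimal sketch of applying that setting through the cluster settings API from Python. Using a `transient` setting (rather than `persistent`) is an assumption, as is the base URL passed by the caller. `rebalance_settings` and `put_cluster_settings` are hypothetical helper names, and the request uses only the standard library.

```python
import json
import urllib.request

ALLOWED_MODES = {"all", "primaries", "replicas", "none"}


def rebalance_settings(mode="none"):
    """Build the body for PUT /_cluster/settings that sets the rebalance mode."""
    if mode not in ALLOWED_MODES:
        raise ValueError("unknown rebalance mode: {!r}".format(mode))
    return {"transient": {"cluster.routing.rebalance.enable": mode}}


def put_cluster_settings(base_url, body, timeout=30):
    """Send the body to <base_url>/_cluster/settings and return the parsed reply.

    base_url is an assumption, e.g. "http://localhost:9200".
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the body; call put_cluster_settings(...) to actually apply it.
    print(json.dumps(rebalance_settings("none"), indent=2))
```

Passing `"all"` to `rebalance_settings` produces the body that turns rebalancing back on later.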

> Too many open files

That's worrying. Is your cluster actually making progress towards health or is it stuck in a spiral of failures caused by hitting this limit, resulting in another attempt at recovery?

It may be necessary to start closing some indices to get things under control.


(Steve) #7

This cluster houses syslog, filebeat, and metricbeat data (which is massive). We create indices on a daily basis, and we keep everything on 90-day retention (I have a Python script that runs daily to purge indices over 90 days old, 30 days for metricbeat due to the size).

Currently, there are 500+ tasks, but that has been growing and shrinking the last few days (this all started last Friday), and I've seen as high as 1200 pending tasks and as low as 150 just today.

The cluster seems to be making progress as the number of unassigned shards continues to drop (~7K atm down from 24K last Friday) - so yea it is moving very slowly, but does seem to be making progress.

I'm going to try turning off re-balancing and clean up some of the data...we'll see what effect that has.

I really appreciate all your help, I'm pretty new to elasticsearch so I am getting a crash course this week :smile:


(David Turner) #8

Ok, I think for now the best thing is to wait it out. 1200 tasks is not completely ludicrous given the shard count in this cluster. I think it's best just to look at how to stop this from happening again - see below.

Ok, looking at the numbers you have 48459 shards for 90 days of retention which means you're creating around 500 shards per day. Is that right? This seems very high. Can you tell us a bit more about the indices that are created each day? Are there lots of different syslog/filebeat/metricbeat indices created daily or is it a small number of indices each with far too many shards?

Also you only have 219.7GB of data so the average shard size is around 4.5MB. The recommended shard size is in the tens-of-GB range. Also this averages out as around 2.5GB per day, or 20GB per week. Could you move to weekly (or even monthly) indices to reduce your shard count and bring the size up to something a bit more efficient?
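The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic. This sketch uses only the figures quoted in the thread (48459 shards, 219.7GB, 90 days) and assumes decimal units (1 GB = 1000 MB). `shard_math` is a hypothetical helper name.

```python
def shard_math(total_shards, store_gb, retention_days):
    """Return rough per-day and per-shard averages for a time-based cluster."""
    gb_per_day = store_gb / retention_days
    return {
        "shards_per_day": total_shards / retention_days,
        "avg_shard_mb": store_gb * 1000 / total_shards,  # assumes 1 GB = 1000 MB
        "gb_per_day": gb_per_day,
        "gb_per_week": gb_per_day * 7,
    }


if __name__ == "__main__":
    for key, value in shard_math(48459, 219.7, 90).items():
        print("{}: {:.2f}".format(key, value))
```

With these inputs the weekly volume comes out a little under the rounded 20GB figure, which does not change the conclusion: a week of data would still fit comfortably in a single shard of the recommended size.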


(Steve) #9

I was able to change the rebalancing setting to none and I've been monitoring the cluster. At first it looked like it was helping, but now I have two nodes out of the cluster with the "too many open files" error.

I can make the necessary changes, and I can even completely delete a good chunk of the data as it isn't critical to us, although I can't via the API right now; it just times out. The indexing strategy is basically that each application creates an index per day, and the beats modules each create 1 index per day. We create approx 20-25 indices per day; some are in the tens of GB range, others are much smaller.


(Steve) #10

Given the current situation, what's your advice on the fastest way to get this cluster back to functioning? Should I add more nodes? Will that buy me some time to implement some changes, or just make things worse?


(David Turner) #11

Adding more nodes might help, since rebalancing is disabled, but might cause more pain.

I looked into how indices are deleted. The deletion is enqueued on the master at the same priority as a shard state change, so given that we think your master is swamped by shard state changes I think it makes sense that the deletions are taking a long time.

One possibility is that we could completely disable allocation, which I think would allow any ongoing recoveries to complete but would stop adding more to the pile. Once the master has worked through its pending task queue it should then be free to close or delete all the indices you no longer need. Once the cluster is down to a more manageable size you can then re-enable allocation and hope that it goes more smoothly.
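The sequence described above could be laid out as an ordered list of REST calls. This is only a sketch of the ordering: it builds the calls without sending them, the index names are hypothetical, using `transient` settings is an assumption, and in practice you would wait for the pending task queue to drain before moving past the second step.

```python
def recovery_plan(indices_to_delete):
    """Return (method, path, body) tuples in the order they would be issued."""

    def allocation(mode):
        return {"transient": {"cluster.routing.allocation.enable": mode}}

    steps = [
        # 1. Stop the master from scheduling any new shard allocations.
        ("PUT", "/_cluster/settings", allocation("none")),
        # 2. Poll this until the master's pending task queue is empty.
        ("GET", "/_cluster/pending_tasks", None),
    ]
    # 3. Delete the indices that are no longer needed.
    for name in indices_to_delete:
        steps.append(("DELETE", "/" + name, None))
    # 4. Re-enable allocation once the cluster is a more manageable size.
    steps.append(("PUT", "/_cluster/settings", allocation("all")))
    return steps


if __name__ == "__main__":
    # Hypothetical index names, for illustration only.
    for method, path, body in recovery_plan(["metricbeat-2019.01.01", "syslog-2019.01.01"]):
        print(method, path, body if body is not None else "")
```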


(Steve) #12

So, it appears as though 1 node will join the cluster for a few mins, then drop, then rejoin again a few mins later.


(Steve) #13

Wanted to give you an update. The default index settings (5 shards per index with 1 replica) were the root cause; however, the trigger was some bad code deployed by an intern that was creating a new index every 15 secs or so, each with only one tiny document. This code was deployed a few weeks ago, so you can do the sharding math there. We were definitely over-sharding to begin with, but we honestly hadn't seen any issues prior to this. Then this little gem exposed and exploited the default settings and brought the entire cluster down.

At one point the cluster was just too busy to keep indexing and we stopped capturing data, so I quickly stood up a new cluster (using the savvy developed over the past week of reading ;)) and redirected traffic to it so we wouldn't lose too much data. This worked, and I am now almost finished reindexing the data on the old cluster to the new cluster.

I put a default template in place that creates 1 shard and 1 replica per index, and I am monitoring the indices so we can determine which ones need more shards/replicas based on usage patterns and HA needs. Some of this data isn't worth holding for very long and doesn't need to be HA (a few of these get 1 shard and no replicas per index). We've managed to squeeze the cluster down to 20% of what it was previously. I also took the opportunity to upgrade to 6.6.2, and some of the new tools in Kibana are amazing.
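A default template along those lines might look like the sketch below, written for the 6.x template API (`PUT /_template/<name>`). The template name, the catch-all `*` pattern, and the `order` value are assumptions rather than the actual template used here. A catch-all pattern also applies to any other newly created index, so a narrower pattern may be safer.

```python
import json


def default_template(shards=1, replicas=1, patterns=("*",), order=0):
    """Build a 6.x index template body with a small default shard count."""
    if shards < 1 or replicas < 0:
        raise ValueError("shards must be >= 1 and replicas >= 0")
    return {
        "index_patterns": list(patterns),
        # A low order lets more specific templates override these defaults.
        "order": order,
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
    }


if __name__ == "__main__":
    # Hypothetical template name; send the body as PUT /_template/default_one_shard.
    print(json.dumps(default_template(), indent=2))
    # Variant for short-lived data that does not need to be HA (pattern is made up).
    print(json.dumps(default_template(replicas=0, patterns=("scratch-*",), order=1), indent=2))
```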

Long story, but, the relevant bits are I got the cluster back to stable without losing very much data at all, built a new cluster that is more robust than it previously was, and learned a metric ton about elasticsearch. Thanks again for all your help.


(David Turner) #14

Thanks for the update @stevemw and I'm glad to hear you've managed to get things under control.

I feel for your poor intern. I too made a production-halting mistake once when I was just starting out and it was not a pleasant experience, even though my colleagues were very supportive. It might be useful to let them know that you're not the only people to hit this kind of issue, and we've made some changes to try and improve the defaults to stop this from happening so often:

20% of 50k shards is 10k shards which still sounds like a lot. You should be aiming for a shard size in the 10s-of-GB range. If you're using time-based indices, perhaps consider using a longer time-range, or using rollover to create new indices based on their size rather than just their age.
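As a sketch of the rollover approach, the request below asks Elasticsearch to roll a write alias over to a new index once the current one passes a size or age threshold. The alias name and the `30gb` / `30d` thresholds are illustrative assumptions, and `rollover_request` is a hypothetical helper that only builds the call.

```python
def rollover_request(alias, max_size="30gb", max_age="30d"):
    """Return (method, path, body) for a rollover call against a write alias.

    The index rolls over when any one of the conditions is met.
    """
    if not alias:
        raise ValueError("alias must be a non-empty string")
    conditions = {}
    if max_size is not None:
        conditions["max_size"] = max_size
    if max_age is not None:
        conditions["max_age"] = max_age
    return ("POST", "/{}/_rollover".format(alias), {"conditions": conditions})


if __name__ == "__main__":
    # "logs-write" is a hypothetical alias pointing at the current write index.
    print(rollover_request("logs-write"))
```

A call like this would typically be issued on a schedule, since the conditions are only evaluated when the request is made.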


(Steve) #15

My apologies, when I said 20% I meant in actual size of the data store. From a shard perspective, we are far more efficient. Current cluster stats are as follows:

5 Nodes
Documents: 181,512,556
Disk Usage: 106.7 GB
Primary Shards: 341
Replica Shards: 189

There is still work to be done in terms of getting the average shard size up, we're working on that.

I don't know if I'm going to tell the intern about it or not. I just casually asked her to update that block of code and redeploy...she did it in about 15 mins. I wrote an RCA and mentioned it, but the root cause I defined was failure to properly manage the cluster, and that's on me...but it won't happen again :wink:


(David Turner) #16

Looks great, nice work.