Shard replication/recovery going slow

(Ant) #1

I recently had a node fail completely and had to rebuild it. There are only 3 nodes in the cluster, and it took 5 days for the cluster to get back to green status. Some indexes are a few MB and others are 300GB, and recovery looked as if it was rate limited on how many indexes or shards it would process in an hour: when it hit a patch of the smaller indexes it would sit pretty much idle, sending everything it was allowed to and then just waiting. In contrast, when it hit the bigger indexes there was a flurry of activity, and on the disk usage chart the usage would suddenly start shooting up.

I'm guessing there are some settings to control this, and I would like to configure them so that this blend of index sizes isn't such an issue for me. I also recently had to restart a node (which I may have to rebuild), and it took a day to mark all the shards as active, as each node hosts some 12k shards. If anyone knows which settings I need to look at to resolve this, that would be great. I know there is one that limits the transfer speed, but since recovery did saturate the NIC when it hit the larger indexes, I don't think that is the bottleneck. I guess I'm looking to increase how often it checks whether it is ready to send something else, and maybe the concurrency.

Thanks in advance

(Christian Dahlqvist) #2

It sounds like you simply have far too many shards for the size of your cluster. Have a look at this blog post about shards and sharding for guidance. A large number of indices and shards leads to a large cluster state, which can become slow to update on every change to shard allocation.
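
For reference, here is a minimal Python sketch of the two things discussed in this thread; it is not part of the original replies. It assumes the throttles the question alludes to are `indices.recovery.max_bytes_per_sec` (transfer speed) and `cluster.routing.allocation.node_concurrent_recoveries` (concurrency), and that shard counts come from rows shaped like the output of `GET _cat/shards?format=json`. The function names, the values `4` and `"100mb"`, and the sample rows are hypothetical and only for illustration. Raising the throttles does not address the cluster state overhead described above.

```python
import json
from collections import Counter


def build_recovery_settings(concurrent_recoveries, max_bytes_per_sec):
    """Build a request body for PUT _cluster/settings.

    The setting names are assumed to be the ones the question refers to;
    check them against the documentation for your Elasticsearch version.
    """
    return {
        "persistent": {
            "cluster.routing.allocation.node_concurrent_recoveries": concurrent_recoveries,
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
        }
    }


def shards_per_node(cat_shards_rows):
    """Count assigned shards per node.

    Expects rows shaped like GET _cat/shards?format=json, where each row
    has a "node" field. Unassigned shards (no node) are skipped.
    """
    return Counter(row["node"] for row in cat_shards_rows if row.get("node"))


if __name__ == "__main__":
    # Illustrative values only, not tuning recommendations.
    print(json.dumps(build_recovery_settings(4, "100mb"), indent=2))

    # Hypothetical sample rows standing in for real _cat/shards output.
    sample_rows = [
        {"index": "logs-a", "shard": "0", "prirep": "p", "node": "node-1"},
        {"index": "logs-a", "shard": "0", "prirep": "r", "node": "node-2"},
        {"index": "logs-b", "shard": "0", "prirep": "p", "node": "node-1"},
        {"index": "logs-b", "shard": "0", "prirep": "r", "node": None},
    ]
    for node, count in sorted(shards_per_node(sample_rows).items()):
        print(node, count)
```

The per-node count gives a quick way to see how far a cluster is from the guidance in the blog post; with some 12k shards per node, reducing the shard count is likely to help more than raising either throttle.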

(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.