Feature request to re-balance shards

Hi - I've been running an 8 node cluster with a 14.5 GB index made of
24 shards with 2 replicas. Here's how they all end up getting
distributed:

            host      shards     primary   size (MB)        docs
   index1.la.sol          16                    9933    13021626
   index2.la.sol          13           9        8057    10575338
   index3.la.sol           8           4        4940     6485001
   index4.la.sol           6           1        3668     4779460
   index5.la.sol          11           5        6806     8924327
   index6.la.sol          11           5        6865     9015712
   index7.la.sol           4                    2474     3267000
   index8.la.sol           3                    1852     2440371

The thing that bothers me is that over time, one or two nodes end up
with much more of the data than the others, and overall query
performance suffers because those nodes work a lot harder than the
rest and appear to drag everyone down, even though the other nodes
are much more lightly used.

To solve this I've been forcing nodes to re-allocate shards by
restarting them one at a time. This is not always ideal, however,
because while the nodes are restarting, the shard movement puts an
even bigger strain on the network, and for a few minutes performance
gets a lot worse. Also, I seem to end up having to restart at least
half of the nodes eventually, because when a single node goes down
its data isn't distributed evenly across the remaining nodes; for
some reason it piles up on the nodes that already hold the most data
(probably because they have been up the longest). So in the example
above, if I restart index1, index2 will likely end up with close to
10 GB of data and I'll have to restart that one next.

So my feature request would be something that re-allocates shards
evenly, taking into account the size of each shard in terms of
documents or megabytes. This could be something that gets called
on demand, to minimize the time spent re-balancing a layout that may
not be ideal but is still acceptable.

If that's too complicated to implement, a stop-gap measure would be
something that lets me tell index1, "hey, re-allocate only a few of
your shards someplace else", so that I don't end up with all of the
shards leaving index1 via my crude restart method. :slight_smile:
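
To make the kind of decision I have in mind concrete, here's a toy pass
over the per-node sizes from the table above (just a sketch of the
arithmetic, not anything Elasticsearch exposes today):

```python
# Toy sketch: from the per-node sizes (MB) in the table above, work out
# roughly how much each node is over or under the cluster mean, i.e.
# which nodes should hand shards off and which ones could take more.
sizes_mb = {
    "index1": 9933, "index2": 8057, "index3": 4940, "index4": 3668,
    "index5": 6806, "index6": 6865, "index7": 2474, "index8": 1852,
}
mean = sum(sizes_mb.values()) / len(sizes_mb)   # ~5574 MB per node

for node, mb in sorted(sizes_mb.items(), key=lambda kv: kv[1], reverse=True):
    delta = mb - mean
    action = f"should give away ~{delta:.0f} MB" if delta > 0 else f"could take ~{-delta:.0f} MB"
    print(f"{node:8} {mb:6} MB   {action}")
```

A restart moves everything off a node at once; something that just
shaved off the excess above the mean would be much gentler.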

Another possibility is a switch that changes the algorithm by which a
node that just went down gets its shards re-allocated. Right now it
appears to do something that ends up favoring the oldest node; maybe
it could be tweaked so that it dumps the shards preferentially on the
nodes that have the most free memory.
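
For instance (a toy illustration with made-up free-memory numbers, not
anything the allocator does today):

```python
# Toy illustration: pick the re-allocation target with the most free
# memory. The numbers are made up; a real allocator would refresh them
# from node stats.
free_mb = {"index3": 4100, "index4": 5200, "index7": 6900, "index8": 7300}
target = max(free_mb, key=free_mb.get)
print("send the next shard to", target)   # -> index8
```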

Let me know if you think any of this is reasonable to do, or if there
is some better way to solve this problem that I missed! :slight_smile:

BTW I'm running 0.15.2 on linux x86_64

Hi,

The logic used is to take all shards that exist (from all the different indices) and try to get to an even number of shards per node. There is no preference based on shard size or anything else. How did you check the number of shards allocated per node? Did you do it based on the cluster state? Can you gist it (with pretty set to true)?
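
Something along these lines is what I mean by checking it from the
cluster state (a rough sketch, assuming the default HTTP port 9200; the
exact routing_table layout can differ a bit between versions):

```python
# Rough sketch: count shards (and primaries) per node from the cluster
# state. Assumes the default HTTP port 9200 and the routing_table layout
# of recent responses; older versions may differ slightly.
import json
import urllib.request
from collections import Counter

state = json.load(urllib.request.urlopen(
    "http://localhost:9200/_cluster/state?pretty=true"))

names = {node_id: info["name"] for node_id, info in state["nodes"].items()}
shards, primaries = Counter(), Counter()

for index in state["routing_table"]["indices"].values():
    for copies in index["shards"].values():
        for copy in copies:
            node_id = copy.get("node")
            if node_id is None:          # unassigned copy, nothing to count
                continue
            shards[names[node_id]] += 1
            if copy["primary"]:
                primaries[names[node_id]] += 1

for node in sorted(shards):
    print(f"{node:20} {shards[node]:3} shards   {primaries[node]:3} primary")
```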

Any logic implemented should be something that the system constantly strives for, not a single custom movement of one shard from one node to another (because what happens the next time the rebalancing logic runs?).

That's not to say that other strategies can't be implemented: ones that take shard size into account when rebalancing (this is mainly relevant where multiple indices are running on the system). Also, the ability to control the maximum number of shards per node, per index, can come in handy. It's planned, just not there yet.

-shay.banon

Ok, managed to find the problem with the rebalancing: rebalancing does not properly take closed indices into account. A fix is being tested now. Here is the issue: Shard Allocation: Closed indices are not properly taken into account when rebalancing · Issue #858 · elastic/elasticsearch · GitHub.

-shay.banon

Thanks for finding this, Shay! As an aside, I noticed we can't delete
a closed index, so in order to delete the other index we need to
first open it and then delete it.
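
In case it helps anyone else, the workaround looks roughly like this (a
sketch; the port and index name are placeholders, since the real ones
aren't in this thread):

```python
# Sketch of the open-then-delete workaround. "old_index" and port 9200
# are placeholders for the actual closed index and HTTP port.
import urllib.request

def call(method, path):
    req = urllib.request.Request("http://localhost:9200" + path, method=method)
    with urllib.request.urlopen(req) as resp:
        print(method, path, "->", resp.status)

call("POST", "/old_index/_open")   # a closed index can't be deleted directly...
call("DELETE", "/old_index")       # ...so open it first, then delete it
```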

Pushed a fix to allow deleting a closed index.

Just to follow up - after deletion of the old closed index, the shard
balancing is perfect!

            host      shards     primary   size (MB)        docs
   index1.la.sol           9           3        5592     7311479
   index2.la.sol           9           2        5578     7312038
   index3.la.sol           9           5        5577     7314298
   index4.la.sol           9           4        5526     7169249
   index5.la.sol           9           4        5612     7409529
   index6.la.sol           9           1        5559     7312326
   index7.la.sol           9           4        5614     7362910
   index8.la.sol           9           1        5589     7317234