Hi - I've been running an 8-node cluster with a 14.5 GB index made of
24 shards with 2 replicas. Here's how they all end up getting
distributed:

host           shards  primary  size (MB)  docs
index1.la.sol  16      0        9933       13021626
index2.la.sol  13      9        8057       10575338
index3.la.sol   8      4        4940        6485001
index4.la.sol   6      1        3668        4779460
index5.la.sol  11      5        6806        8924327
index6.la.sol  11      5        6865        9015712
index7.la.sol   4      0        2474        3267000
index8.la.sol   3      0        1852        2440371
The thing that bothers me is that over time, one or two nodes end up
with much more of the data than the others, and overall query
performance suffers because those nodes are working a lot harder than
the rest and appear to be dragging them all down, even though the
others are much more lightly loaded.
To solve this I've been forcing nodes to re-allocate shards by
restarting them one at a time. This is not always ideal, however,
because while the nodes are restarting, the shard movement puts an
even bigger strain on the network, and for a few minutes performance
gets a lot worse. Also, it appears I end up having to
restart at least half of the nodes eventually because instead of
evenly distributing the data across the nodes when a single node goes
down, for some reason it ends up piling up the data onto the remaining
nodes that have the most data (probably because they have been up the
longest). So in the above example, if I restart index1, index2 will
likely end up with close to 10 GB of data and I'll have to restart
that one next.
So my feature request would be for something that re-allocates shards
evenly, taking into account the size of each shard in terms of
documents or megabytes. This could be something that gets called
on-demand, to minimize the amount of time spent re-balancing when the
distribution is not ideal but still acceptable.
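To make the idea concrete, here's a rough sketch of the kind of size-aware rebalancing I have in mind -- just illustrative Python, not actual Elasticsearch code, and the node/shard structures are made up:

```python
def rebalance(nodes, tolerance_mb=500):
    """Greedy size-aware rebalance sketch.

    nodes: dict mapping node name -> list of (shard_id, size_mb).
    Repeatedly moves the largest shard that fits from the most-loaded
    node to the least-loaded one, until the spread is within tolerance.
    Returns the list of (shard_id, src, dst) moves it decided on.
    """
    moves = []
    while True:
        load = {n: sum(size for _, size in shards)
                for n, shards in nodes.items()}
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        if load[src] - load[dst] <= tolerance_mb:
            break  # balanced enough, stop moving data around
        # Only consider shards whose move wouldn't just flip the imbalance.
        candidates = [s for s in nodes[src] if load[dst] + s[1] < load[src]]
        if not candidates:
            break  # every remaining shard is too big to help
        shard = max(candidates, key=lambda s: s[1])
        nodes[src].remove(shard)
        nodes[dst].append(shard)
        moves.append((shard[0], src, dst))
    return moves
```

Something along these lines, driven on-demand, would avoid my whole rolling-restart dance.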
If that's too complicated to implement, a stop-gap measure would be
something that lets me signal to index1, "hey, re-allocate only a few
of your shards someplace else," so that I don't end up with all of the
shards leaving index1 the way my crude restart method causes.
Another possibility is a switch that changes the algorithm by which a
node which just went down gets its shards re-allocated. Right now it
appears to do something that ends up favoring the oldest node; maybe
it could be tweaked so that it dumps the shards preferentially on the
nodes that have the most free capacity.
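For that last idea, the target-selection change could be as simple as a worst-fit pass like this (again just a hypothetical Python sketch with made-up capacity numbers, not how the actual allocator works):

```python
def reassign(orphaned_shards, capacity_mb, used_mb):
    """Assign each shard from a downed node to the node with the most
    free space, instead of favoring the longest-running node.

    orphaned_shards: list of (shard_id, size_mb), placed largest-first.
    capacity_mb / used_mb: dicts mapping node name -> megabytes.
    Returns a dict of shard_id -> chosen node; mutates used_mb.
    """
    assignments = {}
    for shard_id, size in sorted(orphaned_shards, key=lambda s: -s[1]):
        # Worst-fit: whichever node has the most free capacity right now.
        target = max(capacity_mb, key=lambda n: capacity_mb[n] - used_mb[n])
        used_mb[target] += size
        assignments[shard_id] = target
    return assignments
```

That way a restart wouldn't keep piling everything onto the node that's already the fullest.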
Let me know if you think any of this is reasonable to do, or if there
is some better way to solve this problem that I missed!
BTW, I'm running 0.15.2 on Linux x86_64.