Hi - I've been running an 8-node cluster with a 14.5 GB index split into
24 shards with 2 replicas. Here's how they all end up getting
distributed:
host            shards  primary  size (MB)      docs
index1.la.sol       16               9933   13021626
index2.la.sol       13        9      8057   10575338
index3.la.sol        8        4      4940    6485001
index4.la.sol        6        1      3668    4779460
index5.la.sol       11        5      6806    8924327
index6.la.sol       11        5      6865    9015712
index7.la.sol        4               2474    3267000
index8.la.sol        3               1852    2440371
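In case anyone wants to reproduce those numbers, something along these
lines pulls the per-node shard and primary counts out of the cluster
state API (just a sketch; I'm going from memory on the field names, so
they may not match 0.15 exactly, and the per-shard sizes would have to
come from the index status API instead):

    import json
    import urllib.request
    from collections import Counter

    # Fetch the cluster state; "routing_nodes" lists the shards assigned to each node.
    state = json.load(urllib.request.urlopen("http://localhost:9200/_cluster/state"))

    names = {nid: info["name"] for nid, info in state["nodes"].items()}
    shards, primaries = Counter(), Counter()
    for node_id, assigned in state["routing_nodes"]["nodes"].items():
        for shard in assigned:
            shards[names[node_id]] += 1
            if shard["primary"]:
                primaries[names[node_id]] += 1

    for host in sorted(shards):
        print(host, shards[host], primaries[host])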
The thing that bothers me is that over time, one or two nodes end up
with much more of the data than the others, and overall query
performance suffers: those nodes are working a lot harder than the rest
and appear to drag everything down, even though the other nodes are
much more lightly loaded.
To solve this I've been forcing nodes to re-allocate shards by
restarting them one at a time. This is not always ideal, however:
while the nodes are restarting, the shard movement puts an even bigger
strain on the network, and for a few minutes performance gets a lot
worse. It also appears I eventually have to restart at least half of
the nodes, because when a single node goes down its data is not spread
evenly across the others; for some reason it piles up on the remaining
nodes that already have the most data (probably because they have been
up the longest). So in the example above, if I restart index1, index2
will likely end up with close to 10 GB of data and I'll have to restart
that one next.
So my feature request would be something that re-allocates shards
evenly, taking into account the size of each shard in terms of
documents or megabytes. This could be something that gets called
on demand, so the cluster doesn't spend time re-balancing
distributions that aren't ideal but are still acceptable.
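For example (this endpoint and these parameters are completely made
up, just to show the kind of knob I mean):

    import json
    import urllib.request

    # Hypothetical on-demand, size-aware rebalance trigger -- not a real API,
    # just an illustration of the feature I'm asking for.
    body = json.dumps({"balance_by": "bytes", "max_concurrent_relocations": 2}).encode()
    req = urllib.request.Request("http://localhost:9200/_cluster/rebalance",
                                 data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)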
If that's too complicated to implement, a stop-gap measure would be
something that lets me tell index1, "hey, re-allocate just a few of
your shards someplace else," so that I don't end up with all of the
shards leaving index1 the way they do with my crude restart method.
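Again with made-up syntax, what I have in mind is roughly:

    import json
    import urllib.request

    # Hypothetical stop-gap: ask one specific node to hand off a few shards,
    # letting the cluster pick the destinations. Not a real API.
    body = json.dumps({"node": "index1.la.sol", "shards_to_move": 4}).encode()
    req = urllib.request.Request("http://localhost:9200/_cluster/drain_shards",
                                 data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)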
Another possibility is a switch that changes the algorithm by which a
node that just went down gets its shards re-allocated. Right now it
appears to do something that ends up favoring the oldest node; maybe
it could be tweaked so that it dumps the shards preferentially on the
nodes that have the most free memory.
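In other words, the target selection I'm picturing is nothing fancier
than this (sketch only, and "free memory" could just as well be free
disk or total index bytes):

    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        free_bytes: int
        shard_ids: set

    def pick_target(nodes, shard_id):
        # Skip nodes that already hold a copy of this shard, then prefer the
        # node with the most free space instead of the one that's been up longest.
        candidates = [n for n in nodes if shard_id not in n.shard_ids]
        return max(candidates, key=lambda n: n.free_bytes)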
Let me know if you think any of this is reasonable to do, or if there
is some better way to solve this problem that I missed!
BTW I'm running 0.15.2 on Linux x86_64.