We've been indexing documents into an index with 20 primary shards, but the shards have grown to very uneven sizes (smallest is 3.2 GB, largest 31.7 GB, i.e. roughly 10 times as large).
Investigating, I found that while the total set of documents is reasonably evenly spread, subsets of documents are not. For example, looking at the documents from one source, 143810 ended up on one shard but only 2215 on another.
We are using auto-generated IDs, which I believe should result in random assignment of documents to shards (and, as a result, an increasingly even distribution as the numbers grow).
Interesting to note: we are running several processes that send bulk-indexing requests in parallel. Could some kind of beat rhythm between them be interfering with the balancing?
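To make my assumption explicit, here is the mental model I have of that routing, as a quick Python sketch. It's a simplification (I'm using `mmh3` and random UUIDs as stand-ins; the real Elasticsearch routing has more moving parts, and its auto-generated IDs are not UUIDs), but it shows why I expected an even spread:

```python
# Simplified model of default routing: shard = murmur3(_id) % number_of_shards.
# Random UUIDs stand in for auto-generated IDs. Requires: pip install mmh3
import uuid
from collections import Counter

import mmh3

NUM_SHARDS = 20

counts = Counter(
    mmh3.hash(uuid.uuid4().hex) % NUM_SHARDS  # Python's % is floor-mod, so non-negative
    for _ in range(200_000)
)
print(min(counts.values()), max(counts.values()))
# min and max land within a few percent of 10,000 (= 200,000 / 20).
```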
It sounds like your different sources have documents that vary quite a lot in size and that the overall number of documents is reasonably evenly spread across the shards, is that correct? How many documents does each shard approximately hold?
The sizes do vary, but it's not as if a few huge documents are inflating a single shard.
As I said, when looking at one "stream" of documents (i.e., documents from a single process, indexed over a continuous time period), I get the following per-shard document counts:
shard 0: 35104
shard 1: 14943
shard 2: 2215
shard 3: 57419
shard 4: 29889
shard 5: 13848
shard 6: 36612
shard 7: 14910
shard 8: 5590
shard 9: 14604
shard 10: 36055
shard 11: 80231
shard 12: 6882
shard 13: 14052
shard 14: 143810
shard 15: 13911
shard 16: 13995
shard 17: 10975
shard 18: 2402
shard 19: 13507
That's 65 times as many documents for shard 14 as for shard 2.
Overall, our documents also have different retention periods depending on their source. Currently the total document count per shard is reasonably even, ranging from 1742000 to 2016000, but as we remove documents from some sources this imbalance will increase.
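As a back-of-the-envelope check (my own calculation, nothing Elasticsearch-specific), those per-shard counts are nowhere near what uniform random routing would produce:

```python
# Chi-square-style check of the per-shard counts above against a uniform
# spread. Under random routing the statistic should be around the degrees
# of freedom (19); here it lands in the hundreds of thousands, driven
# largely by shard 14.
observed = [35104, 14943, 2215, 57419, 29889, 13848, 36612, 14910,
            5590, 14604, 36055, 80231, 6882, 14052, 143810, 13911,
            13995, 10975, 2402, 13507]
expected = sum(observed) / len(observed)  # ~28,000 documents per shard
chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(f"expected per shard: {expected:.0f}, chi-square: {chi2:.0f}")
```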
I do not think there is any guarantee that any specific subset of data is evenly distributed. If the overall distribution of data across the shards in the index is reasonably even, I would say it is working as expected.
Do bulk requests contain data from multiple sources, or is each source indexed independently?
I was expecting a hash function to give an even spread. Are you suggesting the auto-generated IDs intentionally create hash collisions?
The problem is that if we index and then remove some documents, the spread gets very uneven. And we're already seeing a very uneven spread of on-disk shard sizes.
Why would documents that come from a single process all end up on a few shards? The processes run independently; they don't see each other's indexing.
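For what it's worth, even sequential IDs from a single process should spread evenly under a murmur3-style hash (same simplified routing model as in my sketch above, so again just illustrative):

```python
# Sequential per-process IDs still hash to an even spread under the
# simplified shard = murmur3(id) % 20 model. Requires: pip install mmh3
from collections import Counter

import mmh3

counts = Counter(mmh3.hash(f"proc-7-doc-{i}") % 20 for i in range(500_000))
print(min(counts.values()), max(counts.values()))
# Both land within a few percent of 25,000 (= 500,000 / 20).
```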
One thing that can cause imbalances is the use of parent-child relationships, as these require routing. Is this something you are using?
Are you running an aggregation to get the document counts?
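If it helps, one way to get per-source, per-shard counts directly, without relying on an aggregation, is to run the same filtered search once per shard using the `_shards:<n>` search preference. A sketch (the host, index name, field name and source value are placeholders for your setup, and the `hits.total.value` path assumes Elasticsearch 7+):

```python
# Count documents from a single source on each shard individually,
# using the "_shards:<n>" search preference.
import requests

ES = "http://localhost:9200"   # placeholder
INDEX = "my_index"             # placeholder
SOURCE_FIELD = "source"        # placeholder
SOURCE_VALUE = "stream-1"      # placeholder

for shard in range(20):
    resp = requests.post(
        f"{ES}/{INDEX}/_search",
        params={
            "preference": f"_shards:{shard}",
            "size": 0,
            "track_total_hits": "true",
        },
        json={"query": {"term": {SOURCE_FIELD: SOURCE_VALUE}}},
    )
    resp.raise_for_status()
    print(shard, resp.json()["hits"]["total"]["value"])
```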