Very uneven distribution of docs across shards

We've been indexing documents into an index with 20 primary shards, but the shards have grown to very uneven sizes (the smallest is 3.2 GB, the largest 31.7 GB, so about 10 times as large).

Investigating, I found that while the total set of documents is spread reasonably evenly, subsets of documents are not. For example, looking at the set of documents from one source, 143810 documents ended up on one shard and only 2215 on another.
We are using auto-generated ids, which I believe should result in a random assignment of documents to shards (and, for larger numbers, an increasingly even distribution).
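
To illustrate what I'd expect from that, here's a quick sketch (this is not Elasticsearch's actual Murmur3-based routing, just a stand-in hash) showing that hashing random ids onto 20 shards gives a near-uniform spread:

import hashlib
import random
import string
from collections import Counter

NUM_SHARDS = 20  # same as our index

def shard_for(doc_id):
    # Stand-in for hash(_routing) % number_of_shards; the real hash is
    # Murmur3, this only illustrates the principle.
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

random_ids = ("".join(random.choices(string.ascii_letters + string.digits, k=20))
              for _ in range(500_000))
counts = Counter(shard_for(doc_id) for doc_id in random_ids)
print(sorted(counts.values()))  # all 20 counts come out close to 25000

With anything like that behaviour I'd expect at most a few percent deviation between shards, not the factor we're seeing.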

Interesting to note: we are running several processes that send bulk indexing requests in parallel. Could it be that there is some beat rhythm between them interfering with the balancing?

Or am I overlooking something else?

Are you using routing at index time?

No, just using the defaults (which, if I understand the docs correctly, route based on a hash of the id).

It sounds like your different sources have documents that vary quite a lot in size, and that the overall number of documents is spread reasonably evenly across the shards, is that correct? Approximately how many documents does each shard hold?

The sizes do vary, but it's not as if a few huge documents are inflating a single shard.

As I said, when looking at one "stream" of documents (i.e. documents from a single process, indexed over a continuous time period), I get the following distribution of document counts per shard:
shard 0: 35104
shard 1: 14943
shard 2: 2215
shard 3: 57419
shard 4: 29889
shard 5: 13848
shard 6: 36612
shard 7: 14910
shard 8: 5590
shard 9: 14604
shard 10: 36055
shard 11: 80231
shard 12: 6882
shard 13: 14052
shard 14: 143810
shard 15: 13911
shard 16: 13995
shard 17: 10975
shard 18: 2402
shard 19: 13507

That's 65 times as many documents for shard 14 as for shard 2.
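
Quick arithmetic over those numbers (same counts as above):

counts = [35104, 14943, 2215, 57419, 29889, 13848, 36612, 14910, 5590, 14604,
          36055, 80231, 6882, 14052, 143810, 13911, 13995, 10975, 2402, 13507]
average = sum(counts) / len(counts)
print(max(counts) / min(counts))   # ~65: largest vs. smallest shard
print(max(counts) / average)       # ~5: shard 14 holds over five times the average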

Overall, our documents also have different periods during which they must be stored. Currently, the total number of documents per shard is reasonably spread, from 1742000 to 2016000 documents, but as we remove documents from some sources this imbalance will increase.

I do not think there is any guarantee that any specific subset of the data is evenly distributed. If the overall distribution of data across the shards in the index is reasonably even, I would say it is working as expected.

Do bulk requests contain data from multiple sources, or is each source indexed independently?

I was expecting a hash function to give an even spread. Are you suggesting the auto-generated ids intentionally create hash collisions??

The problem is that if we index and then remove some documents, the spread gets very uneven. And we're already seeing a very uneven spread of shard sizes on disk.

Why would documents that come from a single process end up concentrated on just a few shards? The processes run independently, so they don't see each other's indexing.

No, I am not.

One thing that can cause imbalances is if you are using parent-child relationships as this requires the use of routing. Is this something you are using?
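
For reference, explicit routing looks roughly like this (a sketch using the Python requests library; the index and field names are made up). Every document indexed with the same routing value is hashed to the same shard, which can easily skew shard sizes:

import requests

ES = "http://localhost:9200"   # placeholder cluster address

# All documents indexed with routing=user123 land on the same shard.
doc = {"user": "user123", "message": "hello"}   # made-up fields
requests.post(f"{ES}/myindex/_doc", params={"routing": "user123"}, json=doc)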

Are you running an aggregation to get the document counts?

No, my setup is quite simple.

I'm not using custom ids, not using custom routing, not using any parent/child relationships.

To get the counts, I'm using a simple count query, checking each shard manually:

GET myindex/_count?preference=_shards:0
{"query": ...}

where the query just selects documents from a single source
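
In case it's relevant, this is roughly the loop I'm running (sketched with the Python requests library; the host and the source field/value are placeholders for our actual setup):

import requests

ES = "http://localhost:9200"   # placeholder for our cluster address
QUERY = {"query": {"term": {"source": "stream-42"}}}   # stand-in for the real source filter

for shard in range(20):
    resp = requests.get(
        f"{ES}/myindex/_count",
        params={"preference": f"_shards:{shard}"},
        json=QUERY,
    )
    print(shard, resp.json()["count"])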
