We've been indexing documents into an index with 20 primary shards, but the shards have grown to very uneven sizes (smallest is 3.2 GB, largest 31.7 GB, i.e. roughly 10 times as large).
Investigating, I found that while the total set of documents is reasonably evenly spread, subsets of documents are not. For example, looking at the documents from one source, 143810 ended up on one shard but only 2215 on another.
We are using auto-generated IDs, which I believe should result in random assignment of documents to shards (and, as a result, an increasingly even distribution as the numbers grow).
Interesting to note: we are running several processes that send bulk-indexing requests in parallel. Could some kind of beat rhythm between them be interfering with the balancing?
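To make my assumption explicit, here is the mental model I have of that routing, as a quick Python sketch. It's a simplification (I'm using `mmh3` and random UUIDs as stand-ins; the real Elasticsearch routing has more moving parts, and its auto-generated IDs are not UUIDs), but it shows why I expected an even spread:

```python
# Simplified model of default routing: shard = murmur3(_id) % number_of_shards.
# Random UUIDs stand in for auto-generated IDs. Requires: pip install mmh3
import uuid
from collections import Counter

import mmh3

NUM_SHARDS = 20

counts = Counter(
    mmh3.hash(uuid.uuid4().hex) % NUM_SHARDS  # Python's % is floor-mod, so non-negative
    for _ in range(200_000)
)
print(min(counts.values()), max(counts.values()))
# min and max land within a few percent of 10,000 (= 200,000 / 20).
```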
It sounds like your different sources have documents that vary quite a lot in size and that the overall number of documents is reasonably evenly spread across the shards, is that correct? How many documents does each shard approximately hold?
The sizes do vary, but it's not as if a few huge documents are inflating a single shard.
As I said, when looking at one "stream" of documents (i.e., documents from a single process, indexed over a continuous time period), I get the following per-shard document counts:
shard 0: 35104
shard 1: 14943
shard 2: 2215
shard 3: 57419
shard 4: 29889
shard 5: 13848
shard 6: 36612
shard 7: 14910
shard 8: 5590
shard 9: 14604
shard 10: 36055
shard 11: 80231
shard 12: 6882
shard 13: 14052
shard 14: 143810
shard 15: 13911
shard 16: 13995
shard 17: 10975
shard 18: 2402
shard 19: 13507
That's 65 times as many documents for shard 14 as for shard 2.
Overall, our documents also have different retention periods depending on their source. Currently the total document count per shard is reasonably even, ranging from 1742000 to 2016000, but as we remove documents from some sources this imbalance will increase.
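As a back-of-the-envelope check (my own calculation, nothing Elasticsearch-specific), those per-shard counts are nowhere near what uniform random routing would produce:

```python
# Chi-square-style check of the per-shard counts above against a uniform
# spread. Under random routing the statistic should be around the degrees
# of freedom (19); here it lands in the hundreds of thousands, driven
# largely by shard 14.
observed = [35104, 14943, 2215, 57419, 29889, 13848, 36612, 14910,
            5590, 14604, 36055, 80231, 6882, 14052, 143810, 13911,
            13995, 10975, 2402, 13507]
expected = sum(observed) / len(observed)  # ~28,000 documents per shard
chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(f"expected per shard: {expected:.0f}, chi-square: {chi2:.0f}")
```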
I do not think there is any guarantee that any specific subset of data is evenly distributed. If the overall distribution of data across the shards in the index is reasonably even, I would say it is working as expected.
Do bulk requests contain data from multiple sources, or is each source indexed independently?
I was expecting a hash function to give an even spread. Are you suggesting the auto-generated IDs intentionally create hash collisions?
The problem is that if we index and then remove some documents, the spread gets very uneven. And we're already seeing a very uneven spread of on-disk shard sizes.
Why would documents that come from a single process all end up on a few shards? The processes run independently; they don't see each other's indexing.
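For what it's worth, even sequential IDs from a single process should spread evenly under a murmur3-style hash (same simplified routing model as in my sketch above, so again just illustrative):

```python
# Sequential per-process IDs still hash to an even spread under the
# simplified shard = murmur3(id) % 20 model. Requires: pip install mmh3
from collections import Counter

import mmh3

counts = Counter(mmh3.hash(f"proc-7-doc-{i}") % 20 for i in range(500_000))
print(min(counts.values()), max(counts.values()))
# Both land within a few percent of 25,000 (= 500,000 / 20).
```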
One thing that can cause imbalances is the use of parent-child relationships, as these require routing. Is this something you are using?
Are you running an aggregation to get the document counts?
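If it helps, one way to get per-source, per-shard counts directly, without relying on an aggregation, is to run the same filtered search once per shard using the `_shards:<n>` search preference. A sketch (the host, index name, field name and source value are placeholders for your setup, and the `hits.total.value` path assumes Elasticsearch 7+):

```python
# Count documents from a single source on each shard individually,
# using the "_shards:<n>" search preference.
import requests

ES = "http://localhost:9200"   # placeholder
INDEX = "my_index"             # placeholder
SOURCE_FIELD = "source"        # placeholder
SOURCE_VALUE = "stream-1"      # placeholder

for shard in range(20):
    resp = requests.post(
        f"{ES}/{INDEX}/_search",
        params={
            "preference": f"_shards:{shard}",
            "size": 0,
            "track_total_hits": "true",
        },
        json={"query": {"term": {SOURCE_FIELD: SOURCE_VALUE}}},
    )
    resp.raise_for_status()
    print(shard, resp.json()["hits"]["total"]["value"])
```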