Does forcing/manually setting the ID of a document make the shards unevenly distributed?

Java2avaj · October 25, 2022, 7:03am

Currently, we set the document ID ourselves when storing documents in the index. Is this bad practice? Does it make the shards unevenly distributed?
Does letting Elasticsearch autogenerate the ID make the shards evenly distributed?

Christian_Dahlqvist · October 25, 2022, 7:25am

Using an external document ID can make updates and deletes more efficient, as you do not need to search for the document when you already know the ID. However, as every insert can potentially be an update, Elasticsearch has to check whether the ID already exists, which typically slows down indexing throughput, especially as shards grow in size.

The shard is determined based on a hash of the document ID, so a custom document ID scheme could result in unevenly balanced shards.

Letting Elasticsearch autogenerate the document ID typically results in quite evenly distributed shards.

Java2avaj · October 25, 2022, 7:57am

Thanks for your response. Just a follow-up question: is it possible to "recalculate" or reassign documents to other shards to achieve an even distribution of data between the shards while still using an external document ID?

warkolm · October 26, 2022, 1:18am

Only if you change the ID. Otherwise the document will be allocated to the same shard.

Java2avaj · October 27, 2022, 3:46am

I saw that the formula for determining which shard a document is stored on is:

routing = _routing != null ? _routing : _id
routing_factor = num_routing_shards / num_primary_shards
shard_num = (hash(_routing) % num_routing_shards) / routing_factor

So that means that even if we manually set the ID of the document, it still goes through a hash function and will not end up on one shard only, so the documents are still distributed evenly across the shards. Is this correct?

In other words, even if the ID is autogenerated by Elasticsearch, we still cannot predict which shard the document will be placed on, because of the hash function.
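
The formula quoted above can be tried out with a small Python sketch. This is only an illustration under stated assumptions: MD5 stands in for the hash function Elasticsearch actually uses, the function names `shard_for` and `distribution` are made up for this example, and the values 5 and 640 for the primary and routing shard counts are assumed, not taken from the thread. Real shard assignments will therefore differ; the sketch only shows how the arithmetic maps an ID to a shard number.

```python
import hashlib
from collections import Counter


def shard_for(doc_id, num_primary_shards, num_routing_shards, routing=None):
    """Apply the routing formula from the post to a single document."""
    # routing = _routing != null ? _routing : _id
    effective_routing = routing if routing is not None else doc_id
    # Assumption: MD5 is only a stand-in for Elasticsearch's real hash function.
    digest = hashlib.md5(effective_routing.encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:4], "big")
    # routing_factor = num_routing_shards / num_primary_shards
    routing_factor = num_routing_shards // num_primary_shards
    # shard_num = (hash(_routing) % num_routing_shards) / routing_factor
    return (hashed % num_routing_shards) // routing_factor


def distribution(ids, num_primary_shards=5, num_routing_shards=640):
    """Count how many of the given IDs land on each shard."""
    return Counter(
        shard_for(doc_id, num_primary_shards, num_routing_shards) for doc_id in ids
    )


if __name__ == "__main__":
    # Auto-incremented IDs, as one of the formats mentioned in the thread.
    sequential_ids = [str(n) for n in range(10_000)]
    for shard, count in sorted(distribution(sequential_ids).items()):
        print(f"shard {shard}: {count} docs")
```

With a well-mixing hash, even sequential IDs spread across all shards, and the same ID always maps to the same shard, which is why a document cannot move unless its ID or routing value changes.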

Christian_Dahlqvist · October 27, 2022, 5:46am

You can use routing to override shard selection by document ID hash when indexing a document. This, however, means that you will also need to provide the same routing value every time you want to update or delete the document in the future. It does not necessarily solve the problem, though, as you can still get an uneven distribution. The best way would probably be to improve the way you create document IDs.

What is the problem you are looking to solve? How uneven is your current distribution? What is your current document ID format?
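
The caveat about routing still producing uneven shards can be illustrated with a rough simulation. Everything here is hypothetical: the tenant names, the document counts, the shard counts (5 primary, 640 routing), and the use of MD5 as a stand-in for Elasticsearch's real hash function are all assumptions made for the sake of the example.

```python
import hashlib
from collections import Counter

NUM_PRIMARY_SHARDS = 5    # assumed value, not from the thread
NUM_ROUTING_SHARDS = 640  # assumed value, not from the thread


def shard_for(routing_value):
    """Map a routing value (or a document ID) to a shard number."""
    # Assumption: MD5 is only a stand-in for Elasticsearch's real hash function.
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:4], "big")
    routing_factor = NUM_ROUTING_SHARDS // NUM_PRIMARY_SHARDS
    return (hashed % NUM_ROUTING_SHARDS) // routing_factor


# Hypothetical workload: one large tenant owns 9,000 documents,
# and 100 small tenants share the remaining 1,000.
docs = [(f"doc-{n}", "tenant-big") for n in range(9_000)]
docs += [(f"doc-{9_000 + n}", f"tenant-{n % 100}") for n in range(1_000)]

# Shard chosen from the document ID (no custom routing).
by_id = Counter(shard_for(doc_id) for doc_id, _ in docs)
# Shard chosen from a custom routing value (the tenant name).
by_routing = Counter(shard_for(tenant) for _, tenant in docs)

if __name__ == "__main__":
    print("by document ID:  ", sorted(by_id.items()))
    print("by routing value:", sorted(by_routing.items()))
```

Because every document sharing a routing value hashes to the same shard, a skewed set of routing values yields skewed shards no matter how well the hash mixes.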

Java2avaj · October 27, 2022, 6:13am

Since we will be applying this to all of our existing indices, the format of the document IDs varies. Some are strings, and some are auto-incremented numbers.

warkolm · October 29, 2022, 5:32am

Why not just use the default approach (autogenerated IDs) for that and add a custom field that holds your ID value?

