Does forcing/manually setting the id of the document make the shards unevenly distributed?

  1. Currently, we are forcing the values of id when storing documents in the index. Is this a bad practice? Does it make the shards unevenly distributed?

  2. Does having Elasticsearch autogenerate its own id make the shards evenly distributed?

Using an external document ID can make updates and deletes more efficient, because you already know the ID and do not need to search for the document first. However, as every insert can potentially be an update, Elasticsearch has to check whether the ID already exists, which typically slows down indexing throughput, especially as shards grow in size.

The shard is determined based on a hash of the document ID so a custom document ID scheme could result in unevenly balanced shards.

Letting Elasticsearch autogenerate the IDs typically results in quite evenly distributed shards.

Thanks for your response. Just a follow-up question: is it possible to "recalculate" or reassign documents to other shards to achieve an even distribution of data between the shards while still using an external document ID?

Only if you change the ID. Otherwise the document will be routed to the same shard again.

I saw that the formula for determining which shard a document is stored on is:

routing = _routing != null ? _routing : _id
routing_factor = num_routing_shards / num_primary_shards
shard_num = (hash(routing) % num_routing_shards) / routing_factor

So that means that even if we manually force the id of the document, it still goes through a hash function and will not end up on one shard only, so the documents should still be spread evenly across the shards. Is this correct?

In other words, even if the id is autogenerated by Elasticsearch, we still cannot predict which shard the document will be placed on, because of the hash function.
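
For anyone who wants to experiment with the formula quoted above, here is a rough Python sketch. It is only an illustration under assumptions: `zlib.crc32` is a stand-in hash and not the hash Elasticsearch actually uses, the shard counts (5 primaries, 640 routing shards) are example values, and `shard_for` and `distribution` are hypothetical helper names. The counts it prints say something about your ID format under this stand-in hash, not about a real cluster.

```python
import zlib
from collections import Counter


def shard_for(routing: str, num_primary_shards: int, num_routing_shards: int) -> int:
    """Apply the quoted formula using a stand-in hash (CRC32, not Elasticsearch's own)."""
    routing_factor = num_routing_shards // num_primary_shards
    hashed = zlib.crc32(routing.encode("utf-8"))  # non-negative 32-bit integer
    return (hashed % num_routing_shards) // routing_factor


def distribution(ids, num_primary_shards=5, num_routing_shards=640):
    """Count how many of the given IDs land on each shard number."""
    return Counter(shard_for(i, num_primary_shards, num_routing_shards) for i in ids)


if __name__ == "__main__":
    # Two made-up ID schemes: plain auto-incremented numbers and prefixed strings.
    sequential = [str(n) for n in range(10_000)]
    prefixed = [f"order-{n:08d}" for n in range(10_000)]
    print("sequential:", sorted(distribution(sequential).items()))
    print("prefixed:  ", sorted(distribution(prefixed).items()))
```

Feeding a sample of your real IDs into `distribution` gives a quick feel for how a hash-then-modulo scheme spreads them. The same ID always maps to the same shard number, which is why only changing the ID moves a document.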

You can use routing to override shard selection by document ID hash when indexing a document. This, however, means that you will also need to provide the same routing value every time you want to update or delete the document in the future. It does not necessarily solve the problem though, as you can still get an uneven distribution if some routing values cover far more documents than others. The best way would probably be to improve the way you create document IDs.
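
To illustrate why custom routing can still leave shards uneven, here is a small Python simulation. It is a sketch under assumptions: `zlib.crc32` stands in for the real hash, the shard counts are example values, and the tenant names and the 90% skew are made-up data.

```python
import zlib
from collections import Counter
from typing import Optional


def shard_for(doc_id: str, routing: Optional[str] = None,
              num_primary_shards: int = 5, num_routing_shards: int = 640) -> int:
    """Pick a shard from the routing value, falling back to the document ID.

    CRC32 is a stand-in hash for illustration, not what Elasticsearch uses.
    """
    key = routing if routing is not None else doc_id
    routing_factor = num_routing_shards // num_primary_shards
    return (zlib.crc32(key.encode("utf-8")) % num_routing_shards) // routing_factor


# Hypothetical data: "tenant-big" owns 900 of the 1,000 documents.
docs = [(f"doc-{n}", "tenant-big" if n % 10 else f"tenant-{n}") for n in range(1_000)]

by_id = Counter(shard_for(doc_id) for doc_id, _ in docs)
by_routing = Counter(shard_for(doc_id, routing=tenant) for doc_id, tenant in docs)

print("routed by _id:   ", sorted(by_id.items()))
print("routed by tenant:", sorted(by_routing.items()))
```

Every document that shares a routing value lands on the same shard, so one large tenant puts at least 900 documents on a single shard here. It also shows why the routing value has to be supplied again later: without it, the shard would be computed from the ID and could differ.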

What is the problem you are looking to solve? How uneven is your current distribution? What is your current document ID format?

Since we will be applying this to all of our existing indices, the format of the document ID varies. Some are strings, and some are auto-incremented numbers.

Why not just use the default approach for that and make another custom field with your id value?
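
A minimal sketch of that idea in Python, building a Bulk API request body by hand. The index name `my-index`, the field name `external_id` and the helper `bulk_lines` are hypothetical. The only Elasticsearch behaviour relied on is that an `index` action without an `_id` gets an autogenerated ID.

```python
import json


def bulk_lines(index_name, records, id_field="external_id"):
    """Build Bulk API NDJSON that omits _id so Elasticsearch generates one.

    The original identifier is kept in an ordinary field (hypothetical name).
    """
    lines = []
    for original_id, source in records:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # no "_id" key
        lines.append(json.dumps({**source, id_field: str(original_id)}))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


payload = bulk_lines("my-index", [(42, {"title": "first"}), ("abc-7", {"title": "second"})])
print(payload)
```

The trade-off is the one mentioned earlier in the thread: with autogenerated IDs you would have to search on `external_id` to find a document before updating or deleting it.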
