Using an external document ID can make updates and deletes more efficient, because you already know the ID and do not need to search for the document first. However, because every insert can potentially be an update, Elasticsearch has to check whether the ID already exists, which typically slows down indexing throughput, especially as shards grow in size.
The shard is determined based on a hash of the document ID so a custom document ID scheme could result in unevenly balanced shards.
In practice, hashing the document ID typically results in quite evenly distributed shards.
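To make the hashing idea concrete, here is a rough sketch of how a document ID maps to a shard. It assumes the simplified form `shard = hash(id) % number_of_primary_shards` and uses CRC32 as a stand-in hash; Elasticsearch itself uses a Murmur3-based hash, so the actual shard numbers will differ. The helper names `shard_for` and `distribution` are made up for illustration.

```python
import zlib
from collections import Counter


def shard_for(routing_value: str, num_shards: int) -> int:
    """Map a routing value (by default the document ID) to a shard number.

    Stand-in hash: CRC32. This is NOT the Murmur3 hash Elasticsearch uses,
    so it only illustrates the principle, not the real placement.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def distribution(ids, num_shards):
    """Count how many of the given IDs land on each shard."""
    counts = Counter(shard_for(doc_id, num_shards) for doc_id in ids)
    return [counts.get(shard, 0) for shard in range(num_shards)]


NUM_SHARDS = 5  # hypothetical number of primary shards

# Auto-incremented IDs, as strings, similar to what some indices use.
auto_increment_ids = [str(n) for n in range(10_000)]

# Per-shard document counts for this ID scheme.
print(distribution(auto_increment_ids, NUM_SHARDS))
```

Running something like this against a sample of your own IDs gives a quick feel for how a given ID format spreads out, although only the real cluster will tell you the actual distribution.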
Thanks for your response. Just a follow-up question: is it possible to "recalculate" or reassign documents to other shards to achieve an even distribution of data between the shards while still using an external document ID?
So that means that even if we manually set the ID of the document, it will still go through a hash function and will not end up on one shard only, so the data is still evenly distributed across shards. Is this correct?
In other words, even if the ID is auto-generated by Elasticsearch, we still cannot guarantee which shard the document will be placed on, because of the hash function.
You can use routing to override shard selection by document ID hash when indexing a document. This means, however, that you will also need to provide the same routing value every time you want to update or delete the document in the future. It does not necessarily solve the problem, though, as you can still get an uneven distribution. The best way would probably be to improve the way you create document IDs.
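As a small illustration of the routing point, the sketch below builds the request paths for indexing and later deleting the same document with a `routing` query parameter. It only constructs the paths and does not talk to a cluster. The index name `orders`, the document ID `order-42`, the routing value `customer-7`, and the helper `doc_path` are all hypothetical.

```python
from urllib.parse import quote, urlencode


def doc_path(index, doc_id, routing=None):
    """Build the path for a single-document request, optionally with routing."""
    path = f"/{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}"
    if routing is not None:
        # The same routing value must be sent on index, update and delete.
        path += "?" + urlencode({"routing": routing})
    return path


# Index the document with a custom routing value...
index_request = ("PUT", doc_path("orders", "order-42", routing="customer-7"))
# ...and supply the same value again when deleting it later.
delete_request = ("DELETE", doc_path("orders", "order-42", routing="customer-7"))

print(*index_request)   # PUT /orders/_doc/order-42?routing=customer-7
print(*delete_request)  # DELETE /orders/_doc/order-42?routing=customer-7
```

Note that all documents sharing a routing value land on the same shard, so a routing value with a few very large groups can make the distribution worse rather than better.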
What is the problem you are looking to solve? How uneven is your current distribution? What is your current document ID format?
Since we will be applying this to all of our existing indices, the format of the document ID varies. Some are strings, and some are auto-incremented numbers.