How does Elasticsearch map Integer doc IDs to shards? What algorithm does it use?
I've been digging around but haven't seemed to find the answer.
For context, every document we have has an id that's a row id from postgres. We're seeing skew on our nodes, likely because many rows have been deleted at different id ranges of this table in the past and if it's simply doing a modulo the documents may not evenly distribute.
Afaiu that post is for auto-generated _ids. Our integer ids from postgres are used as the id for the document, which seems to be the same as _id therefore we don't have auto-generated ids.
I'm wondering how these Integer ids/_ids will map to a shard.
A document is routed to a particular shard in an index using the following formula:
shard_num = hash(_routing) % num_primary_shards
The default value used for _routing is the document’s _id.
So in your case you are using your row number as _id if I understand
As to the exact hash function you would need to look at the code, there are a lot of reasons that nodes / shards can skew over time, the deletions you spoke could be part of it. You can reindex etc if it is really causing problems etc...
We've had this indexing for a week so no go on reindexing, we'll just end up where we left off.
We turned off our job though and after waiting some time the shards taking more storage balanced out in size much closer to the same size as the rest of the shards, this makes me think that it's not documents that are unbalanced, but which documents are being updated.
The are design tradeoffs in every system if you need to use the _id for update perhaps you will need to figure out another way to balance the shards... It's hard to say without knowing all your requirement (which I am not asking for )
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.