How does Elasticsearch map Integer doc IDs to shards

How does Elasticsearch map Integer doc IDs to shards? What algorithm does it use?

I've been digging around but haven't seemed to find the answer.

For context, every document we have has an id that's a row id from postgres. We're seeing skew on our nodes, likely because many rows have been deleted at different id ranges of this table in the past and if it's simply doing a modulo the documents may not evenly distribute.

Thanks

Do the responses in this topic help? What algorithm is ElasticSearch create Document _Id based on?Could somebody answer me,plz

Afaiu that post is for auto-generated _ids. Our integer ids from postgres are used as the id for the document, which seems to be the same as _id therefore we don't have auto-generated ids.

I'm wondering how these Integer ids/_ids will map to a shard.

Look here

A document is routed to a particular shard in an index using the following formula:

shard_num = hash(_routing) % num_primary_shards

The default value used for _routing is the document’s _id .

So in your case you are using your row number as _id if I understand
As to the exact hash function you would need to look at the code, there are a lot of reasons that nodes / shards can skew over time, the deletions you spoke could be part of it. You can reindex etc if it is really causing problems etc...

What is the hash function?

We've had this indexing for a week so no go on reindexing, we'll just end up where we left off.

We turned off our job though and after waiting some time the shards taking more storage balanced out in size much closer to the same size as the rest of the shards, this makes me think that it's not documents that are unbalanced, but which documents are being updated.

Perhaps take a look at [this] (https://www.elastic.co/blog/efficient-duplicate-prevention-for-event-based-data-in-elasticsearch)

It talks about the concepts.

As to the actual hash function you will need to look it up in the code it's all open.

It could be this one but I am not positive

Perhaps autogenerated IDs might be a better solution and keep the row id as a term for quick lookup.

We have to bulk update documents at a high pace. If the ID is autogenerated then how do I tell Elasticsearch what documents to replace?

The are design tradeoffs in every system if you need to use the _id for update perhaps you will need to figure out another way to balance the shards... It's hard to say without knowing all your requirement (which I am not asking for :wink: )

Perhaps your postgresql could generate uuids

Updates often need to be carefully considered.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.