How does Elasticsearch map Integer doc IDs to shards

rex-remind · January 16, 2021, 6:50pm

How does Elasticsearch map Integer doc IDs to shards? What algorithm does it use?

I've been digging around but haven't seemed to find the answer.

For context, every document we have has an id that's a row id from postgres. We're seeing skew on our nodes, likely because many rows have been deleted at different id ranges of this table in the past and if it's simply doing a modulo the documents may not evenly distribute.

Thanks

warkolm · January 16, 2021, 11:16pm

Do the responses in this topic help? "What algorithm is ElasticSearch create Document _Id based on?Could somebody answer me，plz"

rex-remind · January 17, 2021, 1:42am

AFAIU that post is about auto-generated _ids. We use our integer ids from Postgres as the document id, which appears to be the same thing as _id, so our ids are not auto-generated.

I'm wondering how these Integer ids/_ids will map to a shard.

stephenb · January 17, 2021, 3:14am

Look here

A document is routed to a particular shard in an index using the following formula:

shard_num = hash(_routing) % num_primary_shards

The default value used for _routing is the document’s _id .
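
To make the formula concrete, here is a minimal Python sketch. The hash is a stand-in (CRC32 from the standard library), not Elasticsearch's actual function, and the id pattern is a made-up example. It only illustrates why hashing the `_id` first behaves differently from a plain modulo on the row id when ids have gaps.

```python
import zlib
from collections import Counter

def shard_for(routing: str, num_primary_shards: int) -> int:
    """shard_num = hash(_routing) % num_primary_shards.

    CRC32 is a stand-in hash for illustration only; it is NOT the
    function Elasticsearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

def plain_modulo(row_id: int, num_primary_shards: int) -> int:
    """What routing would look like if it were just row_id % shards."""
    return row_id % num_primary_shards

# Hypothetical id pattern: only every 5th row id survived deletions.
surviving_ids = range(0, 10_000, 5)
num_shards = 5

by_modulo = Counter(plain_modulo(i, num_shards) for i in surviving_ids)
by_hash = Counter(shard_for(str(i), num_shards) for i in surviving_ids)

print("plain modulo:", dict(by_modulo))  # every id lands on one shard
print("hashed first:", dict(by_hash))    # ids spread over several shards
```

With a hash in front of the modulo, gaps or regular patterns in the ids should not translate directly into shard skew, which is why the hash step matters for the question above.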

So in your case you are using your row number as the _id, if I understand correctly.
As to the exact hash function, you would need to look at the code. There are a lot of reasons that nodes and shards can skew over time; the deletions you mentioned could be part of it. You can reindex if it is really causing problems.

rex-remind · January 17, 2021, 3:26am

What is the hash function?

We've had this indexing for a week, so reindexing is a no-go; we'd just end up where we left off.

We turned off our job, though, and after some time the shards that were taking more storage settled to nearly the same size as the rest. That makes me think it's not the documents that are unbalanced, but which documents are being updated.

stephenb · January 17, 2021, 4:27am

Perhaps take a look at [this](https://www.elastic.co/blog/efficient-duplicate-prevention-for-event-based-data-in-elasticsearch)

It talks about the concepts.

As to the actual hash function, you will need to look it up in the code; it's all open.

It could be this one but I am not positive

github.com

elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/common/hash/MurmurHash3.java
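
For experimenting offline, here is a rough Python sketch of a 32-bit Murmur3 hash applied to an integer id. It rests on assumptions this thread does not confirm: that routing hashes the `_id` as a string, over its UTF-16 little-endian bytes, with seed 0, and that the result is reduced with a plain modulo as in the formula quoted earlier. `guess_shard` is a hypothetical helper name; check the Elasticsearch source before relying on the numbers.

```python
MASK32 = 0xFFFFFFFF

def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 of data, returned as a signed int (like a Java int)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    body_len = len(data) & ~3          # length rounded down to 4 bytes
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[body_len:]             # 0 to 3 leftover bytes
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
    # Finalization mix
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h - (1 << 32) if h & 0x80000000 else h

def guess_shard(doc_id, num_primary_shards: int) -> int:
    """Hypothetical helper: ASSUMES string ids, UTF-16LE bytes, seed 0."""
    data = str(doc_id).encode("utf-16-le")
    # Python's % is already a floor modulo, so negative hashes are fine.
    return murmur3_x86_32(data) % num_primary_shards

for row_id in (1, 2, 3, 1000, 1001):
    print(row_id, "->", guess_shard(row_id, 5))
```

To check a guess like this against a real cluster, compare its output with the shard a search with a matching `routing` value actually hits.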


Perhaps autogenerated IDs would be a better solution, keeping the row id as a term field for quick lookup.
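
As a rough illustration of that suggestion, the sketch below builds a bulk request body whose `index` actions omit `_id` (so Elasticsearch generates one) and stores the Postgres row id in an ordinary field. The index name `my-index`, the field name `row_id`, and both helper functions are made up for the example, and nothing is sent to a cluster.

```python
import json

def bulk_index_lines(index_name: str, rows: list) -> str:
    """Build an NDJSON bulk body; omitting _id lets Elasticsearch assign one."""
    lines = []
    for row in rows:
        # Action line: no "_id" key, so the id is auto-generated.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: keep the Postgres row id as a normal field.
        source = {"row_id": row["id"]}
        source.update({k: v for k, v in row.items() if k != "id"})
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # bulk bodies end with a newline

def lookup_by_row_id(row_id: int) -> dict:
    """Term query body for finding a document by its stored row id."""
    return {"query": {"term": {"row_id": row_id}}}

rows = [{"id": 7, "name": "alpha"}, {"id": 9, "name": "beta"}]
print(bulk_index_lines("my-index", rows))
print(json.dumps(lookup_by_row_id(7)))
```

The catch is that replacing a document then needs a lookup or query by `row_id` first, instead of addressing it directly by `_id`.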

rex-remind · January 17, 2021, 5:14am

We have to bulk update documents at a high pace. If the ID is autogenerated then how do I tell Elasticsearch what documents to replace?

stephenb · January 17, 2021, 6:58pm

There are design tradeoffs in every system. If you need to use the _id for updates, perhaps you will need to figure out another way to balance the shards. It's hard to say without knowing all your requirements (which I am not asking for).

Perhaps your PostgreSQL could generate UUIDs.

Updates often need to be carefully considered.

system · February 14, 2021, 6:59pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
