Is there an easy way to get the shard of a document?

linkerc · July 20, 2022, 6:32pm

Hello:
I often get a document directly via /index/_doc/_id.
Is there a way to display the shard number this particular document resides?

I can get the shard via "explain":true if I'm using search query. But it doesn't work for direct access to a document.

Thanks in advance.

Christian_Dahlqvist · July 21, 2022, 7:12am

Why would you need this?

linkerc · July 21, 2022, 6:09pm

It's for our application. This info is known already, it seems weird that we need to calculate it again from reading the source code. And the code could change in future releases.
I have submitted a feature request. Provide an access function to map _id to shard number. · Issue #88661 · elastic/elasticsearch · GitHub.
Not sure how much traction it would get.
I really need to have this info. Please help pushing for exposing this data if possible.
Ideally, I wish to have a function simply telling me what shard the doc would reside if I give the function the _id (or whatever routing field value).

warkolm · July 21, 2022, 10:35pm

Why not just use routing then?

linkerc · July 22, 2022, 12:01am

Don't quiet understand your suggestion.
I'm not inserting data. I'm hoping to retrieve a document and knowing which shard it belongs to.

warkolm · July 22, 2022, 12:03am

Right, but if you want to do that then why not just use routing from the start, then you know.

Christian_Dahlqvist · July 22, 2022, 7:09am

I still do not understand why you would ever need this. The github issue mentions some application level optimisation but it is mot clear whatvthat is. Could you describe what you are looking to do and the rationale behind needing this?

linkerc · July 22, 2022, 6:34pm

There are 2 items here.

Why I need/want this.
Can an API provide this?

What I'm asking for is #2. It seems very easy to provide this. And I don't see the downside. Maybe I'm overlooking something that might be bad to expose such info.

As for #1, I can see why people are interested. It's complicated since I'm trying to prove a theory. If my theory is correct, it could fail proof our scaling going forward.
I try not to write an essay here, so please bare with me that I'm just going to summarize why I want this.

This is tied to a question I posted here: Question regarding size of bulk action.
If I have a cluster of 100 data nodes. My biggest index shard is also 100. How efficient is bulk write?

If I can know the final destination shard, I can optimize my bulk write to say just 5 shards and having 20 parallel bulk writing tasks.

As to @warkojm's question, I don't want to be responsible for shard balancing. That's not the point. I just need the shard number for updating a document primarily. So when I read a doc, it would be nice to also include the shard # in the payload.
For document insertion, all I need is to randomly generated the _id and the new API I'm requesting could tell me the destination shard. Then I can forward the doc to the appropriate ES writing task, etc.

So my assumption is when our cluster grows to above 50 data nodes, bulkwrite is very inefficient within the ES cluster. There are too many sockets writing small payloads. Insertion queue overflow will become an issue. We have few large indices that have the same shard number as the data nodes.
This is also another question of mine on how can folks have cluster size of multiple hundred of data nodes. I can see large ES cluster serving many low shard count indices. But not large indices.

We started to encounter queue overflow several months ago. We ended up solving it by growing the bulkWrite payload size. But the down side is the writer task's RAM is getting pressure (OOM exceptions). So there's a balancing act we are doing here. Eventually, there's a limit on how large the payload a bulkwrite can support.

Am I wrong here?
What I really want is for ES to handle this internally.
If the ingestion only nodes can buffer the bulkwrite in local disk by sorting destination shards, then I don't need to worry about this. The only downside I can think of is increased index time. Data won't get lost since it's stored transiently in ingestion nodes. I'm guessing it's too much to ask for; therefore, exposing the shard number should be a good alternative and it seems attainable.

Christian_Dahlqvist · July 22, 2022, 6:48pm

The bulk API does as far as I recall not allow you to specify shard.

If you have immutable documents and want to optimise bulk writes into indices with a large number of primary shards, you can generate a random routing string for the whole batch and set this for all documemnts in the bulk request. This ensures they will all go to the a single shard. As long as the generated routing string is random the distribution of requests across shards should be uniform. The drawback of this is that you need to know the routing value if you want to perform direct updates or deletes, so you would need to store this externally.

You could potentially also calculate a number of predefined routing strings that resolve to specific shards and associate this with data partitions or processes and use these when writing or updating documents. It sounds like this existing functionality might do something similar to what you described.

If you are partitioning data you can also send this to multiple different indices, which would allow you to group writes and updates without using routing.

Most very large clusters I have come across are used for storing immtable data in multiple time-based indices, which tend to scale well and likely is different to your use case.

linkerc · July 22, 2022, 7:07pm

I think there's a misunderstanding here, which is understandable. It's a long post.

I'm not trying to specify shard number in bulk write.
I am attempting to group documents to similar shards within the same bulkwrite task.
bulkwrite_id1(s) -> only documents to shard # 2,4,6,7,9
bulkwrite_id2(s) -> only documents to shard # 1,3,5,8,10

instead of
bulkwrite task(s) -> documents to all shard # 1,2,3,4,5,6,7,8,9,10

I don't want to go the routing path cuz that just shifts the complexity of my scaling to other part of the system. The bulkWrite sometimes would be mixture of new and updated documents. It seems to just opening up a can of worm going down this path.
Plus I can't be caching all updated documents forever. When my application restarts, I need to read a document to update. I still need to know which shard it resides.
I really don't want to come up with my own shard assignment algorithm.

Is there a reason why exposing shard number in the payload of a document is not good?
You guys already expose this data in search with "explain": true flag.
Can this be added to direct access of a document as well?

Christian_Dahlqvist · July 22, 2022, 7:16pm

What you are describing sounds exactly the same as creating 2 indices with 5 primary shards each and know which index specific documenst go to. Querying a single index of 100 shards is basically the same as querying 20 indices of 5 primary shards each.

linkerc · July 22, 2022, 8:48pm

You are correct technically. But I need to come up with a partition algorithm to split seemingly same type of data between 2 indices. That means I have to take on the extra work of balancing documents between 2 indices.
See how I ended up reinventing something's is already build in to ES?
If I'm going to do that, I would rather use the strategy of splitting cluster. Managing 2/multiple clusters is something we will ended up doing anyway. But I am hoping to do this much further down the road in many years.

Again, is there any reason why shard number couldn't be provided to the document payload?
I don't think you need to rewrite code for this at all. It's simply exposing a static data you already have.

warkolm · July 22, 2022, 11:19pm

It really sounds like you are trying to reinvent the wheel to me, the best plan here is to wait for a response to your GitHub issue.

It might also be worth starting a new topic with some of the details you are seeing with your bulk issues so they can be addressed.

stephenb · July 23, 2022, 12:23am

Edited ... bowing out

Christian_Dahlqvist · July 23, 2022, 1:14pm

I suspect this type of change might break backwards compatibility and potentially affect client libraries. If that is the case I would expect the change to need to be important to warrant implementation, and I do not see it here as you can work around the issue by using existing featuires like multiple indices and/or routing.

linkerc · July 25, 2022, 6:17pm

I posted this question first. Nobody seems to have an answered and I couldn't google any useful results caused me to open the feature request.
So I'm just explaining the reason for my desire to have such info since folks asked.
And you are correct. I do not wish to reinvent ES features that work very well. I just want an integer.

linkerc · July 25, 2022, 6:18pm

I don't think backward compatibility should be an issue. Just port the flag "explain": true to direct document query will suffice. It was never available.

system · August 22, 2022, 6:18pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
How documents is retrieved if we are not using _routing Elasticsearch	4	233	May 2, 2022
How to get shard id from document Id? Elasticsearch	2	2027	December 9, 2019
Elasticsearch how to figure out the shard number with the specified routing? Elasticsearch	5	964	July 5, 2017
How does Elasticsearch map Integer doc IDs to shards Elasticsearch	8	1231	February 14, 2021
GET _doc request not working as expected with routing aliases Elasticsearch	4	586	October 24, 2021

Is there an easy way to get the shard of a document?

Related topics