It's for our application. This info is already known, so it seems weird that we would need to calculate it again by reading the source code. And the code could change in future releases.
I have submitted a feature request: Provide an access function to map _id to shard number · Issue #88661 · elastic/elasticsearch · GitHub.
Not sure how much traction it would get.
I really need this info. Please help push for exposing this data if possible.
Ideally, I wish to have a function that simply tells me which shard a doc would reside on if I give it the _id (or whatever the routing field value is).
I still do not understand why you would ever need this. The GitHub issue mentions some application-level optimisation, but it is not clear what that is. Could you describe what you are looking to do and the rationale behind needing this?
What I'm asking for is #2. It seems very easy to provide this, and I don't see the downside. Maybe I'm overlooking something that makes it bad to expose such info.
As for #1, I can see why people are interested. It's complicated, since I'm trying to prove a theory. If my theory is correct, it could fail-proof our scaling going forward.
I'll try not to write an essay here, so please bear with me while I just summarize why I want this.
This is tied to a question I posted here: Question regarding size of bulk action.
Say I have a cluster of 100 data nodes, and my biggest index also has 100 shards. How efficient is a bulk write?
If I can know the final destination shard, I can optimize my bulk writes to target, say, just 5 shards each and have 20 parallel bulk writing tasks.
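To make this concrete, here is a minimal sketch of what I would have to do on the client side today, assuming the routing formula described in the _routing docs (shard_num = (hash(_routing) % num_routing_shards) / routing_factor) plus a reimplementation of the internal Murmur3 hash. All names, the mmh3 library, and the hash details below are my own assumptions, and this is exactly the fragile reverse engineering I'd rather not own, since the internals can change between releases:

```python
# Sketch only: a client-side guess of a document's destination shard, assuming
# the routing formula from the _routing docs and that the internal hash is
# Murmur3 x86 32-bit (seed 0) over the UTF-16LE bytes of the routing string.
# Both assumptions need to be verified against the Elasticsearch version in use.
from collections import defaultdict

import mmh3  # pip install mmh3


def guess_shard(routing: str, num_primary_shards: int, num_routing_shards: int) -> int:
    """Best-effort shard_num = (hash(_routing) % num_routing_shards) / routing_factor."""
    # num_routing_shards often equals the primary shard count, but newer versions
    # may pick a larger default to allow _split, so pass the real value for your index.
    routing_factor = num_routing_shards // num_primary_shards
    h = mmh3.hash(routing.encode("utf-16-le"), 0, signed=True)  # signed 32-bit, like Java
    return (h % num_routing_shards) // routing_factor


def group_for_bulk_tasks(doc_ids: list[str], num_primary: int, num_routing: int) -> dict[int, list[str]]:
    """Bucket document _ids by guessed shard so each bulk task only touches a few shards."""
    buckets: dict[int, list[str]] = defaultdict(list)
    for doc_id in doc_ids:
        buckets[guess_shard(doc_id, num_primary, num_routing)].append(doc_id)
    return buckets
```

If ES simply returned this integer with the document, none of the assumptions above would have to live in my code.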
As to @warkojm's question, I don't want to be responsible for shard balancing. That's not the point. I primarily need the shard number for updating a document. So when I read a doc, it would be nice to also include the shard # in the payload.
For document insertion, all I need is to randomly generate the _id, and the new API I'm requesting could tell me the destination shard. Then I can forward the doc to the appropriate ES writing task, etc.
So my assumption is that when our cluster grows beyond 50 data nodes, bulk writes become very inefficient within the ES cluster: there are too many sockets writing small payloads, and insertion queue overflow will become an issue. We have a few large indices whose shard count matches the number of data nodes.
This is also another question of mine: how do folks run clusters with multiple hundreds of data nodes? I can see a large ES cluster serving many low-shard-count indices, but not large indices.
We started to encounter queue overflow several months ago. We ended up solving it by growing the bulk write payload size, but the downside is that the writer task's RAM comes under pressure (OOM exceptions). So there's a balancing act we are doing here. Eventually, there's a limit on how large a payload a bulk write can support.
Am I wrong here?
What I really want is for ES to handle this internally.
If the ingest-only nodes could buffer the bulk writes on local disk, sorted by destination shard, then I wouldn't need to worry about this. The only downside I can think of is increased indexing time. Data wouldn't get lost, since it's stored transiently on the ingest nodes. I'm guessing that's too much to ask for; therefore, exposing the shard number seems like a good alternative, and it seems attainable.
As far as I recall, the bulk API does not allow you to specify a shard.
If you have immutable documents and want to optimise bulk writes into indices with a large number of primary shards, you can generate a random routing string for the whole batch and set this for all documents in the bulk request. This ensures they will all go to a single shard. As long as the generated routing string is random, the distribution of requests across shards should be uniform. The drawback of this is that you need to know the routing value if you want to perform direct updates or deletes, so you would need to store this externally.
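A rough sketch of that approach (the URL, index name and helper below are placeholders, not a definitive implementation):

```python
# Sketch: one random routing value per batch, so every document in the batch is
# routed to the same shard. Keep the returned routing value if you ever need to
# update or delete these documents directly.
import json
import uuid

import requests  # any HTTP client works; the body goes to the _bulk endpoint

ES_URL = "http://localhost:9200"  # placeholder
INDEX = "my-index"                # placeholder


def bulk_index_batch(docs: list[dict]) -> str:
    routing = uuid.uuid4().hex  # random per batch -> roughly uniform spread across shards
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": INDEX, "routing": routing}}))
        lines.append(json.dumps(doc))
    resp = requests.post(
        f"{ES_URL}/_bulk",
        data="\n".join(lines) + "\n",
        headers={"Content-Type": "application/x-ndjson"},
    )
    resp.raise_for_status()
    return routing
```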
You could potentially also calculate a number of predefined routing strings that resolve to specific shards, associate these with data partitions or processes, and use them when writing or updating documents. It sounds like this existing functionality might do something similar to what you described.
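One way to find such predefined routing strings without relying on Elasticsearch internals is to ask the cluster itself through the _search_shards API; again only a sketch with placeholder names:

```python
# Sketch: brute-force one routing string per shard by asking the cluster
# (via the _search_shards API) where each candidate routing value would land.
# The resulting strings can then be reused for whole data partitions or processes.
import uuid

import requests

ES_URL = "http://localhost:9200"  # placeholder
INDEX = "my-index"                # placeholder


def shard_for_routing(routing: str) -> int:
    resp = requests.get(f"{ES_URL}/{INDEX}/_search_shards", params={"routing": routing})
    resp.raise_for_status()
    # With a routing value, the response contains a single shard group; every
    # copy in that group (primary and replicas) carries the same shard number.
    return resp.json()["shards"][0][0]["shard"]


def routing_string_per_shard(num_primary_shards: int) -> dict[int, str]:
    table: dict[int, str] = {}
    while len(table) < num_primary_shards:
        candidate = uuid.uuid4().hex
        table.setdefault(shard_for_routing(candidate), candidate)
    return table
```

The table only needs to be computed once per index and can then be associated with your data partitions or writer processes.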
If you are partitioning data, you can also send it to multiple different indices, which would allow you to group writes and updates without using routing.
Most very large clusters I have come across are used for storing immutable data in multiple time-based indices, which tends to scale well and is likely different from your use case.
I think there's a misunderstanding here, which is understandable. It's a long post.
I'm not trying to specify a shard number in the bulk write.
I am attempting to group documents destined for the same shards within the same bulk write task, for example:
bulkwrite_id1(s) -> only documents for shards 2, 4, 6, 7, 9
bulkwrite_id2(s) -> only documents for shards 1, 3, 5, 8, 10

instead of

bulkwrite task(s) -> documents for all shards 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
I don't want to go down the routing path because that just shifts the complexity of my scaling to another part of the system. The bulk write would sometimes be a mixture of new and updated documents. Going down this path seems to just open up a can of worms.
Plus, I can't cache all updated documents forever. When my application restarts and I need to read a document in order to update it, I still need to know which shard it resides on.
I really don't want to come up with my own shard assignment algorithm.
Is there a reason why exposing the shard number in the payload of a document is not a good idea?
You already expose this data in search via the "explain": true flag.
Can this be added to direct document access as well?
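For reference, this is the existing behavior I mean; a small sketch against a placeholder local cluster and index:

```python
# Sketch: with "explain": true, each search hit already includes a "_shard"
# field (e.g. "[my-index][3]") alongside "_node" and "_explanation".
import requests

resp = requests.post(
    "http://localhost:9200/my-index/_search",  # placeholder cluster and index
    json={"explain": True, "query": {"ids": {"values": ["doc-42"]}}},
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_shard"])
```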
What you are describing sounds exactly the same as creating 2 indices with 5 primary shards each and knowing which index specific documents go to. Querying a single index of 100 shards is basically the same as querying 20 indices with 5 primary shards each.
Technically, you are correct. But I would need to come up with a partitioning algorithm to split what is seemingly the same type of data between 2 indices. That means I have to take on the extra work of balancing documents between 2 indices.
See how I end up reinventing something that is already built into ES?
If I'm going to do that, I would rather use the strategy of splitting the cluster. Managing 2 (or more) clusters is something we will end up doing anyway, but I am hoping to do that much further down the road, many years from now.
Again, is there any reason why the shard number couldn't be provided in the document payload?
I don't think you need to rewrite code for this at all. It's simply exposing static data you already have.
I suspect this type of change might break backwards compatibility and potentially affect client libraries. If that is the case, I would expect the change to need to be important enough to warrant implementation, and I do not see that here, as you can work around the issue by using existing features like multiple indices and/or routing.
I posted this question first. Nobody seemed to have an answer, and I couldn't google any useful results, which caused me to open the feature request.
So I'm just explaining why I want this info, since folks asked.
And you are correct. I do not wish to reinvent ES features that work very well. I just want an integer.
I don't think backward compatibility should be an issue. Just porting the "explain": true flag to the direct document query would suffice; it was never available there before.