What does the _split API do internally when creating a new index?

techytushar · November 5, 2024, 6:55am

I have a few questions about the internal workings of the split API:

1. The documentation says that the split "Hashes all documents again, after low level files are created, to delete documents that belong to a different shard." What is actually happening here? When monitoring the split process, I see the number of segments increase, and then segment merging removes the extra documents from the shards, but the _id field of the documents stays the same. So what does "Hashes all documents again" mean here?
2. There is a section on "Why Incremental resharding is not supported?" but it doesn't explain what advantage we get from only allowing splits into a multiple of the source index's shard count. How is adding a single shard different from splitting into a multiple of the number of shards in the source index?
3. Segment merging is faster when the number of shards is higher (say 16) but takes longer when the number of shards is lower (say 2 or 3). Also, when the number of shards is low (say 2 or 4), some documents are still marked for deletion even after merging is complete, whereas with a higher number of shards the number of documents to be deleted always comes down to 0 in the _cat/indices API.
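
To make the first two questions concrete, here is a minimal sketch of the routing arithmetic as I understand it from the _routing field documentation: shard_num = (hash(_routing) % num_routing_shards) / routing_factor, with routing_factor = num_routing_shards / num_primary_shards. The hash function (CRC32 standing in for Elasticsearch's Murmur3), the value of 8 routing shards, the document IDs, and all function names are assumptions for illustration only. It is not the actual split implementation, just a model of why the _id never changes while the shard number computed from it does.

```python
import zlib

def stand_in_hash(routing: str) -> int:
    # Assumption: CRC32 stands in for the Murmur3 hash that Elasticsearch
    # applies to the routing value (the _id by default).
    return zlib.crc32(routing.encode("utf-8"))

def shard_for(routing: str, num_primary_shards: int, num_routing_shards: int) -> int:
    # Formula from the _routing docs; integer division models the "/" there.
    routing_factor = num_routing_shards // num_primary_shards
    return (stand_in_hash(routing) % num_routing_shards) // routing_factor

def naive_shard_for(routing: str, num_shards: int) -> int:
    # Plain "hash mod shard count", shown only for comparison.
    return stand_in_hash(routing) % num_shards

ROUTING_SHARDS = 8  # illustrative value, not the Elasticsearch default
ids = [f"doc-{i}" for i in range(1000)]  # hypothetical document IDs

# Splitting 2 -> 4: every document lands in a child of its old shard,
# i.e. source shard s only feeds target shards 2s and 2s+1.
stays_in_family = all(
    shard_for(d, 4, ROUTING_SHARDS) // 2 == shard_for(d, 2, ROUTING_SHARDS)
    for d in ids
)

# "Hashes all documents again": target shard 0 starts with a full copy of
# source shard 0, recomputes the shard number for each _id, and marks the
# documents that now belong elsewhere as deleted. The _id is untouched.
source_shard_0 = [d for d in ids if shard_for(d, 2, ROUTING_SHARDS) == 0]
kept_by_target_0 = [d for d in source_shard_0 if shard_for(d, 4, ROUTING_SHARDS) == 0]
deleted_by_target_0 = [d for d in source_shard_0 if shard_for(d, 4, ROUTING_SHARDS) != 0]

# Adding a single shard (2 -> 3) with naive modulo routing: documents move
# between arbitrary shards, so no shard can be rebuilt from one source shard.
moved_naive = sum(1 for d in ids if naive_shard_for(d, 2) != naive_shard_for(d, 3))

print("split 2->4 keeps docs within child shards:", stays_in_family)
print("target shard 0 keeps", len(kept_by_target_0), "and deletes", len(deleted_by_target_0))
print("naive 2->3 relocates", moved_naive, "of", len(ids), "documents")
```

If this model is right, a split by a factor only needs each target shard to look at the files of one source shard and drop what it no longer owns, whereas adding one shard would mean moving documents between unrelated shards.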
