Reindex vs Split index speeds

Daithi_O_Conchobhair · December 8, 2022, 12:01pm

Hi there,

I have a relatively large index, 1.1TB, which currently has one shard due to a misconfiguration. I would like to keep my shard size around 50GB. I have two options, I guess:

1. Split the index into a new index with the correct shard count.
2. Reindex the data into a new index with the correct shard count.

I can only think of one 'gotcha' with the split index which is that the operation needs to take place on the one node - which would mean a hefty increase in disk size to accommodate the operation:

The node handling the split process must have sufficient free disk space to accommodate a second copy of the existing index.

1. Is there a performance consideration to be made also between the two? Or are there any other considerations I should take into account?
1.1 If there are no other considerations, creating a new index with the correct shard settings and reindex into that seems to make most sense here, given the size of the initial index.
2. Is there an upper limit on index size/a performance degradation over a certain size? So long as we keep the shards to 50GB could I have a 100 Shard index with 5TB data? Or would it be more performant to have 100 indexes each with 1 primary shard of 50GB?

Thanks!

Daithi_O_Conchobhair · December 8, 2022, 12:07pm

Additionally - if restoring a large index from a snapshot, can I use the modify index settings to modify static index settings?

static: They can only be set at index creation time or on a closed index.

Are indices closed while they are being restored? If so, could I update the number of shards at restoration time?

Pratik_Wadodkar · December 9, 2022, 8:04am

Hi Daithi,
It's better to keep shard size around 50GB for best performance. If an index has shards in the TB range, it becomes quite difficult to handle when you have to take a snapshot or move it from one data tier to another.
Split vs reindex:
If you go with reindexing, it will take a long time. For example, reindexing 1GB of data takes around 4-5 minutes, so with TBs of data it is going to take days. On the other hand, if you go with the split API, it will quickly split the index into the number of primary shards that you provide in the split request.
Also, when you use the split API, make sure you have a good amount of storage. In the beginning the index tries to allocate all the shards on different nodes and then allocate the data, so during this process you might see your storage usage increase by maybe 3-4 times, but it will come back to its original state after some time.
Here are the links for reference
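
As a rough check of the reindex estimate in this post, the arithmetic can be sketched as follows. It takes the 4-5 minutes per GB figure quoted above at face value and assumes the rate stays constant, which is a simplification; real throughput depends on mappings, hardware, and cluster load. The function name is made up for this illustration.

```python
def reindex_estimate_hours(index_size_gb: float, minutes_per_gb: float) -> float:
    """Estimate reindex wall-clock time in hours at a constant per-GB rate."""
    return index_size_gb * minutes_per_gb / 60.0


# 1.1TB is taken as 1100GB; 4 and 5 minutes per GB are the figures quoted above.
low = reindex_estimate_hours(1100, 4)
high = reindex_estimate_hours(1100, 5)
print(f"{low:.0f} to {high:.0f} hours ({low / 24:.1f} to {high / 24:.1f} days)")
# prints "73 to 92 hours (3.1 to 3.8 days)"
```

At that assumed rate the 1.1TB index would take a little over three days, which matches the "days" estimate.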

DavidTurner · December 9, 2022, 10:12am

As long as you have the disk space, split will give you a usable index much quicker. It works by hard-linking the underlying files into all the new shard copies which is almost instantaneous, and then marking most of the docs in each shard copy as deleted. That means the initial split doesn't take much more disk space in most cases, but then merges will be triggered to rewrite the data in the background and it's those followup merges that take up space.
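
For illustration, the two requests a split typically involves can be sketched like this. The snippet only builds the requests and does not send anything. The index names and the target of 32 shards are hypothetical, and the helper `split_requests` is invented for this example. It assumes the documented prerequisites: the source index is made read-only via `index.blocks.write` first, and the target shard count is a multiple of the source shard count. Depending on `index.number_of_routing_shards`, the allowed targets may be further restricted (with default settings, possibly to powers of two), so treat the number as a placeholder.

```python
import json


def split_requests(source: str, target: str, target_shards: int, source_shards: int = 1):
    """Return (method, path, body) tuples for a split; nothing is sent."""
    if target_shards <= source_shards or target_shards % source_shards != 0:
        raise ValueError("target shard count must be a larger multiple of the source count")
    return [
        # Step 1: block writes on the source index so it can be split.
        ("PUT", f"/{source}/_settings", {"settings": {"index.blocks.write": True}}),
        # Step 2: split into a new index with the requested primary shard count.
        ("POST", f"/{source}/_split/{target}",
         {"settings": {"index.number_of_shards": target_shards}}),
    ]


# Hypothetical names and shard count, for illustration only.
for method, path, body in split_requests("my-index", "my-index-split", 32):
    print(method, path, json.dumps(body))
```

The background merges described above happen after the second request returns, so disk usage should be watched for some time after the split itself completes.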

Reindex will take longer and any data you write while the reindex is running likely won't be copied over.
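
For comparison, a reindex into a correctly sharded index could look roughly like the sketch below. It only builds the request bodies and sends nothing. The index names and shard count are hypothetical, and the helper `reindex_requests` is invented for this example. `wait_for_completion=false` is used so that a long-running reindex is started as a background task rather than holding the HTTP connection open.

```python
import json


def reindex_requests(source: str, dest: str, dest_shards: int):
    """Return (method, path, body) tuples for a reindex; nothing is sent."""
    if dest_shards < 1:
        raise ValueError("destination needs at least one primary shard")
    return [
        # Step 1: create the destination index with the desired shard count.
        ("PUT", f"/{dest}", {"settings": {"index": {"number_of_shards": dest_shards}}}),
        # Step 2: copy documents across as a background task.
        ("POST", "/_reindex?wait_for_completion=false",
         {"source": {"index": source}, "dest": {"index": dest}}),
    ]


# Hypothetical names and shard count, for illustration only.
for method, path, body in reindex_requests("my-index", "my-index-v2", 23):
    print(method, path, json.dumps(body))
```

As noted above, documents written to the source while the task runs will likely be missing from the destination, so writes would need to be paused or replayed afterwards.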

You cannot split an index while restoring from a snapshot; indeed you cannot change the number of shards on any index, closed or otherwise.

Is there an upper limit on index size/a performance degradation over a certain size?

There's a hard limit of ~2 billion docs in each shard, and individual searches do not parallelise within each shard so you might see better performance with more shards. That's not really a function of shard size, just some other things to consider. Larger shards are just kind of unmanageable, they take a long time to copy around the cluster etc.
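
A minimal sketch of how the two constraints mentioned here (target shard size and the per-shard document limit) might be combined into a shard count. The 2,000,000,000 constant is a round stand-in for the "~2 billion" figure above, not the exact limit, and the document counts in the example are made up. The function name is also invented for this illustration.

```python
import math

# Round approximation of the "~2 billion docs per shard" limit mentioned above.
APPROX_MAX_DOCS_PER_SHARD = 2_000_000_000


def plan_shards(index_size_gb: float, doc_count: int, target_shard_gb: float = 50.0) -> int:
    """Smallest shard count that respects both the size target and the doc limit."""
    by_size = math.ceil(index_size_gb / target_shard_gb)
    by_docs = math.ceil(doc_count / APPROX_MAX_DOCS_PER_SHARD)
    return max(1, by_size, by_docs)


# Sizes come from the thread (1.1TB and 5TB); the doc counts are hypothetical.
print(plan_shards(1100, 3_000_000_000))   # size dominates
print(plan_shards(5000, 10_000_000_000))  # size dominates
print(plan_shards(40, 5_000_000_000))     # doc limit dominates
```

The result is only a lower bound. If the index is going to be split rather than reindexed, the count may need rounding up to a value the split API accepts.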

Daithi_O_Conchobhair · December 9, 2022, 10:12am

Hi Pratik,

Thank you so much for your input. The links you have provided are also great, thanks.

I am definitely looking to get the shard size down to 50GB a shard - my query now is whether it would be more performant to use the Shrink API or the Reindex API? I will need to double my storage for the Shrink API right? Is it quicker than Reindex?

I ask because I have another index of 1 shard that is 5TB! So while both approaches will require me to temporarily add an additional 6TB storage, if one is quicker I will go with it.

Is Shrink or Reindex API more performant?
Am I right in saying that the Shrink operation has to occur on one node, whereas the Reindex operation is spread out across the cluster?

Thanks again for taking the time to respond, very much appreciated.

Pratik_Wadodkar · December 9, 2022, 10:31am

If you want to reduce the number of primary shards of an existing index, then the shrink API will be better. You can also use the shrink action in ILM to reduce the number of primary shards. Reindexing is a time-consuming process.
For shard allocation on the node during shrink, this link will help you

Daithi_O_Conchobhair · December 9, 2022, 12:07pm

Hi David, thanks for your advice!

I will certainly make the space available and use the Split API, thanks!

Is there an upper limit on index size/a performance degradation over a certain size?

Here I am referring to index rather than shard size, i.e:

Is a 500GB index of ten primary shards of 50GB each equivalent to ten individual 50GB indices with one primary shard each (from a performance perspective)?

Thanks again!

DavidTurner · December 9, 2022, 12:41pm

Oh sorry I see now. Not really, no, at least not if your searches will all need to hit every shard either way. Sometimes there's a natural way to reorganise your data so that many searches will find no hits in many shards (e.g. separate indices by time range) and there are optimisations for this case.
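
To illustrate the time-range idea, here is a small sketch that works out which indices a query for a given date range would need to touch, assuming one index per month. The `logs-YYYY.MM` naming scheme and the helper name are hypothetical. This shows client-side targeting only; the optimisations mentioned above happen inside Elasticsearch and are not modelled here.

```python
from datetime import date


def monthly_indices(prefix: str, start: date, end: date) -> list:
    """Names of the monthly indices covering start..end inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


# Hypothetical index prefix; the range spans a year boundary.
print(monthly_indices("logs", date(2022, 11, 15), date(2023, 1, 6)))
# prints ['logs-2022.11', 'logs-2022.12', 'logs-2023.01']
```

A search limited to those three names never touches the shards of the other months, which is the kind of saving a single large index cannot offer through naming alone.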

Daithi_O_Conchobhair · December 9, 2022, 1:23pm

Great, thank you both so much for your time and advice.

