Reindex vs Split index speeds

Hi there,

I have a relatively large index, 1.1TB, which currently has one shard due to a misconfiguration. I would like to keep my shard size around 50GB. I have two options, I guess:

  1. Split the index into a new index with the correct shard count.
  2. Reindex the data into a new index with the correct shard count.

I can only think of one 'gotcha' with the split index which is that the operation needs to take place on the one node - which would mean a hefty increase in disk size to accommodate the operation:

The node handling the split process must have sufficient free disk space to accommodate a second copy of the existing index.

  1. Is there a performance consideration to be made also between the two? Or are there any other considerations I should take into account?
    1.1 If there are no other considerations, creating a new index with the correct shard settings and reindex into that seems to make most sense here, given the size of the initial index.
  2. Is there an upper limit on index size/a performance degradation over a certain size? So long as we keep the shards to 50GB could I have a 100 Shard index with 5TB data? Or would it be more performant to have 100 indexes each with 1 primary shard of 50GB?
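For the sizes mentioned above, the shard counts work out as in the rough sketch below. It assumes decimal units (1TB = 1000GB) and evenly sized shards; the function name is only illustrative.

```python
import math

def shards_needed(index_size_gb: int, target_shard_gb: int = 50) -> int:
    """Smallest shard count that keeps each shard at or below the target size."""
    return math.ceil(index_size_gb / target_shard_gb)

# 1.1TB index from the question, assuming 1TB = 1000GB
print(shards_needed(1100))  # 22
# 5TB of data at 50GB per shard
print(shards_needed(5000))  # 100
```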

Thanks!

Additionally - if restoring a large index from a snapshot, can I use the modify index settings to modify static index settings?

static: They can only be set at index creation time or on a closed index.

Are restoring indices closed? If so I could update the number of shards at restoration time?

Hi Daithi,
It's better to keep shard sizes around 50GB for best performance. If an index has shards in the TB range, it becomes quite difficult to handle when you have to take a snapshot or move it from one data tier to another.
Split vs reindex:
Reindexing takes a long time. For example, reindexing 1GB of data takes around 4-5 minutes, so for TBs of data it is going to take days. The split API, on the other hand, quickly splits the index into the number of primary shards you specify in the request.
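As a back-of-the-envelope check on that estimate, here is a small sketch applied to the 1.1TB index. The 4-5 minutes per GB rate is only the rough figure quoted above; real reindex throughput depends heavily on the cluster, mappings and document size, and the 1TB = 1000GB conversion is an assumption.

```python
def reindex_days(index_size_gb: float, minutes_per_gb: float) -> float:
    """Estimated reindex duration in days at a fixed minutes-per-GB rate."""
    return index_size_gb * minutes_per_gb / (60 * 24)

low = reindex_days(1100, 4)   # optimistic end of the quoted rate
high = reindex_days(1100, 5)  # pessimistic end of the quoted rate
print(f"{low:.1f} to {high:.1f} days")  # 3.1 to 3.8 days
```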
One more thing: before you use the split API, make sure you have a good amount of free storage. At the beginning the index tries to allocate all the shards on different nodes and then allocates the data, so during this process you might see your storage usage increase by maybe 3-4 times, but it will come back to its original level after some time.
Here are the links for reference.

As long as you have the disk space, split will give you a usable index much quicker. It works by hard-linking the underlying files into all the new shard copies which is almost instantaneous, and then marking most of the docs in each shard copy as deleted. That means the initial split doesn't take much more disk space in most cases, but then merges will be triggered to rewrite the data in the background and it's those followup merges that take up space.
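A minimal sketch of the two requests a split involves: the source index is first blocked for writes, then split into a target index. The index names and the shard count of 32 are hypothetical. The target shard count must be a multiple of the source shard count and compatible with the source's index.number_of_routing_shards, so check that before picking a number. The sketch only builds and prints the requests; it does not send them.

```python
import json

def split_requests(source: str, target: str, target_shards: int):
    """Build (method, path, body) tuples for a split; nothing is sent."""
    return [
        # The source index must be read-only before it can be split.
        ("PUT", f"/{source}/_settings",
         {"settings": {"index.blocks.write": True}}),
        # Split the source into a new index with more primary shards.
        ("POST", f"/{source}/_split/{target}",
         {"settings": {"index.number_of_shards": target_shards}}),
    ]

# Hypothetical index names and shard count, for illustration only.
reqs = split_requests("big-index", "big-index-split", 32)
for method, path, body in reqs:
    print(method, path)
    print(json.dumps(body, indent=2))
```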

Reindex will take longer and any data you write while the reindex is running likely won't be copied over.
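For comparison, a reindex needs the destination index created up front with the desired shard count, followed by the reindex call itself. The sketch below only builds the two requests; the index names and the shard count are hypothetical. Passing wait_for_completion=false makes the reindex run as a background task, which is probably what you want at this size.

```python
import json

def reindex_requests(source: str, dest: str, dest_shards: int):
    """Build (method, path, body) tuples for a reindex; nothing is sent."""
    return [
        # Create the destination with the shard count you actually want.
        ("PUT", f"/{dest}",
         {"settings": {"index": {"number_of_shards": dest_shards}}}),
        # Copy documents across as a background task.
        ("POST", "/_reindex?wait_for_completion=false",
         {"source": {"index": source}, "dest": {"index": dest}}),
    ]

# Hypothetical index names and shard count, for illustration only.
plan = reindex_requests("big-index", "big-index-v2", 22)
for method, path, body in plan:
    print(method, path)
    print(json.dumps(body, indent=2))
```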

You cannot split an index while restoring from a snapshot; indeed you cannot change the number of shards on any index, closed or otherwise.

Is there an upper limit on index size/a performance degradation over a certain size?

There's a hard limit of ~2 billion docs in each shard, and individual searches do not parallelise within each shard so you might see better performance with more shards. That's not really a function of shard size, just some other things to consider. Larger shards are just kind of unmanageable, they take a long time to copy around the cluster etc.
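To put a number on the ~2 billion figure: the limit comes from Lucene, which allows at most 2,147,483,519 documents in a single index, and each shard is one Lucene index. A rough check, assuming documents are spread evenly across shards (the document count below is made up for illustration):

```python
LUCENE_MAX_DOCS = 2_147_483_519  # Integer.MAX_VALUE - 128, per shard

def fits_in_shards(total_docs: int, shards: int) -> bool:
    """True if an even spread keeps every shard at or under the Lucene limit."""
    docs_per_shard = -(-total_docs // shards)  # ceiling division
    return docs_per_shard <= LUCENE_MAX_DOCS

# Hypothetical document count, for illustration only.
print(fits_in_shards(3_000_000_000, 1))   # False
print(fits_in_shards(3_000_000_000, 22))  # True
```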

Hi Pratik,

Thank you so much for your input. The links you have provided are also great thanks.

I am definitely looking to get the shard size down to 50GB a shard - my query now is whether it would be more performant to use the Shrink API or the Reindex API? I will need to double my storage for the Shrink API right? Is it quicker than Reindex?

I ask because I have another index of 1 shard that is 5TB! So while both approaches will require me to temporarily add an additional 6TB storage, if one is quicker I will go with it.

  1. Is Shrink or Reindex API more performant?
  2. Am I right in saying that the Shrink operation has to occur on one node, whereas the Reindex operation is spread out across the cluster?

Thanks again for taking the time to respond, very much appreciated.

If you want to reduce the number of primary shards of an existing index, then the shrink API is the better choice. You can also use the shrink action in your ILM policy to reduce the number of primary shards. Reindexing is a time-consuming process.
For shard allocation on a node during shrink, this link will help you.

Hi David thanks for your advice!

I will certainly make the space available and use the Split API, thanks!

Is there an upper limit on index size/a performance degradation over a certain size?

Here I am referring to index rather than shard size, i.e:

Is a 500GB index of ten primary shards of 50GB each equivalent to ten individual 50GB indices with one primary shard each (from a performance perspective)?

Thanks again!

Oh sorry I see now. Not really, no, at least not if your searches will all need to hit every shard either way. Sometimes there's a natural way to reorganise your data so that many searches will find no hits in many shards (e.g. separate indices by time range) and there are optimisations for this case.
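To illustrate the time-range idea: if data is written to one index per month, a search over a date range only has to touch the indices for those months. The sketch below is a client-side analogue, not the internal optimisation mentioned above, and the index naming scheme (prefix-YYYY.MM) is an assumption.

```python
from datetime import date

def monthly_indices(prefix: str, start: date, end: date) -> list[str]:
    """Names of the monthly indices that overlap the range start..end."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

# Hypothetical prefix and date range, for illustration only.
print(monthly_indices("logs", date(2023, 11, 15), date(2024, 1, 10)))
# ['logs-2023.11', 'logs-2023.12', 'logs-2024.01']
```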

Great, thank you both so much for your time and advice.
