Shard Optimization - Size vs. Count

Hi,

We are looking to optimize the search rate and snapshot rate for two of our large indices. While we are exploring hardware improvements (I know the recommendation is to use locally attached SSDs), I am hoping to discuss the benefits, if any, of adjusting sharding for these indices here.

TL;DR Questions

  • Can any of the issues we're experiencing be resolved (or even just improved) by adjusting the shard counts for these indices?
    • If so, are there any thoughts on whether we should continue with monthly rollover, but increase the number of primary shards (5 to 10, for example)? Or should we look at doing daily rollovers, one shard per day?
    • If not, what recommendations or next steps do we have for mitigating these issues as much as possible?

Background:

  • We are running our environment as VMs with network-attached storage via iSCSI.
  • We have 3 master nodes (16GB RAM each) and 15 data nodes (2TB storage; 32GB RAM each; 8GB JVM Heap).
  • Both indices roll over every month (we do not have ILM policies to roll over based on size, but are open to this).
  • Each index has 5 primary shards and 5 replica shards (and we would like to maintain at least 1 replica shard per primary).
  • The doc counts and storage per index (below) are what is listed in the Stack Management view, so I believe that is the total storage required for both the primary and replica shards.

Index Sizes:

  • Index A averages about 1.6TB of storage and 1.1 billion documents per month. This is an average shard size of 162GB.
  • Index B averages 1TB of storage and approx. 1 billion documents per month, with an average shard size of 103GB.

Issues Observed:

  • Querying either index is slow, and we are often limited to 7-day time periods (and no more than 30 days)
  • Relocating and initializing shards is also very slow, sometimes taking multiple hours per shard
  • Full index snapshots using S3 repos often fail with the following message:

```
UncategorizedExecutionException[Failed execution]; nested: ExecutionException[java.io.IOException: Unable to upload object [indices/<Index A ID>/4/<Snapshot ID>] using multipart upload]; nested: IOException[Unable to upload object [indices/<Index A ID>/4/<Snapshot ID>] using multipart upload]; nested: AmazonS3Exception[An internal error occurred. Please retry your upload. (Service: Amazon S3; Status Code: 500; Error Code: InternalError; Request ID: null; S3 Extended Request ID: null)]
```

Thank you in advance!

That is generally too large. I would recommend trying to keep the average shard size below 50GB. The ideal size for optimal query performance depends on the data and queries, so the best size for your use case may be somewhat lower than 50GB.

Both of these issues could be caused by the large shard size, so I would recommend increasing the number of primary shards to bring down the average size. Sticking with monthly indices should be fine, as the larger shard count will ensure the write load is distributed across all nodes. If this does not help, it is worth noting that both of these issues could also be affected by slow storage, so it would be useful to monitor disk utilisation and I/O wait to see whether the storage is likely to be the issue.
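As a sketch of how that change could be applied (the template name, index pattern, and host below are placeholders): since the shard count of an existing index can't be changed without a split or reindex, one option is to raise `number_of_shards` in the index template so that only newly created monthly indices pick up the higher count.

```shell
# Sketch (names are placeholders): raise the primary shard count for
# future monthly indices via a composable index template. Existing
# indices keep their current shard count; only indices created after
# this change are affected.
curl -X PUT "localhost:9200/_index_template/indexa-monthly" \
  -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["indexA-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 10,
      "index.number_of_replicas": 1
    }
  }
}'
```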

Thank you so much for your response @Christian_Dahlqvist ! I forgot to mention above that we are currently running a 7.17 cluster and are currently not using a hot/cold architecture due to uniform storage for the data nodes.

I did some quick math and it seems that if I made the following change:

  • Index A: 15 primary shards = 54GB / shard
  • Index B: 10 primary shards = 52GB / shard

This puts me just over the 50GB / shard goal. Moving to daily indexing (w/ 1 primary per index) would put me at the following:

  • Index A: 30-31 primary shards = 27GB / shard
  • Index B: 30-31 primary shards = 17GB / shard
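For reference, the arithmetic above can be reproduced with a quick sketch (the totals are the Stack Management figures, assumed to include one replica, so primary data is roughly half; integer division rounds down, so the results differ by a gigabyte or two from the rounded values above):

```shell
# Back-of-envelope shard sizing. Totals from Stack Management are assumed
# to include the replica copy, so primary data is roughly half the total.
total_gb_a=1600   # Index A: ~1.6TB/month (primary + replica)
total_gb_b=1030   # Index B: ~1TB/month (primary + replica)

echo "Index A, 15 primaries: $(( total_gb_a / 2 / 15 )) GB/shard"   # ~53
echo "Index A, 30 daily:     $(( total_gb_a / 2 / 30 )) GB/shard"   # ~26
echo "Index B, 10 primaries: $(( total_gb_b / 2 / 10 )) GB/shard"   # ~51
echo "Index B, 30 daily:     $(( total_gb_b / 2 / 30 )) GB/shard"   # ~17
```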

I read here and here that the more shards you have, the larger the cluster state becomes and the more shards must be queried to answer a request, which can also increase query latency. They do indicate, though, that ideally shards are between 10-50GB.

Would the daily shards be too small/inefficient? Is it better to incrementally increase shards (AKA Strategy 1) or try daily sharding first (Strategy 2)?

Is the data you are indexing immutable or do you perform updates and/or deletes?

That is not an unreasonable shard size, but it may mean that all your indexing is done by only 4 nodes (into 2 primaries and 2 replicas), which may not necessarily be optimal.

The data is immutable.

I hadn't really thought about that . . . Is this because of the 1 primary shard per index per day? So (for Index A, for example) instead of the 27GB of data being distributed across 5 nodes, it would only be sent to 1 node.

Would this have a significant impact on indexing rate for those nodes?

For many use cases there is no real reason to store data in indices that cover a specific set time period. A good solution to your problem could be to use ILM and rollover. When you use rollover you can set the maximum size and age of primary shards and have Elasticsearch create a new index behind the scenes when either limit is breached. You could make the larger index have e.g. 7 primary shards and set the max age to 1 month and the max size to 40GB. The smaller index could have the same settings. This would leave you with 28 shards to index into which should spread out well across the cluster. The generated indices will rollover at different points in time and cover different time periods, but you would be able to control the shard size and count well.
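Assuming the policy name and host below as placeholders, such a policy could look roughly like this on a 7.x cluster (rollover has supported `max_primary_shard_size` since 7.13):

```shell
# Sketch (policy name is a placeholder): roll over when any primary
# shard reaches 40GB or the index is a month old, whichever comes first.
curl -X PUT "localhost:9200/_ilm/policy/monthly-40gb" \
  -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "30d",
            "max_primary_shard_size": "40gb"
          }
        }
      }
    }
  }
}'
```

The same policy can be attached to both indices via their index templates, since the limits apply per index.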

The main reason we historically have not automated with ILM is the difficulty of automating our backup process with dynamically named indices.

We have a script that will automatically update the SLM policies with the names of the indices to backup. This process is easier if we have predictable names ("indexA-yyyy-mm") vs dynamic and variable ones ("indexA-X" where X is practically any number and you can't predict when "indexA-3" will be created vs. "indexA-4").

If there is another way to automate the updating of an SLM to ensure we do not miss any indices, I'd be happy to look into it. It's definitely possible to automate via our current process with ILM rollover, but the logic is more complicated and would rely on additional API calls to the cluster.

Why not use indexA-* in the snapshot policy? Not sure I understand why you update your snapshot policies with the individual index names when you can use an index pattern.
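For illustration (the policy name, schedule, and repository below are placeholders), a single wildcard-based SLM policy covers every index matching the pattern, including rollover-generated ones, so it never needs updating as new indices appear:

```shell
# Sketch (names are placeholders): snapshot everything matching the
# pattern on a schedule; new rollover indices are picked up automatically.
curl -X PUT "localhost:9200/_slm/policy/daily-indexa" \
  -H 'Content-Type: application/json' -d'
{
  "schedule": "0 30 1 * * ?",
  "name": "<indexa-snap-{now/d}>",
  "repository": "my_s3_repo",
  "config": {
    "indices": ["indexA-*"]
  }
}'
```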


We use wildcarding in our incremental policy but have a separate policy to handle full index backups. The motivation for this is ease of replicating current state in case of DR. In the past, restoring multiple incremental snapshots for the same index (let's say indexA-3) required you to rename the index with each restore (i.e., indexA-3-1, indexA-3-2, etc.). Is this still the case?

I know there are ways to combine these indices later on, but at the time we decided it would be less resource-intensive and time-consuming in a DR situation to have a full index backups, especially given the immutability of the data after the month was over.

What do you mean by this? I'm not sure what full index backup means in your case. Do you have multiple policies with the same index pattern?

This is how snapshots work: you can't restore a snapshot if an index with the same name already exists in the cluster, so you need to use another name for the restored indices or remove the current ones from the cluster.

It is not clear what your use case is or what you mean by full index backup; a snapshot is already a backup of the index, and using ILM saves a lot of time in management.

Yes. My understanding was that restoring from a single snapshot will restore only the changes to indexA-* since the last incremental snapshot. Basically, each snapshot only captures changes since the last one. No single snapshot will restore all of the indexA-* data in that case. Maybe this isn't the case, though.

In our case, we determined that we wanted that capability. The "full index" snapshot policy takes a snapshot of indexA when it is no longer being written to and restoring from this single snapshot will restore all of the data in indexA.

Again, maybe my understanding is off, but it sounds like what you're describing is why we chose to go with two different policies. For example, in the case where we need to restore 1 month's data:

  • If a snapshot policy runs once a day, restoring 30 days' live data means restoring 30 snapshots and creating 30 indices with 30 different names
  • By contrast, with monthly indexing and our full index snapshots, we would only need to restore a single snapshot, and indexA-* would match the old cluster in both number of matching indices and their names.

I think you have misunderstood how snapshots in Elasticsearch work. Every snapshot is a full snapshot of all the data in the cluster at the time it is initiated. Snapshots work at the segment level, and the only incremental aspect is that each snapshot copies only new segments that have not been snapshotted before. For older indices that are no longer written to, no new segments may be created, and the newest snapshot may rely on segments copied a long time ago.

As each of your indices covers a long time period, the segments within it are likely to change over a long period of time. From this point of view it may be better to have each index cover a shorter time period, e.g. by using rollover with a reasonably small number of primary shards. This way you have a smaller number of segments changing at any point in time, which may reduce the size of your snapshot repository.

When you restore an index from a snapshot it must not exist in the cluster already so if you have a corrupted index it is common to completely remove it before restoring that particular index from a snapshot.
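A minimal sketch of that flow (the index, repository, and snapshot names are placeholders): delete the corrupted copy first, then restore just that one index from the snapshot.

```shell
# Sketch (names are placeholders): the restore fails if an index with
# the same name exists, so remove the live copy before restoring.
curl -X DELETE "localhost:9200/indexA-2024-01"

# Restore only that index from the chosen snapshot.
curl -X POST "localhost:9200/_snapshot/my_s3_repo/indexa-snap-2024.02.01/_restore" \
  -H 'Content-Type: application/json' -d'
{
  "indices": "indexA-2024-01"
}'
```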

It is also worth noting the difference between index patterns and indices. I think that you are referring to an index pattern that consists of a number of monthly indices when you talk about Index A and Index B. Please correct me if I am wrong.

Thank you for that clarification! I was completely misunderstanding how the snapshots worked, but I'm glad I asked :slight_smile: This will help us simplify the SLM policies a lot.

Yes and no. My initial post describing Index A and Index B truly was describing the indices themselves; however, I tried to switch to the index pattern naming when discussing how the ILM rollover process would work. I think the ILM strategy is definitely the way to go - thank you for your patience and help understanding how it would work!

Now to go build some policies . . .