Managing large indices

I have a question about how to manage large indices. Our customer wants to keep data readable for up to 10 years. The issue is that our index keeps growing; it is about 4 TB now, and we are seeing performance degrade. How can we manage this large index? There is no hot-warm-cold architecture deployed.

To be able to provide a better answer here, more context is probably needed on the use case and the type(s) of data being stored in the index.

Some possible generic solutions:

  • If the index is storing time-series data (for example metrics or logs), then you could use something like data streams and/or index lifecycle management (ILM) to break up the data into smaller, more manageable indices.
  • If the data isn't time-series data, another option could be to look at the data and see if it can be broken apart by another method. Maybe there are multiple different document structures in the index, and each can be moved to its own dedicated index.
  • If neither of the above is an option, you could try to preemptively split the index into a more appropriate number of shards for a period of time.
    • Example: You state that the index is currently ~4 TB. Given Elastic's current shard size recommendation (roughly 10-50 GB per shard), your index should have anywhere from 80 to 400 shards to support the current amount of data. If you know that you'll write X GB/TB in a Y time range, you can split the index ahead of time so that it already has enough shards for the data you expect. The goal of a method like this is to reduce the number of times in a given time range that you'd need to perform maintenance on the index, and thus reduce the downtime of the index.
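
To make the shard arithmetic in the example above concrete, here is a minimal sketch. It assumes decimal units (1 TB = 1000 GB) and a 10-50 GB per-shard range; the function name `shard_count_range` is made up for illustration and is not part of any Elasticsearch client.

```python
import math

GB_PER_TB = 1000  # assumption: decimal units, which matches the 80-400 figure


def shard_count_range(index_size_tb, min_shard_gb=10, max_shard_gb=50):
    """Return (fewest, most) primary shards that keep every shard
    between min_shard_gb and max_shard_gb."""
    size_gb = index_size_tb * GB_PER_TB
    fewest = math.ceil(size_gb / max_shard_gb)   # all shards at the upper limit
    most = math.floor(size_gb / min_shard_gb)    # all shards at the lower limit
    return fewest, most


if __name__ == "__main__":
    print(shard_count_range(4))  # current index size from the question
    print(shard_count_range(8))  # hypothetical future size, to plan a split ahead of time
```

If you go the split route, keep in mind that the split index API only accepts a target shard count that is a multiple of the current one, so the number you pick from this range has to respect that.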

Does it need to have 10 years of searchable data, or can you keep snapshots and restore them if needed?

What is the date range of those 4 TB of data? Six months? One year? One month? Is your data time-series based?

You need to provide more context.

Thank you for your response. This data is not time-series data. My question was how to manage the index. I am fine with splitting the index. However, how can we delete old data after 10 years when everything is indexed into the same index? The other problem is that the index keeps growing and is really difficult to manage. One thing I can think of is rolling over the index based on size: for example, roll over when the index reaches 1 TB, then delete each index after 10 years. Can you help me understand whether this approach would work?
Thank you,

Whether or not a rollover via ILM would work again depends on the use case. If you don't need to update the data in the documents, then you can simply use ILM to roll over the index after a specific size or age. If you do need to update documents, then ILM might not work, as only the "lead" index from ILM can be marked as writable (meaning that if you need to update documents, you will need an additional process to update them).
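
As a rough sketch of the rollover-then-delete idea, the snippet below builds the JSON body for an ILM policy that rolls over at 1 TB and deletes each index 10 years later. The helper name `build_ilm_policy` is hypothetical; the 1 TB and 10-year values come from the question above, and 10 years is written as `3650d` on the assumption that ILM ages are given in days rather than years.

```python
import json


def build_ilm_policy(rollover_max_size="1tb", delete_after="3650d"):
    """Build the request body for PUT _ilm/policy/<policy-name>."""
    return {
        "policy": {
            "phases": {
                # Hot phase: roll over to a new index once the size limit is hit.
                "hot": {
                    "actions": {"rollover": {"max_size": rollover_max_size}}
                },
                # Delete phase: remove the index once it is old enough.
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ilm_policy(), indent=2))
```

Two caveats: for rolled-over indices the delete age is counted from the rollover, so the oldest documents in each index live somewhat longer than 10 years, and the policy still has to be attached to an index template with a rollover alias (or a data stream) before it does anything.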

Another option is to use the delete by query API (Delete by query API | Elasticsearch Guide [8.4] | Elastic), where you would run an occasional query against the index that deletes any documents older than 10 years. (For this to work, you'd need a field in your documents that you can filter on.)
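
A minimal sketch of what such a delete by query body could look like, assuming the documents carry a date field. The field name `indexed_at` and the helper `build_delete_by_query` are placeholders for illustration; substitute whatever date field your mapping actually has.

```python
import json


def build_delete_by_query(date_field, retention="10y"):
    """Build the request body for POST /<index>/_delete_by_query.

    Matches every document whose date_field is older than the retention
    period, using Elasticsearch date math (for example "now-10y").
    """
    return {
        "query": {
            "range": {
                date_field: {"lt": "now-" + retention}
            }
        }
    }


if __name__ == "__main__":
    # "indexed_at" is a hypothetical field name.
    print(json.dumps(build_delete_by_query("indexed_at"), indent=2))
```

It is probably worth running the same range query through a plain search or count first to confirm that it only matches the documents you expect to lose.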


As @BenB196 said, you will need to use some kind of rollover and delete by query to remove old documents.

But the main issue is that your use case is not clear.

You said you have an index with about 4 TB of data, but what time period do the documents in this index cover? How many months of data does it hold? It is not clear what you have in this index, since you said it is not a time-series index.

Also, how will you know whether a document is more than 10 years old? To do that you need a date field, and if you have a date field, maybe you can split your index into monthly or yearly indices based on it to make the data easier to manage.
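
To illustrate the monthly or yearly split, here is a small sketch that derives a target index name from a document's date. The prefix `records`, the naming pattern, and the function `index_for` are all assumptions made for this example, not an Elasticsearch convention you are required to follow.

```python
from datetime import date


def index_for(doc_date, prefix="records", granularity="month"):
    """Return the name of the index a document with this date belongs to."""
    if granularity == "year":
        return "{}-{:%Y}".format(prefix, doc_date)
    if granularity == "month":
        return "{}-{:%Y.%m}".format(prefix, doc_date)
    raise ValueError("granularity must be 'month' or 'year'")


if __name__ == "__main__":
    d = date(2022, 9, 14)
    print(index_for(d))                      # monthly index name
    print(index_for(d, granularity="year"))  # yearly index name
```

With names like these, dropping data older than 10 years becomes a matter of deleting whole indices instead of deleting individual documents.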

Another issue is the growth rate of your index: what is the average daily volume you ingest? Depending on that size, you will need to keep adding more and more nodes to your cluster to hold those 10 years of data, and it will become more and more expensive.
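
A quick back-of-the-envelope calculation shows why the growth rate matters. The sketch below is only an estimate under stated assumptions: a constant daily ingest, 365-day years, decimal units, and one replica per primary. The 5 GB/day figure is invented for the example, since the thread does not give the real number.

```python
def projected_storage_tb(daily_gb, retention_years=10, replicas=1):
    """Estimate total disk needed (in TB) to retain retention_years of data."""
    days = retention_years * 365
    primary_gb = daily_gb * days              # primary shard data only
    total_gb = primary_gb * (1 + replicas)    # each replica is a full copy
    return total_gb / 1000


if __name__ == "__main__":
    # Hypothetical ingest rate of 5 GB per day.
    print(projected_storage_tb(5))
```

Plugging in your real daily volume gives a first idea of how many data nodes the cluster will eventually need.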

If you provide more context about your use case and what your data looks like, maybe people here can help you better.
