Using ILM for huge size of indexes

Chirag_Poddar · February 25, 2023, 9:38am

Use case: We have a few indexes that hold huge amounts of data and keep growing. We need to figure out a way to optimize the search time, and we are thinking of using ILM to manage the indexes. But we are facing a few roadblocks and are hoping we can get help here.

1. Is there a way we can be notified when an index is created through ILM? Maybe through a webhook. We need this to keep a list of indexes and when they were created, so that whenever we need to update an existing document we can update it directly in its particular index rather than searching all the indexes. If not, let us know what the best way to achieve this is.
2. What's the best way to optimize the search for a time filter? We don't want to query all the indexes unnecessarily. We are thinking of keeping track of indexes along with their created time and filtering on our end, so that we search only the indexes that fall under the time filter.
We are using Elasticsearch 6.8 currently.
Also, if things are possible in the latest version, do let us know.

system · February 25, 2023, 9:38am

Elasticsearch 6.8 is EOL and no longer supported. Please upgrade ASAP.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns.)

Wave · February 25, 2023, 8:27pm

Can you provide us with some specific numbers? How many indices are there, how big is each index, and how many shards does each index have? There are recommended average shard sizes, and I want to make sure we are in the same ballpark. As far as number 2 goes, Elastic does a great job of knowing which indices contain what data, and you usually don't have to worry about tailoring your search query at the index level. Shards that are too large (in the hundreds of gigabytes) are where you usually run into trouble. With all that said, it really depends on the data and the cluster, so without specifics it is hard to give recommendations. Also, like the bot said, you should upgrade the cluster at your earliest convenience. There have been lots of improvements from 6.8 to 8.6 that you will probably make use of.

Chirag_Poddar · February 26, 2023, 5:29am

We have 2 indexes above 100 GB and 6 indexes in the range of 50 GB to 100 GB. Each of them has 2 shards. How does Elastic figure out which indices have what data?

dadoonet · February 26, 2023, 10:27am

Elasticsearch uses a routing value on which we compute a modulo to define in which shard a document should go.
By default the routing value is the id of the document.
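
To make the idea concrete, here is a minimal sketch of routing by hash and modulo. It is only an illustration: `zlib.crc32` is a stand-in hash chosen for this example, and the function name `shard_for` is hypothetical. Elasticsearch uses its own hash function internally, so the shard numbers it actually picks will differ.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number via hash + modulo.

    zlib.crc32 is a stand-in hash for illustration only; it is not the
    hash Elasticsearch uses, so real shard assignments will differ.
    """
    if num_primary_shards < 1:
        raise ValueError("num_primary_shards must be at least 1")
    hashed = zlib.crc32(routing.encode("utf-8"))
    return hashed % num_primary_shards


if __name__ == "__main__":
    # With the default routing, the routing value is the document id.
    for doc_id in ("doc-1", "doc-2", "doc-3"):
        print(doc_id, "-> shard", shard_for(doc_id, 2))
```

Note that this only decides which shard inside one index receives a document; it does not decide which index the document lives in.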

Chirag_Poddar · February 26, 2023, 12:22pm

@dadoonet This should be for creating documents, correct?
I am looking to optimize my searches, as I have a time-based query and was thinking of refining the search by specifying indexes on my end. I wanted to know whether Elastic uses any method to optimise which indexes/shards to look at while searching. If yes, how does it do that?

leandrojmp · February 26, 2023, 3:04pm

Elasticsearch has had many improvements in search, performance, resource usage and many other aspects in the last couple of years. It would be very hard to explain everything that changed, but you can read the release notes for every version since 6.8 if you want to know what was changed and added.

But in your example, you do not need to specify the exact index that may hold your data when searching. Elasticsearch is already able to do that, and from version 7.16 this got even better.

Take a look at this blog post, which has an example of some of the improvements made.

You should look into upgrading as soon as possible. First you will need to upgrade to the latest 7.17 release, and then you will be able to upgrade to the latest 8.x version.
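
As a rough sketch of the point about not naming indexes by hand: a search can target an index pattern and carry a range filter on the timestamp field, leaving it to Elasticsearch to work out which shards can hold matching documents. The field name `created_at`, the pattern `index-*` and the helper name `time_range_search` are assumptions for illustration and do not come from the thread.

```python
import json
from datetime import datetime, timezone


def time_range_search(field: str, start: datetime, end: datetime) -> dict:
    """Build a search body that filters on a half-open time range [start, end)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {
                        "range": {
                            field: {
                                "gte": start.isoformat(),
                                "lt": end.isoformat(),
                            }
                        }
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    body = time_range_search(
        "created_at",  # hypothetical timestamp field
        datetime(2023, 2, 1, tzinfo=timezone.utc),
        datetime(2023, 3, 1, tzinfo=timezone.utc),
    )
    # This body would be sent as: GET index-*/_search  (pattern is hypothetical)
    print(json.dumps(body, indent=2))
```

The filter is placed in `bool.filter` because it does not need to contribute to scoring.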

Chirag_Poddar · February 26, 2023, 3:35pm

@leandrojmp Okay. Thank you
Is there a way to know when ILM creates an index rather than querying the alias details at an interval? We need this to update the old documents.

Christian_Dahlqvist · February 26, 2023, 4:03pm

I do not see the point with ILM in this scenario. It was designed to handle immutable data, which is not what you have.

If you have a timestamp associated with every document that you can access when updating I would use "traditional" time based indices where the time period covered, e.g. a day or a month, is specified as part of the index name.

Chirag_Poddar · February 27, 2023, 10:29am

Ok. So should we use the Rollover API for creating such indexes? And should we use the created time of that doc to find the index to update?

Christian_Dahlqvist · February 27, 2023, 10:33am

If you use rollover you have the same problem as with ILM (which uses rollover behind the scenes): you do not know which index the data resides in. Instead, create monthly indices with the year and month in the name, e.g. index-2023.02. If you have the timestamp of a document stored elsewhere, you know exactly which index to update based on this.
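
A minimal sketch of that naming scheme, assuming timestamps are treated as UTC and the prefix is `index` as in the example above. The function names `index_for` and `indices_for_range` are hypothetical helpers, not part of any Elasticsearch client.

```python
from datetime import datetime, timezone
from typing import List


def _as_utc(ts: datetime) -> datetime:
    """Convert aware datetimes to UTC; naive ones are assumed to be UTC already."""
    return ts.astimezone(timezone.utc) if ts.tzinfo is not None else ts


def index_for(ts: datetime, prefix: str = "index") -> str:
    """Return the monthly index name for a document timestamp, e.g. index-2023.02."""
    return f"{prefix}-{_as_utc(ts):%Y.%m}"


def indices_for_range(start: datetime, end: datetime, prefix: str = "index") -> List[str]:
    """Return every monthly index name touched by the range start..end (inclusive)."""
    start, end = _as_utc(start), _as_utc(end)
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


if __name__ == "__main__":
    print(index_for(datetime(2023, 2, 25, 9, 38, tzinfo=timezone.utc)))
    print(indices_for_range(datetime(2022, 12, 15), datetime(2023, 2, 1)))
```

The first helper covers updates (one timestamp, one index); the second covers time-filtered searches by listing only the monthly indices that overlap the filter.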

Chirag_Poddar · February 27, 2023, 10:39am

For that exact reason, I am trying to figure out a way in which I can keep a mapping on my own end of index name and creation time while using ILM. Is there a way to find this mapping apart from calling ILM/Rollover explain API at a fixed interval?

I am trying to avoid index creation on my own end as this will add an overhead on the system when I am trying to create a document.

Christian_Dahlqvist · February 27, 2023, 10:45am

I do not understand what the issue is. The way I described is how time-based indices were managed for years before rollover came into the picture. You would have an index template that applied to all new indices matching a specific pattern, and you would create the index name based on the timestamp in the document when writing it. If you know what the original timestamp is for the documents you want to update (kept track of outside Elasticsearch), that is all you need to create the correct index name.
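
For illustration, the body of such an index template might look roughly like the sketch below. The pattern `index-*` and the shard count of 2 are assumptions taken from the examples in this thread, the helper name is hypothetical, and mappings are left out on purpose because their format differs between major versions. The body is shaped for the legacy `_template` endpoint; newer releases also offer composable templates with a different layout.

```python
import json


def legacy_template_body(pattern: str = "index-*", shards: int = 2) -> dict:
    """Build a minimal legacy index template body (settings only, no mappings)."""
    return {
        "index_patterns": [pattern],
        "settings": {"number_of_shards": shards},
    }


if __name__ == "__main__":
    # Would be sent as: PUT _template/<template-name>  (name is up to you)
    print(json.dumps(legacy_template_body(), indent=2))
```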

Chirag_Poddar · February 27, 2023, 1:05pm

Okay.
Is it a bad idea to track the indexes created through rollover and maintain a mapping of each index name to its creation time?

Christian_Dahlqvist · February 27, 2023, 5:45pm

What would be the benefit of this over the approach I suggested?

I think you are overcomplicating things, but it could be that I have misunderstood something.

Chirag_Poddar · February 27, 2023, 7:41pm

I am trying to remove the overhead of creating indexes on the application side. Nothing else.

Christian_Dahlqvist · February 27, 2023, 8:11pm

If you have an index template that matches the set of time-based indices I am suggesting, you just derive the correct index name for each document based on the stored timestamp and send a bulk request to Elasticsearch. When Elasticsearch sees you are indexing into a new index, it will automatically create it. You should therefore not explicitly need to create any indices from the application.
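
A minimal sketch of that flow, assuming each document comes with an id and a stored timestamp. It builds the newline-delimited body for the `_bulk` endpoint and derives the index name per document. The helper name `bulk_body`, the `index` prefix and the sample documents are hypothetical. The optional `doc_type` argument is there because 6.x expects a `_type` in the action metadata while later versions moved away from types; leave it as `None` on newer clusters.

```python
import json
from datetime import datetime
from typing import Iterable, Optional, Tuple

Doc = Tuple[str, datetime, dict]  # (document id, stored timestamp, source)


def bulk_body(docs: Iterable[Doc], prefix: str = "index",
              doc_type: Optional[str] = None) -> str:
    """Build an NDJSON _bulk body, routing each doc to its monthly index."""
    lines = []
    for doc_id, ts, source in docs:
        meta = {"_index": f"{prefix}-{ts:%Y.%m}", "_id": doc_id}
        if doc_type is not None:
            meta["_type"] = doc_type  # only needed on older (6.x) clusters
        lines.append(json.dumps({"index": meta}))  # action line
        lines.append(json.dumps(source))           # document line
    # The bulk format requires every line, including the last, to end with \n.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sample = [
        ("1", datetime(2023, 2, 25), {"msg": "first"}),
        ("2", datetime(2023, 3, 1), {"msg": "second"}),
    ]
    # Would be sent as: POST _bulk with Content-Type: application/x-ndjson
    print(bulk_body(sample), end="")
```

No index is created explicitly here: the application only names the target index, and a matching template supplies the settings when Elasticsearch creates it on first write.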

system · March 27, 2023, 8:11pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.