I need your guidance on the following scenario for an ILM policy implementation:
There is a data source for which I am using an UPSERT Logstash pipeline, because the documents need to be updated based on the transaction ID. Since I cannot use a data stream for upserted data, I have followed the steps in Tutorial: Automate rollover with ILM | Elasticsearch Guide [8.14] | Elastic to enable an ILM policy.
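For context, my output is configured roughly like this (the host, alias, and field names here are examples; the real pipeline uses the transaction ID as the document ID):

output {
  elasticsearch {
    hosts         => ["https://localhost:9200"]
    index         => "my-write-alias"      # rollover alias bootstrapped per the ILM tutorial
    document_id   => "%{transactionId}"    # upsert key: the transaction ID
    action        => "update"
    doc_as_upsert => true
  }
}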
The ILM policy creates new indices according to the configured rule (let's say every 7 days). My concern is about what happens to documents created in one index once an update arrives after ILM has rolled over to a new index: as I understand it, since the upsert finds a new, empty index, the update will be added there as a fresh document, when it is supposed to update the document already present in the earlier index rather than create a new document in the new index.
Hope it makes sense. If not, please ask for clarification.
The upsert will indeed create a new document in the latest index if a rollover has occurred since the last document with that ID was inserted.
If you have a timestamp that is consistent for both the initial document and subsequent updates, it may be easier to use traditional time-based indices instead of rollover, which often does not work well with data that is updated. By traditional time-based indices I mean indices that cover a fixed time period and have the date or month in the index name.
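In Logstash that would look roughly like this (host, index, and field names are examples):

output {
  elasticsearch {
    hosts         => ["https://localhost:9200"]
    index         => "myindex-%{+YYYY.MM.dd}"  # daily index, named from @timestamp
    document_id   => "%{transactionId}"
    action        => "update"
    doc_as_upsert => true
  }
}

Because %{+YYYY.MM.dd} is formatted from @timestamp, the original document and any later update for the same ID will target the same daily index, as long as @timestamp reflects the event's own date.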
Thanks for your quick response on this. In my case, I think it would be difficult to use the time-based indices approach because sometimes data arrives with a lag of, say, 1 day. Are there any other alternatives to make it happen?
Why would that be an issue? That is exactly the case where e.g. daily indices shine. It does, however, require that you have access to the original timestamp, which determines the target index, for all subsequent updates. Is this available?
A lag doesn't matter for daily indices, but it does require you to use a date field from the document. If you are not using a date field from the document, but the @timestamp field generated by Logstash, then this will not work.
For example, if your document has a field with a date string named eventDate, you need a date filter to parse eventDate and replace the @timestamp field.
If you do not have a date filter in your pipeline, then you are using the @timestamp field generated by Logstash, which will be the time when Logstash received the event.
Also, with daily indices you cannot use rollover; you can use ILM only to move indices between data tiers or delete them.
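For example, a minimal delete-only policy could look something like this (the policy name and retention period are placeholders; you would attach it to the daily indices via index.lifecycle.name in an index template):

PUT _ilm/policy/daily-retention
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}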
Yes, I have a time field, eventDate, which is being used as the time field of my index pattern. Currently, the @timestamp field is generated by Logstash. However, we are open to changing it according to your suggestion.
Can you please share an example of a logstash pipeline to accommodate this change?
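Something along these lines should work; the field name eventDate, the date pattern, and the index and host names are assumptions you will need to adapt to your data:

input {
  stdin { codec => json }  # stdin for testing; replace with your real input
}

filter {
  # Parse eventDate and make it the event's @timestamp, so the
  # daily index is always derived from the event's own date
  date {
    match    => ["eventDate", "yyyy-MM-dd'T'HH:mm:ss"]
    target   => "@timestamp"
    timezone => "UTC"
  }
}

output {
  elasticsearch {
    hosts         => ["https://localhost:9200"]
    index         => "myindex-%{+YYYY.MM.dd}"  # %{+...} formats @timestamp, i.e. eventDate
    document_id   => "%{transactionId}"
    action        => "update"
    doc_as_upsert => true
  }
}

With this in place, a late update that arrives a day later still lands in the index for its eventDate, not in the current day's index.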
I have POCed the approach that you mentioned and it worked! Thank you so much for the help.
I have another small requirement: since, after adding the date filter, @timestamp became the eventDate, is it now possible to capture the Logstash insertion datetime somehow?
filter {
  # Copy the original @timestamp (set by Logstash on receipt) to
  # logstash_insertion_timestamp before the date filter overwrites it
  mutate {
    add_field => { "logstash_insertion_timestamp" => "%{@timestamp}" }
  }

  # Parse the EventDate and set it as the new @timestamp
  date {
    match    => ["[parsed_json_soap][EventDate]", "yyyy-MM-dd'T'HH:mm:ss"]
    target   => "@timestamp"
    timezone => "UTC" # Specify the input time zone
  }
}
As the daily indices strategy is working, we can implement this in prod. However, I have a concern about query performance: with a single index per day and a 1-year data retention policy, we would end up with 365 indices. Does that large number of indices impact query performance?
If we go with one index per month instead, we will have 12 indices; will that help query performance?
It will depend on the size of the data and indices. If you go to monthly indices, you can adjust the shard size by increasing the number of primary shards per index. What shard count and size is optimal for your use case depends on your data and queries, so you will need to test it yourself.
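As a sketch, switching the output's index option to "myindex-%{+YYYY.MM}" gives monthly indices, and an index template along these lines controls the primary shard count (the template name and shard count here are placeholder values to test against your own data and queries):

PUT _index_template/myindex-monthly
{
  "index_patterns": ["myindex-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}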