I need your guidance on the following scenario for an ILM policy implementation:
There is a data source for which I am using an UPSERT Logstash pipeline, because the documents need to be updated based on the transaction ID. Since I cannot use a data stream for upserted data, I have followed the steps in Tutorial: Automate rollover with ILM | Elasticsearch Guide [8.14] | Elastic to enable an ILM policy.
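For context, my output is configured roughly like this (the host, alias, and field names here are examples; the real pipeline uses the transaction ID as the document ID):

output {
  elasticsearch {
    hosts         => ["https://localhost:9200"]
    index         => "my-write-alias"      # rollover alias bootstrapped per the ILM tutorial
    document_id   => "%{transactionId}"    # upsert key: the transaction ID
    action        => "update"
    doc_as_upsert => true
  }
}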
The ILM policy creates new indices according to the configured rule (let's say every 7 days). My concern is about what happens to documents created in one index once an update arrives after ILM has rolled over to a new index: as I understand it, since the upsert finds a new, empty index, the update will be added there as a fresh document, when it is supposed to update the document already present in the earlier index rather than create a new document in the new index.
Hope it makes sense. If not, please ask for clarification.
The upsert will indeed create a new document in the latest index if a rollover has occurred since the last document with that ID was inserted.
If you have a timestamp that is consistent for both the initial document and subsequent updates, it may be easier to use traditional time-based indices instead of rollover, which often does not work well with data that is updated. By traditional time-based indices I mean indices that cover a fixed time period and have the date or month in the index name.
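In Logstash that would look roughly like this (host, index, and field names are examples):

output {
  elasticsearch {
    hosts         => ["https://localhost:9200"]
    index         => "myindex-%{+YYYY.MM.dd}"  # daily index, named from @timestamp
    document_id   => "%{transactionId}"
    action        => "update"
    doc_as_upsert => true
  }
}

Because %{+YYYY.MM.dd} is formatted from @timestamp, the original document and any later update for the same ID will target the same daily index, as long as @timestamp reflects the event's own date.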
Thanks for your quick response on this. In my case, I think it would be difficult to use the time-based indices approach because sometimes data arrives with a lag of, say, 1 day. Are there any other alternatives to make it happen?
Why would that be an issue? That is exactly the case where e.g. daily indices shine. It does, however, require that you have access to the original timestamp, which determines the target index, for all subsequent updates. Is this available?
A lag doesn't matter for daily indices, but it does require you to use a date field from the document. If you are not using a date field from the document, but the @timestamp field generated by Logstash, then this will not work.
For example, if your document has a field with a date string named eventDate, you need a date filter to parse eventDate and replace the @timestamp field.
If you do not have a date filter in your pipeline, then you are using the @timestamp field generated by Logstash, which will be the time when Logstash received the event.
Also, with daily indices you cannot use rollover; you can use ILM only to move indices between data tiers or delete them.
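For example, a minimal delete-only policy could look something like this (the policy name and retention period are placeholders; you would attach it to the daily indices via index.lifecycle.name in an index template):

PUT _ilm/policy/daily-retention
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}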
Yes, I have a time field, eventDate, which is being used as the time field of my index pattern. Currently, the @timestamp field is generated by Logstash. However, we are open to changing it according to your suggestion.
Can you please share an example of a logstash pipeline to accommodate this change?
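Something along these lines should work; the field name eventDate, the date pattern, and the index and host names are assumptions you will need to adapt to your data:

input {
  stdin { codec => json }  # stdin for testing; replace with your real input
}

filter {
  # Parse eventDate and make it the event's @timestamp, so the
  # daily index is always derived from the event's own date
  date {
    match    => ["eventDate", "yyyy-MM-dd'T'HH:mm:ss"]
    target   => "@timestamp"
    timezone => "UTC"
  }
}

output {
  elasticsearch {
    hosts         => ["https://localhost:9200"]
    index         => "myindex-%{+YYYY.MM.dd}"  # %{+...} formats @timestamp, i.e. eventDate
    document_id   => "%{transactionId}"
    action        => "update"
    doc_as_upsert => true
  }
}

With this in place, a late update that arrives a day later still lands in the index for its eventDate, not in the current day's index.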
I have POCed the approach that you mentioned and it worked! Thank you so much for the help.
I have another small requirement: since, after adding the date filter, @timestamp became the eventDate, is it now possible to capture the Logstash insertion datetime somehow?
filter {
  # Copy the original @timestamp (set by Logstash on receipt) to
  # logstash_insertion_timestamp before the date filter overwrites it
  mutate {
    add_field => { "logstash_insertion_timestamp" => "%{@timestamp}" }
  }

  # Parse the EventDate and set it as the new @timestamp
  date {
    match    => ["[parsed_json_soap][EventDate]", "yyyy-MM-dd'T'HH:mm:ss"]
    target   => "@timestamp"
    timezone => "UTC" # Specify the input time zone
  }
}
As the daily indices strategy is working, we can implement this in prod. However, I have a concern about query performance: with a single index per day and a 1-year data retention policy, we would end up with 365 indices. Does that large number of indices impact query performance?
If we go with one index per month instead, we will have 12 indices; will that help query performance?
It will depend on the size of the data and indices. If you go to monthly indices, you can adjust the shard size by increasing the number of primary shards per index. What shard count and size is optimal for your use case depends on your data and queries, so you will need to test it yourself.
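As a sketch, switching the output's index option to "myindex-%{+YYYY.MM}" gives monthly indices, and an index template along these lines controls the primary shard count (the template name and shard count here are placeholder values to test against your own data and queries):

PUT _index_template/myindex-monthly
{
  "index_patterns": ["myindex-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}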