The problem I am going to describe is a common one, but I do not know whether Elastic has already provided a solution.
We ingest from an API via Logstash, and we have to query it for the last 7 days, because throughout the week there are modifications to comments and fields that we must re-ingest.
When the index rolls over, the fingerprint-based deduplication stops working because documents no longer land in the same index (at the time we thought the rollover alias would avoid this, but it does not), and data gets duplicated for a week.
We understand that this is a very common problem and that perhaps a fix or workaround already exists.
Is there currently any solution for preventing duplicates after rollover?
Thanks in advance, and sorry for my bad English!
Best regards.
And what should I do if my ingest is about 10 GB per day and a year of retention is required?
Would I create a single, heavily oversized shard on which I would have to perform manual deletions? This would practically tie up one or a few nodes and produce big imbalances between my nodes.
As long as you know the original timestamp of the document you want to update, you can use old-fashioned time-based indices with the day, week or month they cover in the index name. The original timestamp determines which index the document goes to, which makes updates easy. You can still use ILM to delete indices based on age after their creation date.
If you do not have any timestamp that can help you send documents and subsequent updates to a single index, you may need to resort to one large index and use delete by query to remove data. Note that this is much more resource-intensive than deleting full indices.
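For illustration, a sketch of such a delete by query, assuming the index is called project-pro and documents carry an @timestamp field (both are placeholders, not your actual names), removing everything older than one year:

POST project-pro/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-365d"
      }
    }
  }
}

Deleting a whole time-based index, by contrast, is a near-instant metadata operation, which is why the time-based approach is preferred when a timestamp is available.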
Yes, Logstash can determine the index name based on the @timestamp field, and that works as you indicated. With that timestamp the document might go into an index named e.g. project-pro-2023.09.28.
Yes. When a document is new, it is inserted into an index based on its timestamp. Updates use the same timestamp and therefore go to the same index. Depending on when the updated documents were created, the updates may be spread across quite a few of your indices.
ILM will delete indices based on when they were created.
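As a sketch, a policy with only a delete phase would look like the following (the policy name and the 365-day retention are assumptions); with no rollover action configured, min_age is measured from index creation:

PUT _ilm/policy/project-pro-retention
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

The policy would then be referenced from the index template that matches your project-pro-* indices.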
Make sure you set the @timestamp field to the timestamp you want to use for routing in your Logstash pipeline, and make sure it is the same for the initial write and for all subsequent updates of that particular document.
In your Elasticsearch output you then set index as follows:
index => "project-pro-%{+YYYY.MM.dd}"
This lets Elasticsearch create new indices with this naming convention as data is written to them. No rollover or initialisation is required, apart from verifying that any index template you want to apply to new indices is set up.
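Putting the pieces together, a minimal pipeline sketch; the comment_date and comment_id source fields are assumptions standing in for whatever fields carry your documents' original timestamp and identity:

filter {
  # Route on the document's original timestamp so the initial write
  # and every later update target the same daily index
  date {
    match  => ["comment_date", "ISO8601"]
    target => "@timestamp"
  }
  # Derive a stable document id so re-ingested documents overwrite
  # earlier versions instead of creating duplicates
  fingerprint {
    source => ["comment_id"]
    method => "SHA256"
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]
    index       => "project-pro-%{+YYYY.MM.dd}"
    document_id => "%{[@metadata][fingerprint]}"
  }
}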
Okay. Normally ILM policies calculate the delete phase from rollover. If I disable rollover, will Elasticsearch calculate the delete from the creation of the index?