The problem I am going to describe is a common one, but I do not know whether Elastic has already provided a solution.
We ingest from an API via Logstash, and we have to query it for the last 7 days, because throughout the week there are modifications to comments and fields that we must re-ingest.
When the index rolls over, the fingerprint-based deduplication stops working because documents no longer land in the same index (at the time we thought the rollover alias would avoid this, but it does not), and data gets duplicated for a week.
We understand that this is a very common problem and that perhaps a fix or workaround already exists.
Is there currently any solution for preventing duplicates after rollover?
Thanks in advance, and sorry for my bad English!
Best regards.
And what should I do if my ingest is about 10 GB per day and a year of retention is required?
Would I create a single, heavily oversized shard on which I would have to perform manual deletions? This would practically tie up one or a few nodes and produce big imbalances between my nodes.
As long as you know the original timestamp of the document you want to update, you can use old-fashioned time-based indices with the day, week or month they cover in the index name. The original timestamp determines which index the document goes to, which makes updates easy. You can still use ILM to delete indices based on age after their creation date.
If you do not have any timestamp that can help you send documents and subsequent updates to a single index, you may need to resort to one large index and use delete by query to remove data. Note that this is much more resource-intensive than deleting full indices.
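For illustration, a sketch of such a delete by query, assuming the index is called project-pro and documents carry an @timestamp field (both are placeholders, not your actual names), removing everything older than one year:

POST project-pro/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-365d"
      }
    }
  }
}

Deleting a whole time-based index, by contrast, is a near-instant metadata operation, which is why the time-based approach is preferred when a timestamp is available.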
Yes, Logstash can determine the index name based on the @timestamp field, and that works as you indicated. With that timestamp the document might go into an index named e.g. project-pro-2023.09.28.
Yes. When a document is new, it is inserted into an index based on its timestamp. Updates use the same timestamp and therefore go to the same index. Depending on when the updated documents were created, the updates may be spread across quite a few of your indices.
ILM will delete indices based on when they were created.
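As a sketch, a policy with only a delete phase would look like the following (the policy name and the 365-day retention are assumptions); with no rollover action configured, min_age is measured from index creation:

PUT _ilm/policy/project-pro-retention
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

The policy would then be referenced from the index template that matches your project-pro-* indices.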
Make sure you set the @timestamp field to the timestamp you want to use for routing in your Logstash pipeline, and make sure it is the same for the initial write and for all subsequent updates of that particular document.
In your Elasticsearch output you then set index as follows:
index => "project-pro-%{+YYYY.MM.dd}"
This lets Elasticsearch create new indices with this naming convention as data is written to them. No rollover or initialisation is required, apart from verifying that any index template you want to apply to new indices is set up.
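Putting the pieces together, a minimal pipeline sketch; the comment_date and comment_id source fields are assumptions standing in for whatever fields carry your documents' original timestamp and identity:

filter {
  # Route on the document's original timestamp so the initial write
  # and every later update target the same daily index
  date {
    match  => ["comment_date", "ISO8601"]
    target => "@timestamp"
  }
  # Derive a stable document id so re-ingested documents overwrite
  # earlier versions instead of creating duplicates
  fingerprint {
    source => ["comment_id"]
    method => "SHA256"
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]
    index       => "project-pro-%{+YYYY.MM.dd}"
    document_id => "%{[@metadata][fingerprint]}"
  }
}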
Okay. Normally ILM policies calculate the delete phase from rollover. If I disable rollover, will Elasticsearch calculate the delete from the creation of the index?