Duplicates after index rollover

Good afternoon.

The problem I am going to describe is a common one, but I do not know whether Elastic has already provided a solution.

We ingest data from an API via Logstash, and on every run we have to query the last 7 days again, because comments and fields are modified throughout the week and we must re-ingest them.
When the index rolls over, the fingerprint stops working because the write index is no longer the same (at the time we thought the rollover alias would avoid this, but it does not), so data is duplicated for a week.

We understand that this is a very common problem and that a fix or workaround may already exist.

Is there currently any solution for preventing duplicates after rollover?

Thanks in advance, and sorry for my poor English!
Best regards. :slight_smile:

No. The solution is to not use rollover when data is not immutable. Have a look at this thread for a recent discussion on the topic.

And what should I do if my intake is about 10 GB per day and a year of retention is required?

Would I have to create a single, very oversized shard and delete from it manually? That would practically block one or a few nodes and cause big imbalances between my nodes.

Could Curator or something similar help me in any way?

I do not think that going without rollover is an option.

Did you read the thread I linked to as well as the resources linked to from that thread?

Yes. I read this:

The best way to handle this would however IMHO be to make sure you avoid duplicates when you extract your data in the first place.

So I read the other blog post.

Is your first recommendation to not roll over the index and, if the index gets too big, to split it periodically?

I do not know exactly which option would be best, or whether there is one at all. Sorry if I am not understanding you well. :frowning:

As long as you know the original timestamp of the document you want to update you can use old-fashioned time-based indices with the day, week or month they cover in the index name. The original timestamp will determine which index the document goes to, which makes updates easy. You can still use ILM to delete indices based on age after creation date.

If you do not have any timestamp that can help you to send documents and subsequent updates to a single index you may need to resort to one large index and use delete by query to delete data. Note that this is much more resource intensive than deleting full indices.

So if the date of the record is 2023-09-28 19:40:00,

I would extract the year and month into fields for the index name with a grok pattern similar to
(?<year>(?>\d\d){1,2})-(?<month>0?[1-9]|1[0-2])

and then join them for the index name in the Logstash output:
index => "project-pro-%{year}-%{month}"

That way, would the records always go to their corresponding index so that they can be updated?

Maybe I'm completely wrong.

If this were a valid method, there would have to be multiple write indices. How would ILM work correctly here?

Yes, Logstash can determine the index name based on the @timestamp field and that works like you indicated. With that timestamp the document may go into an index named e.g. project-pro-2023-09-28.

Yes. When a document is new it is inserted into an index based on the timestamp. Updates use the same timestamp to go to the same index. It is possible that updates will be going to quite a few of your indices, depending on when the updated documents were created.

ILM will delete indices based on when they were created.

I usually initialize the index with the following command:

PUT /%3Cproject-pro-%7Bnow%2Fd%7D-00001%3E
{
  "aliases": {
    "{{same index as logstash output}}": {
      "is_write_index": true 
    }
  }
}

With this command the index follows the ILM rollover policy and everything works.

How should I initialize the indices for the method we are discussing?
Sorry.

How are you indexing data into Elasticsearch?

With a Logstash output like this:

output {
    elasticsearch {
        hosts => ["https://elastic-balancer:9200"]
        user => "${elastic}"
        password => "${password}"
        index => "project-pro-ro-alias"
        document_id => "%{identificator}"
        action => "update"
        doc_as_upsert => "true"
    }
}

But first I initialize the index like this:

PUT /%3Cindex-name-%7Bnow%2Fd%7D-001%3E
{
  "aliases": {
    "project-pro-ro-alias": {
      "is_write_index": true
    }
  }
}

Should I use something like this

    ilm_enabled => true
    ilm_write_alias => "logstash"
    index => "logstash"
    ilm_pattern => "000001"
    ilm_policy => "logstash"

in this case?

Thank you so much for the great help!

Make sure you set the @timestamp field to the timestamp you want to use for routing in your Logstash pipeline. Make sure this is the same for the initial write as for all subsequent updates of that particular document.
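A minimal sketch of that step, assuming the API returns the record's original creation time in a field called created_at (a hypothetical name, substitute your own) in the form 2023-09-28 19:40:00, and that those times are in UTC:

```
filter {
  date {
    # created_at is a hypothetical field name holding the record's original
    # creation time; it must carry the same value on every re-ingest.
    match => ["created_at", "yyyy-MM-dd HH:mm:ss"]
    # Assumption: the source timestamps are in UTC; adjust if they are not.
    timezone => "UTC"
    # Write the parsed value to @timestamp, which the index name is built from.
    target => "@timestamp"
  }
}
```

If the field cannot be parsed, the date filter tags the event with _dateparsefailure and @timestamp keeps the ingest time, so such events would be routed to the wrong index. It is worth checking for that tag.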

In your Elasticsearch output you then set index as follows:

index => "project-pro-%{+YYYY.MM.dd}"

This will allow Elasticsearch to create new indices with this naming convention as data is written to them. No use of rollover or initialisation required, apart from verifying that you have any index template you want to apply to new indices set up.
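Combined with the output posted earlier in this thread, that could look like the untested sketch below. It only swaps the rollover alias for the date-based index name; every other setting is copied from that output.

```
output {
  elasticsearch {
    hosts => ["https://elastic-balancer:9200"]
    user => "${elastic}"
    password => "${password}"
    # Index name is derived from @timestamp, so an update carrying the same
    # timestamp resolves to the same index as the original write.
    index => "project-pro-%{+YYYY.MM.dd}"
    # Same ID plus same index means the update hits the existing document.
    document_id => "%{identificator}"
    action => "update"
    doc_as_upsert => true
  }
}
```

Because the index name now depends only on @timestamp, a re-ingested record with an unchanged timestamp updates the existing document instead of creating a second copy in a newer index. For monthly rather than daily indices the pattern would be %{+YYYY.MM}.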

ilm_enabled => true
ilm_write_alias => "logstash"
index => "logstash"
ilm_pattern => "000001"
ilm_policy => "logstash"

No. You will set up an ILM policy in Kibana, not using Logstash.

Okay. Normally ILM policies calculate the delete phase from the time of rollover. If I disable rollover, will Elasticsearch calculate it from the creation of the index?

Could this policy work?

PUT _ilm/policy/<policyName>
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "set_priority": {
            "priority": 100
          }
        },
        "min_age": "0ms"
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Yes, it will delete based on time since index creation. What you posted looks reasonable, but I have not tested it.

Perfect! I will try it ASAP and let you know the results!

Thank you so much for the support Christian! :slight_smile:
