Rollover index data duplication, data coming from Logstash

Hello,

I'm facing an issue. To elaborate: I have 40 Elasticsearch indices and they are handled by an ILM policy with rollover defined. The ILM policy rolls data over to a new index each day and deletes the index 5 days after rollover.

I have a requirement to use a single policy for all indices, and this causes duplication of data.
Logstash mostly runs a Perl-based script to fetch data and insert it into the index. To avoid duplication of data I'm using doc_as_upsert in Logstash (the same documents get updated, with a unique ID maintained). This way duplication is avoided.

Now the challenge is that for a given index pattern I have 5 rollover indices created by the policy, and when I do a metric or statistics visualization the data comes out inaccurate due to duplication.

Reason noticed: Within the same index (the current rollover index) Elasticsearch knows to update the documents, but I want Logstash to somehow check the rollover indices for the document ID, update the documents there if present, and only send new documents to the current index. This way duplication would be avoided.

Possible solution: maintain only one giant index; that way documents would always be unique within the current index.
Reason I can't use only one index: for dev, prod and quality environments data would be retained for e.g. 5, 10 or 15 days. If I use a single index, i.e. write to the same index for 30 days and delete it afterwards, then the customer's whole data set is lost at once.
Historical data is needed. If I use only one index, then when the policy deletes the index all data is gone; and if I use rollover indices (roll over every day and delete after 5 days), that causes duplication.

I'm confused about how to handle this issue.
I want some way to tell Logstash to check for the documents in the rollover indices, update them there if present, and only index new documents into the current rollover index.

Any suggestion would be helpful.

Thanx

Hi @PRASHANT_MEHTA

This duplication has nothing to do with Logstash, Filebeat etc.

Here is a simple explanation which I think will make it clear...

Yesterday you were writing documents to the write alias mis-logs, which was pointing to the concrete index mis-logs-2023.05.05-000001

You write the document

{
  "_id" : 1234567,
  "foo" : "bar"
}

Today you are writing what you intend to be an update of that document, but now the write alias mis-logs points to the concrete index mis-logs-2023.05.06-000002

This document is

{
  "_id" : 1234567,
  "foo" : "not bar"
}

Even if you try to doc_as_upsert that will not work as the write alias no longer points to the index where the original document exists.

A doc_as_upsert must target the same index AND _id.
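
Conceptually, the kind of request doc_as_upsert boils down to is something like this (index name and values borrowed from the example above):

POST mis-logs-2023.05.05-000001/_update/1234567
{
  "doc": { "foo": "not bar" },
  "doc_as_upsert": true
}

If the request is sent to the write alias mis-logs instead, it is resolved against whatever index the alias currently marks as its write index, so today's update lands in mis-logs-2023.05.06-000002 as a brand-new document.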

This is the behavior you are seeing...

It seems that you want to treat time-series data as updateable (which is neither the common nor the optimized case) AND to have index rollover and ILM policies with delete etc.

There is no simple, direct way to do this with Elasticsearch... no matter what components you use, Filebeat, Logstash etc.

If you want ILM then you must use write alias and rollover

If you want to use doc_as_upsert you can not use write alias and rollover

If you want to do this you will need to write custom code, using one of the language clients, that searches for the document first, finds it, and then writes the doc_as_upsert to that index and _id. A rough sketch is below.
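
Purely as an illustration (not a recommendation), a minimal sketch with the 8.x Python client might look like this; the alias and index pattern names are borrowed from the example above and error handling is omitted:

# Minimal sketch: find the rollover index that already holds the document,
# otherwise fall back to the write alias for new documents.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust for your cluster

def upsert_into_correct_index(doc_id, doc,
                              index_pattern="mis-logs-*",
                              write_alias="mis-logs"):
    # Search all backing indices of the rollover series for this _id.
    # Note: very recently indexed documents may not be visible until the next refresh.
    hits = es.search(index=index_pattern,
                     query={"ids": {"values": [doc_id]}})["hits"]["hits"]
    # Update in place if the document exists, otherwise write to the current write index.
    target = hits[0]["_index"] if hits else write_alias
    es.update(index=target, id=doc_id, doc=doc, doc_as_upsert=True)

Note that this doubles the number of requests per document and will not scale well for high-throughput pipelines, which is why it is not a common pattern.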

Perhaps someone else has a solution.

As you head to 8.x and data streams... time-series data is basically "immutable".

You will need to handle your updates in another manner...

There is the idea of a "latest" transform, which keeps only the latest version of a document, but that is not available in 7.9.

As Stephen describes, rollover was designed to help solve a common but quite specific problem: create indices and shards of reasonably uniform size when data volumes fluctuate over time and the data is immutable.

It is now more or less the standard and is often recommended on this forum even for scenarios and use cases where it is not really suitable, or at least not a natural fit. A long time ago I wrote a blog post about duplicate prevention, and in it I discuss the problems with using rollover if you need to perform duplicate prevention the way you describe. Even though it is old, I think it is still largely applicable.

Before rollover was available in Elasticsearch the standard way of handling time-series data was to create indices with the time period they cover reflected in the name. This could be daily (e.g. logstash-2023.05.07), weekly (e.g. logstash-2023.14) or monthly (e.g. logstash-2023.05). The time period each index would cover would be set up front in the Logstash configuration as this determines the index name to write to. The number of primary shards would be adjusted periodically through an index template depending on the projected data volumes so that the expected shard size would not be too large. With this approach each piece of data, if associated with a specific timestamp, can be uniquely sent to one specific index, which allows for deduplication.
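
For illustration, this is roughly what such an index template could look like (names and shard counts are examples only):

PUT _index_template/logstash-daily
{
  "index_patterns": ["logstash-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}

The number_of_shards value is what would be revised periodically as projected volumes change.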

This approach naturally has the drawback that the size of indices and shards can vary significantly, which is something that is generally not desirable.

I believe it should be possible to use ILM with indices created according to the scheme described above, as the use of rollover should be optional. It will, however, as far as I know, delete indices based on when they were created, so it will not automatically take the timestamp component of the index name into account, which may or may not be an issue.

As an alternative to ILM there is the old and trusted Curator. This runs externally to the cluster through cron and has more features than ILM and therefore offers more flexibility.

The best way to handle this would however IMHO be to make sure you avoid duplicates when you extract your data in the first place. This may mean you may need to enhance your perl scripts or possibly switch to some other extraction mechanism. It is generally a good idea to stick with the standard recommended approach (rollover and data streams) if possible. If this is not an option I would recommend looking into the options I outlined above.


My two cents: You are looking for an OLTP engine, not OLAP. Generally, updates on OLAP engines are handled via data model implementations, such as type 2 or 3 dimensions. You could technically always append and look for the latest record (TS). This isn't deduping, but it's a method that OLAP engines are best designed for.

Additionally, consider using Flink if you absolutely need to handle deduping of logs at an unreasonable scale in a stream. Those deduped datasets would land in Elasticsearch (ES) for analytical queries (i.e. OLAP).

Hello @Sunile_Manjee

Thanx for your time to look into this and for your explanation.
I'm not sure how to best handle this; I've tried several approaches but have been unable to find a good solution.
I do need to use an ILM policy to handle the data, but by using rollover indices duplication occurs.
Earlier I thought of maintaining a single index so that duplication is avoided, but I don't want to do anything manual like deleting data from the index using the API.
My implementation is installer based. Once the customer clicks through the installer, all ELK components get installed on the respective servers, and the indices are managed by ILM.
So now if I do any statistical aggregation (sum, average etc.), the same document is present in multiple rollover indices and the visualization shows the wrong value.
Let me know how I can best handle this. I read your comment above and partially understood it, but I'm not sure how to best tackle this.

Many Thanx

Look at the link I provided in my last response. Use traditional indices with the date in the name (month or day precision depending on the data volumes and retention period). Then create an ID for each log line that will be unique, e.g. a hash of the full log line or something like that. You can still use ILM to manage the life cycle without adopting rollover.

Hi @Christian_Dahlqvist

Perhaps you can help me a bit 🙂

Reading the blog, I am still confused about how the update happens if I am writing the new document (the duplicate) to today's daily index

my-index-2023.05.23

But the original is in yesterday's

my-index-2023.05.22

This would only seem to work if there were a single index in total? I am sure I am missing something.

Or are you saying parse the event time and use that to identify the correct index to write to?

If you set up the Logstash Elasticsearch output something like this:

output {
  elasticsearch {
    index => "my-index-%{+YYYY.MM.dd}"
  }
}

The timestamp in the name of the index the event will be written to is defined by the @timestamp field. For this to work it is therefore assumed that a timestamp has been parsed out of the event and put into @timestamp. As the timestamp is related to the event and consistent, the correct index will be updated.

I did leave out setting of the document_id parameter in the config example, but that is naturally also required.
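
For completeness, one possible form with that parameter added, assuming a field named fingerprint already holds a unique, deterministic ID for the event (for example generated by the fingerprint filter):

output {
  elasticsearch {
    index         => "my-index-%{+YYYY.MM.dd}"
    document_id   => "%{[fingerprint]}"
    action        => "update"
    doc_as_upsert => true
  }
}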


Funny, all this time I thought that was just the current UTC time... So yes, that will work!

@PRASHANT_MEHTA you have your answer; you can do that.

Note that this approach does require that you have sensible timestamps in your data. I have on occasion seen incorrect timestamp parsing or bad timestamps in the input data create a large number of indices far into the past and/or future, which can be quite problematic.

Hello @stephenb ,

Is there any other way to handle this in version 8.7.1? We need the ILM approach.

I'm not sure what you meant by this or how it would help:
There is the idea of a "latest" transform, which keeps only the latest version of a document, but that is not available in 7.9

What if ILM is not used and only a single index is maintained; how can the data then be managed if its size becomes too large? It should not require any manual steps like running commands in Dev Tools or via curl, more like how ILM works.

Thanx

You should be able to use ILM with the approach I described, just not the rollover part of it.

I do not think transforms will be a solution to this issue.

If you maintain a single large index, which would allow you to avoid duplicates, note that there are several drawbacks with this approach:

  • As you are specifying an external document ID, inserts will need to be treated as potential updates, which results in slower indexing the larger the shards get.
  • You will need to delete data through delete-by-query, which is inefficient, adds load to the cluster and also requires extra disk space, as deleted documents are marked with a tombstone before later being removed through merging.
  • You will need to set up a cron job or similar that runs the delete-by-query task/script periodically (see the sketch below); there is nothing built into Elasticsearch that does this for you.
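
As a rough illustration, such a periodic cleanup could be a request along these lines (index name and retention are placeholders):

POST my-index/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": { "lt": "now-30d" }
    }
  }
}

Deleting complete time-based indices is far cheaper than this, which is one more reason the single large index approach is not recommended.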

@PRASHANT_MEHTA @Christian_Dahlqvist ,

I am facing a similar issue in our use cases. The nature of the data is such that logs are read and written to an index, and I use the Logstash upsert functionality to update the documents.
The data gets updated in the current rollover index, and ideally, if the same document is already present in an earlier rollover index, it should be updated there, and only new documents should be written to the current index.
So I also face issues in Kibana visualizations when showing calculations or metrics.
For example, instead of summing data only in the latest index, it sums across all indices, as the document ID is the same across them.
I did read the comments by the Elastic engineers, @Christian_Dahlqvist, but I would like to understand how you would tackle such a situation better. In the community I've read various posts about this but didn't find a concrete solution. This is a messy situation with rollovers; there is a GitHub issue about it with no concrete solution so far.
@PRASHANT_MEHTA were you able to find a solution for this? If yes, please let me know, as it would be a great help.

This is not possible.

This is not possible.

I have provided a concrete solution described in the blog post I linked to (use time-based indices without rollover) and further clarified in this thread.

@Christian_Dahlqvist
What I understood is that with rollover, duplication will occur, and as per your recommendation I should not use rollover indices in order to avoid duplication.
But what if the index size becomes too large, and how will this be managed without ILM?
We need ILM and don't want to do manual things to delete data.
With this particular problem I understood that advanced analytics can't be done with Kibana due to the data behaviour.
It would be my request that something be done on the Elastic end, i.e. if a document is already present in a rollover index the update should happen there, and only new documents should be written to the current index.
I did see a GitHub issue created on this without any solution yet.

After reading this thread I'd like to give a simple use-case example.
Consider Python scripts executing on multiple servers and generating data every 5 hours, with the documents created always being updated using Logstash doc_as_upsert. This causes the same document to be present in multiple indices.
In order to avoid this I thought of this solution:
Maintain the data in a single index for 7 days (hot phase) and delete it 1 millisecond after rollover.
Now when the new rollover index is created all previous data is deleted, but then the customer's historical data is gone.

I now understand that the way I'm managing data violates the Elastic way of working, but isn't this a blocker for many advanced analytics solutions?
Does Elastic provide any help, discussion or consultation on this? We're currently exploring Elastic for our use cases, and if it works best for us then we will opt for the paid version.

Even with your recommended solution there is a possibility of duplication, since, as you mentioned, bad timestamps may occur. This causes production analytics issues which are quite critical.

You can use ILM without rollovers, you just need to use time-based indices that will rotate daily, weekly, monthly or yearly depending on your needs and configurations.

This adds a little more work because you need to know your data and configure the number of shards according to it.

For example, if your data has an average of 1 GB per day, in 30 days you would have something close to 30 GB.

But if you have another kind of data that averages 20 GB per day and you use monthly indices, you would have a 600 GB index at the end of the month. Since the recommendation is to keep shard sizes between 10 GB and 50 GB, you would need about 12 primary shards for this index.

This is basically how people organized indices before Elastic created ILM and rollovers.

You can configure an ILM policy without a rollover if you use time-based indices; this way your ILM policy will still be able to move your data to the warm phase or delete it. A sketch of what that could look like is below.
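
As a sketch (policy name and retention are illustrative), such a policy could contain only a delete phase, with min_age counted from index creation since there is no rollover, and be attached to the time-based indices through the index.lifecycle.name setting in their index template:

PUT _ilm/policy/delete-after-30d
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}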

I don't think this is true, it is just a matter of how the data is organized.

If you are working with unique IDs then you cannot have duplicates; since rollover can lead to duplicate data, you should not use it and should think of another way to store your data.

From what I understand your use case does not work if you use rollover, but it may work if you use time-based indices.

Before rollover was available, the size of indices was regularly monitored, and gradual increases in data volume would periodically result in a modification of the number of primary shards in the index templates. If you have sudden spikes in traffic you nowadays have the option to use the split index API once the index is no longer written to (or to make the index read-only just while it is being split). This results in a new index name, but you can use an alias to restore access under the old name, as this is likely not a common occurrence.
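
For reference, a split along those lines might look roughly like this (index name and shard count are illustrative; the target shard count must be a multiple of the source's):

PUT my-index-2023.05.07/_block/write

POST my-index-2023.05.07/_split/my-index-2023.05.07-split
{
  "settings": {
    "index.number_of_shards": 4
  }
}

Once the split has completed and the original index has been deleted, an alias with the old name can be pointed at the new index.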

Hello @Jason_Paralta ,

I have come to the conclusion that this can't be managed better (maybe I am wrong). After reading various blogs and posts in the community, I understood that if you have frequent updates in your data then rollover is not the best solution, as it causes duplicates.
If I use only a single index this resolves all issues, but then the very benefit of ILM automatically managing my indices vanishes.
I'm still working on this; if I find a worthwhile solution I will post it here.
Alternatively, I'm following this post on Stack Overflow. I'm unable to understand the filter that james jiang mentioned there or how he implemented it; and even if it is possible, searching for a document in previous rollover indices is a big task for Logstash processing.

Conclusion: using only a single index resolves duplicates, but if ILM is used on it then the entire data set is deleted at once, and if you require historical data it's lost.
How to better send the data is the question here. For pure time-series data it's not an issue; it's only frequent data updates combined with rollover that don't work well.

If you use a date filter to reliably parse a consistent timestamp in the record into the @timestamp field, and also use a hash function to generate a unique and deterministic document ID through the fingerprint filter, you have the solution I described in the blog post I linked to.
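
For reference, a minimal filter sketch along those lines (field name and timestamp format are illustrative), which pairs with an output where document_id is set to the fingerprint field as shown earlier in the thread:

filter {
  date {
    match  => ["log_timestamp", "ISO8601"]
    target => "@timestamp"
  }
  fingerprint {
    source => ["message"]
    method => "SHA256"
    target => "fingerprint"
  }
}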

I suspect the reason he is ending up with duplicates is that he is not parsing a consistent date into @timestamp (@timestamp is by default populated with the received time, which can lead to different indices if records arrive very late).

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.