How to deal with duplicate data

I have a Filebeat pipeline that ingests logs from an end-user machine, where they can be stored for up to 30 days. My Logstash pipeline has the following settings:

document_id => "%{[@metadata][newId]}"
action => "create"

Because I wanted to make sure that the same log is never written into the database twice, I set the action to create and generate my own unique document_id. That setup works fine for me, but recently Filebeat was uninstalled on one machine and its registry folder was wiped clean. Now that it has been re-installed, it tries to re-ingest 30 days' worth of logs and write them all again.
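For reference, those two settings sit inside the elasticsearch output; a rough sketch of the relevant block (the host and index name here are placeholders, not my real values, and the filter that builds [@metadata][newId] is not shown):

output {
  elasticsearch {
    hosts => ["https://localhost:9200"]      # placeholder cluster address
    index => "logs-%{+YYYY.MM.dd}"           # placeholder time-based index name
    document_id => "%{[@metadata][newId]}"
    action => "create"
  }
}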

That results in errors because the "create" action doesn't allow existing documents to be overwritten. I could change it to "update", but then another issue pops up: some of the indices it tries to update are already in the warm tier and are no longer flagged as write indices.

Any idea how I should handle this situation? Logstash is returning a lot of errors because "create" fails on documents that already exist, and that slows ingestion to a crawl.

I was thinking that I could change the action to "update" and set doc_as_upsert to true to make sure I can overwrite existing documents. Then I would need to somehow make the older indices writable again. Should I reindex that older data into the hot tier for now, let Logstash overwrite all of the documents, and then move it back to warm? Does that sound like a reasonable thing to do? Are there any other ways to deal with this issue?
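The output change I have in mind would look roughly like this (same placeholder host and index name as above):

output {
  elasticsearch {
    hosts => ["https://localhost:9200"]      # placeholder cluster address
    index => "logs-%{+YYYY.MM.dd}"           # placeholder time-based index name
    document_id => "%{[@metadata][newId]}"
    action => "update"
    doc_as_upsert => true                    # create the document if it does not exist yet
  }
}

And if the older indices turn out to be write-blocked (I have not confirmed that this is actually why they reject writes), I assume something like this would lift the block on one of them, where my-old-index is a placeholder name:

curl -XPUT 'localhost:9200/my-old-index/_settings?pretty' -H 'Content-Type: application/json' -d'{ "index.blocks.write": false }'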

But is this really an issue? This is expected: the document already exists, so the create request will be rejected. You can just ignore those errors.

What does the rest of your output look like? Are you using data streams? Normal indices? Are your indices time-based?

Just saw this. The performance impact could be related to the volume of logs being written; I had a similar issue in the past.

One quick solution would be to simply change the Logstash log level for the Elasticsearch output, for example to only log at the ERROR or FATAL level. I'm not sure at which level the current create errors are logged.

Something like this:

curl -XPUT 'localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d'{ "logger.logstash.outputs.elasticsearch" : "FATAL"}'
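If I remember correctly, once the backlog has been processed you can revert the change with the logging reset endpoint; worth double-checking against the Logstash docs for your version:

curl -XPUT 'localhost:9600/_node/logging/reset?pretty'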

There are a lot of native Elastic Agent integrations that rely on using a custom _id to avoid duplicate events.

This id normally comes from a field in the original message, or from a fingerprint of one or more of its fields.

Using a custom _id value is how you avoid duplicate data in Elasticsearch; it is a common approach for deduplicating events.
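In Logstash this is typically done with the fingerprint filter. A minimal sketch, where the source fields and the [@metadata][newId] target are just assumptions based on the settings you shared:

filter {
  fingerprint {
    # assumption: these fields uniquely identify a log line in your data
    source => ["[host][name]", "[log][file][path]", "message"]
    concatenate_sources => true
    method => "SHA256"
    target => "[@metadata][newId]"
  }
}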

If these logs are spread across many files, you can set ignore_older to roughly the number of days since the registry was reset.

If a file is older than ignore_older, Filebeat will add it to its registry with the offset set to the end of the file, and afterwards you can simply revert or remove the ignore_older setting.
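In the Filebeat config that would look something like this, assuming a filestream input and that the registry was reset roughly two days ago (the id, path and duration are placeholders):

filebeat.inputs:
  - type: filestream
    id: enduser-logs            # placeholder input id
    paths:
      - /var/log/myapp/*.log    # placeholder log path
    ignore_older: 48h           # skip files not modified in the last 48 hours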

Similarly, you could just add a processor in Filebeat, or a filter in Logstash, that drops events older than the last message processed from that device before the registry reset.
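On the Logstash side that could be a small ruby filter; just a sketch, where the cutoff (the epoch seconds of the last event indexed before the reset) is a placeholder you would have to look up yourself:

filter {
  ruby {
    # 1714521600 is a placeholder for the epoch seconds of the last indexed event
    code => "event.cancel if event.get('@timestamp').to_i < 1714521600"
  }
}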


Obviously I stand corrected, thank you. On reflection, I have deleted the above comment as a) it was not helpful and b) it was also factually wrong. Apologies.