How to deal with duplicate data

I have a Filebeat pipeline that ingests logs from an end-user machine, where they can be stored for up to 30 days. My Logstash pipeline has the following settings:

document_id => "%{[@metadata][newId]}"
action => "create"

Because I wanted to make sure that the same log is never written into the database twice, I set the action to create and generate my own unique document_id. That setup works fine for me, but recently Filebeat was uninstalled on one machine and its registry folder was wiped clean. Now that it has been re-installed, it tries to re-ingest 30 days' worth of logs and write them all again.
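For reference, those two settings sit inside the elasticsearch output; a rough sketch of the relevant block (the host and index name here are placeholders, not my real values, and the filter that builds [@metadata][newId] is not shown):

output {
  elasticsearch {
    hosts => ["https://localhost:9200"]      # placeholder cluster address
    index => "logs-%{+YYYY.MM.dd}"           # placeholder time-based index name
    document_id => "%{[@metadata][newId]}"
    action => "create"
  }
}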

That results in errors because the "create" action doesn't allow existing documents to be overwritten. I could change it to "update", but then another issue pops up: some of the indices it tries to update are already in the warm tier and are no longer flagged as write indices.

Any idea how I should handle this situation? Logstash is returning a lot of errors because "create" fails on documents that already exist, and that slows ingestion to a crawl.

I was thinking that I could change the action to "update" and set doc_as_upsert to true to make sure I can overwrite existing documents. Then I would need to somehow make the older indices writable again. Should I reindex that older data into the hot tier for now, let Logstash overwrite all of the documents, and then move it back to warm? Does that sound like a reasonable thing to do? Are there any other ways to deal with this issue?
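The output change I have in mind would look roughly like this (same placeholder host and index name as above):

output {
  elasticsearch {
    hosts => ["https://localhost:9200"]      # placeholder cluster address
    index => "logs-%{+YYYY.MM.dd}"           # placeholder time-based index name
    document_id => "%{[@metadata][newId]}"
    action => "update"
    doc_as_upsert => true                    # create the document if it does not exist yet
  }
}

And if the older indices turn out to be write-blocked (I have not confirmed that this is actually why they reject writes), I assume something like this would lift the block on one of them, where my-old-index is a placeholder name:

curl -XPUT 'localhost:9200/my-old-index/_settings?pretty' -H 'Content-Type: application/json' -d'{ "index.blocks.write": false }'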

But is this really an issue? This is expected: the document already exists, so the create request will be rejected. You can just ignore those errors.

What does the rest of your output look like? Are you using data streams? Normal indices? Are your indices time-based?

Just saw this. The performance impact could be related to the volume of logs being written; I had a similar issue in the past.

One quick solution would be to simply change the Logstash log level for the Elasticsearch output, for example to only log at the ERROR or FATAL level. I'm not sure at which level the current create errors are logged.

Something like this:

curl -XPUT 'localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d'{ "logger.logstash.outputs.elasticsearch" : "FATAL"}'
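If I remember correctly, once the backlog has been processed you can revert the change with the logging reset endpoint; worth double-checking against the Logstash docs for your version:

curl -XPUT 'localhost:9600/_node/logging/reset?pretty'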

There are a lot of native Elastic Agent integrations that rely on using a custom _id to avoid duplicate events.

This id normally comes from a field in the original message, or from a fingerprint of one or more of its fields.

Using a custom _id value is how you avoid duplicate data in Elasticsearch; it is a common approach for deduplicating events.
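In Logstash this is typically done with the fingerprint filter. A minimal sketch, where the source fields and the [@metadata][newId] target are just assumptions based on the settings you shared:

filter {
  fingerprint {
    # assumption: these fields uniquely identify a log line in your data
    source => ["[host][name]", "[log][file][path]", "message"]
    concatenate_sources => true
    method => "SHA256"
    target => "[@metadata][newId]"
  }
}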

If these logs are spread across many files, you can set ignore_older to roughly the number of days since the registry was reset.

If a file is older than ignore_older, Filebeat will add it to its registry with the offset set to the end of the file, and afterwards you can simply revert or remove the ignore_older setting.
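In the Filebeat config that would look something like this, assuming a filestream input and that the registry was reset roughly two days ago (the id, path and duration are placeholders):

filebeat.inputs:
  - type: filestream
    id: enduser-logs            # placeholder input id
    paths:
      - /var/log/myapp/*.log    # placeholder log path
    ignore_older: 48h           # skip files not modified in the last 48 hours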

Similarly, you could just add a processor in Filebeat, or a filter in Logstash, that drops events older than the last message processed from that device before the registry reset.
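On the Logstash side that could be a small ruby filter; just a sketch, where the cutoff (the epoch seconds of the last event indexed before the reset) is a placeholder you would have to look up yourself:

filter {
  ruby {
    # 1714521600 is a placeholder for the epoch seconds of the last indexed event
    code => "event.cancel if event.get('@timestamp').to_i < 1714521600"
  }
}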


Obviously I stand corrected, thank you. On reflection, I have deleted the above comment as a) it was not helpful and b) it was also factually wrong. Apologies.