Avoid duplication

Hi there,

I have a lot of traffic logs (around 20 million per day) to index, and I would like suggestions on how to avoid duplication with this amount of data.

I was reading this article, but I don't have enough experience with ES to weigh the pros and cons.

Is creating a UUID for every log, to check whether it has already been indexed by ES, a good way to handle this with Logstash?

You can also have a look at this blog post on the topic. Setting a document id before indexing is a common way to avoid duplicates when using time-based indices.
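
As a rough sketch of that approach (the field name, key, hosts and index pattern below are just placeholders), you hash the event in the filter stage and use the hash as the document id, so re-ingesting the same event overwrites the existing document instead of creating a duplicate:

filter {
  fingerprint {
    source => "message"                    # hash the raw event
    target => "[@metadata][fingerprint]"   # keep the hash out of the indexed document
    method => "MD5"
    key => "changeme"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    document_id => "%{[@metadata][fingerprint]}"
  }
}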


Hey, @Christian_Dahlqvist
Thanks for the reply.

Would you do the same if you needed to avoid duplication with millions of documents per day?

Great blog post. I have a script that pulls down Cloudflare logs for a given time period in Unix time, but I still occasionally get duplicate entries. Being able to mitigate these duplicates would be fantastic.

Q1: You mention that having Elasticsearch handle identifier assignment is the most efficient, but then you describe a process where Logstash generates an identifier. Is this the only way to prevent duplication and have an event updated on ingest? Is it possible to use the pre-existing _id field?

Q2: If we generate a fingerprint the way your example suggests (off the message field), won't this only prevent duplication of events that contain exactly the same data, and still create duplicates instead of updates for events whose message field has changed?

What pre-existing _id field? As soon as you supply an ID, it has to be treated as a potential update, which is slower.

This assumes immutable data. If you have data changing, you probably need to specify the fields the hash should be based on and most likely not prefix it with a timestamp as that may change.
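
For example (a sketch, with hypothetical field names), you could base the hash only on the fields that identify the event and are never updated:

filter {
  fingerprint {
    # hash only the fields that uniquely identify the event,
    # not fields that may change later
    source => ["host", "request_id"]
    concatenate_sources => true
    target => "[@metadata][fingerprint]"
    method => "MD5"
    key => "changeme"
  }
}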


I have a few different indexes and they all display a field of some type called _id. Am I misunderstanding what this is?

That looks like the ID Elasticsearch assigns automatically, which is the default behaviour.


I added the fingerprinting to my config (both the filter and output stages). Logstash throws the following warning, but everything seems to be working with the pipeline:

[2018-11-08T14:26:07,085][WARN ][org.logstash.FieldReference] Detected ambiguous Field Reference `@metadata[tsprefix]`, which we expanded to the path `[@metadata, tsprefix]`; in a future release of Logstash, ambiguous Field References will not be expanded.

Q1: Does this mean that in my Logstash config I should change references in both the filter and output stages from @metadata[tsprefix] to [@metadata, tsprefix]?

Also, I did a test. My pipeline ingests data from static files. I am creating the fingerprint off of a field that is itself a unique ID (the Cloudflare RayID, a hex UID). The resulting ingested events have an _id value that is 40 characters long, twice as long as the previous values. I then took one of the files that had already been ingested, renamed it without altering the data, and put it back in the ingest folder. This resulted in a duplicate event being created.

Q2: Did I do something wrong? Does Elasticsearch need to be configured to check this field?

filter {
  fingerprint {
    source => "RayID"
    target => "[@metadata][fingerprint]"
    method => "MD5"
    key => "stlouisco"
  }
  ruby {
    code => "event.set('@metadata[tsprefix]', event.get('@timestamp').to_i.to_s(16))"
  }
}
output {
  elasticsearch {
    id => "Send to Elasticsearch"
    hosts => ["1.2.3.4:9200"]
    document_id => "%{[@metadata][tsprefix]}%{[@metadata][fingerprint]}"
    template_name => "cloudflare"
    index => "cloudflare-%{+YYYY.MM.dd}
  }
}

If you have a unique ID, why then not use that instead of creating a hash? Which version of Logstash are you using?

You mentioned improved performance through the use of timestamp prefixing, and I also don't know for certain that the RayID isn't eventually reused.

I modified the code and resolved the ambiguity warning that was occurring earlier. This also stopped duplicate events from being created. However, I am not seeing the events get updated if I change any field values and re-ingest the data.

filter {
  fingerprint {
    source => "RayID"
    target => "[@metadata][fingerprint]"
    method => "MD5"
    key => "stlouisco"
  }
  ruby {
    code => "event.set('[@metadata, tsprefix]', event.get('@timestamp').to_i.to_s(16))"
  }
}
output {
  elasticsearch {
    document_id => "%{[@metadata, tsprefix]}%{[@metadata][fingerprint]}"
  }
}

Logstash generates a current @timestamp value for each event that is read. In order to use a timestamp prefix here you will need a timestamp that is constant for the event in your data and you need to parse this with a date filter. Otherwise the timestamp will change every time the same event is read and all IDs will be unique, which results in them all being successfully indexed. This is shown in this gist.
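
For example (a sketch, assuming the data carries an ISO8601 timestamp in a hypothetical log_timestamp field), a date filter like this ties @timestamp to the event itself rather than to the time Logstash read it:

filter {
  date {
    # parse the timestamp embedded in the event so @timestamp
    # is the same every time the same event is read
    match => ["log_timestamp", "ISO8601"]
    target => "@timestamp"
  }
}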

As the same RayID will always hash to the same hash value, there is no point in using fingerprint here unless that ID is very long and using a hash would shorten it or eliminate illegal characters.
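
Sketching that suggestion with the fields from the config above, the output could build the document id from the RayID directly:

output {
  elasticsearch {
    document_id => "%{[@metadata][tsprefix]}%{RayID}"
  }
}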

:man_facepalming:

I changed the Ruby script to use the EdgeStartTimestamp field provided by the Cloudflare logs, which is in nanosecond format, and am now seeing event updates instead of duplications. Thanks for the assistance, @Christian_Dahlqvist.

filter {
  fingerprint {
    source => "RayID"
    target => "[@metadata][fingerprint]"
    method => "MD5"
    key => "test"
  }
  ruby {
    code => "event.set('[@metadata][tsprefix]', event.get('[EdgeStartTimestamp]').to_i.to_s(16))"
  }
}
output {
  elasticsearch {
    document_id => "%{[@metadata][tsprefix]}%{[@metadata][fingerprint]}"
  }
}

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.