Logstash->Elasticsearch document deduplication efficiency & optimization

Quick Summary: Looking for feedback on our deduplication approach for time-based event data: generating document IDs ourselves versus letting Elasticsearch generate them.

Data Overview: We are working with an event dataset (we ingest 20-50 million events/day; each event is small, up to ~20 fields). Events arrive asynchronously even though everything is time-based: data for August 1, for example, can arrive anywhere between Aug 2 and Aug 15. We write daily indexes (~10-20GB/day), so we need to keep each index open for two weeks before we can close it, make it read-only, and force_merge it to optimize storage/memory usage.
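For concreteness, the end-of-life step for each daily index looks roughly like this (a minimal sketch using the Python client; the index name and connection details are placeholders, not our real setup):

```python
from elasticsearch import Elasticsearch

# Minimal sketch of retiring a daily index once late events stop arriving:
# block writes, then force_merge down to a single segment.
# Index name and host below are placeholders.
es = Elasticsearch("http://localhost:9200")
index = "events-2020.08.01"

es.indices.put_settings(index=index, body={"index.blocks.write": True})
es.indices.forcemerge(index=index, max_num_segments=1)
```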

Challenge: The crux of our problem is that we want to store and deduplicate events efficiently; ~0.5-2% of our events are duplicates, so for 30M records on a given day, roughly 150K-600K are duplicates. Our priorities are (1) deduplicating the dataset, (2) efficient storage of our indices, (3) ingest time. Based on our research, we opted to generate a fingerprint (base64-encoded SHA1 hash) in Logstash and use it as the document ID (sketched below). We've started testing different fingerprint methods (SHA1, MURMUR3, MD5) to evaluate the impact on storage and compression. As a side note, we've already optimized storage via https://www.elastic.co/guide/en/elasticsearch/reference/7.10/tune-for-disk-usage.html and several other articles: we're using "codec": "best_compression" and we force_merge each index down to a single segment (we use only 1 shard).
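Here is a minimal Python sketch of the fingerprinting approach; the field names are placeholders for whatever fields define event identity in our schema:

```python
import base64
import hashlib

# Minimal sketch of the fingerprint-as-document-ID approach.
# The field names below are placeholders, not our real schema.
def fingerprint_id(event, key_fields=("timestamp", "host", "event_type")):
    material = "|".join(str(event[f]) for f in key_fields)
    digest = hashlib.sha1(material.encode("utf-8")).digest()  # 20 raw bytes
    return base64.b64encode(digest).decode()  # 28-character ID, e.g. "nM+b...SVM="

doc_id = fingerprint_id({
    "timestamp": "2020-08-01T00:00:00Z",
    "host": "app-01",
    "event_type": "click",
})
print(doc_id, len(doc_id))  # always 28 characters
```

Because duplicate events hash to the same ID, indexing a duplicate becomes an overwrite of the existing document rather than a new document.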

Unexplained Case: In testing, we found that when we let Elasticsearch generate the document ID, our index was roughly 10% smaller. This is very perplexing. We've read multiple blogs and other posts about efficient deduplication and storage, but we can't understand why the deduplicated index with fingerprint-generated IDs is larger than the index with Elastic-generated IDs, despite holding 1.8% fewer documents. Here are the details (all stats are post force_merge, 1 segment and 1 shard):

Elastic generated ID

  • documents: 35,971,639
  • storage size: 10,685MB
  • index time: 88min

SHA1 base64-encoded fingerprint document ID

  • documents: 35,351,999
  • storage size: 12,105MB
  • index time: 134min

Why does the SHA1 dedup method, with 1.8% fewer documents stored, take up 1,420MB (12%) more storage? Everything else is identical; the only differences are the document IDs and the fact that the fingerprint approach eliminates duplicates. What are we missing?

MURMUR3 generated ID

  • documents: 35,206,655
  • storage: 10,663MB (slightly smaller, but again we're missing records)
  • we can't use MURMUR3 because we'd lose data to hash collisions: 145,344 events lost (we would need the fingerprint plugin to implement a 128-bit MURMUR3 algorithm); see the collision estimate below
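That loss is almost exactly what the birthday bound predicts for a 32-bit hash over our ~35.35M unique events, so collisions alone account for the missing records:

```latex
n \approx 3.535 \times 10^{7}, \qquad
\mathbb{E}[\text{lost events}] \approx \frac{n^{2}}{2 \cdot 2^{32}}
  = \frac{(3.535 \times 10^{7})^{2}}{2^{33}} \approx 1.45 \times 10^{5}
```

which is within a few hundred of the 145,344 events actually lost.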

Example document IDs

Elastic-generated: 4os1nHUB9e1e2Zb3Ep6N
SHA-1:             nM+bkeQYucLeRAK1HzZJutt1SVM=
MD5:               3UVPog8tgsjEFsXFuoDqPw==
MURMUR3:           2583342839

Since each SHA-1 document ID is 8 bytes longer than the Elastic-generated ID (28 characters vs. 20), I can see the index being ~275MB larger (8 bytes × 36M documents), but that assumes the 619,640 duplicates in the index take up no space. That's what I'm most perplexed about: I would expect eliminating roughly 2% of the documents to make the index smaller, not larger.

Open Questions:

  1. Which document-ID approach should we use to optimize storage while still eliminating duplicates?
  2. We've considered deduplicating events in Databricks before sending them to Logstash, so we'd be guaranteed no duplicates and could use ES-generated IDs (a sketch follows below). Are there other suggestions we should consider?
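For reference on item 2, a minimal sketch of that upstream dedup step (paths and column names are placeholders):

```python
from pyspark.sql import SparkSession

# Minimal sketch of deduplicating upstream in Databricks/Spark so that
# Elasticsearch auto-generated IDs become safe to use downstream.
# Paths and column names are placeholders, not our real schema.
spark = SparkSession.builder.getOrCreate()

events = spark.read.json("/mnt/events/raw/2020-08-01/")
deduped = events.dropDuplicates(["timestamp", "host", "event_type"])
deduped.write.mode("overwrite").json("/mnt/events/deduped/2020-08-01/")
```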

Hi,
AFAIK, the IDs Elasticsearch generates share a common prefix, which allows them to compress better.
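Illustratively (a toy model, not Lucene's actual terms-dictionary encoding), IDs that share long prefixes compress far better than random hashes:

```python
import base64
import hashlib
import os
import time
import zlib

# Toy illustration: time-ordered, Flake-style IDs share long prefixes,
# while random SHA1 fingerprints do not, so the former compress much better.
def flake_like_id(seq):
    # Hypothetical stand-in for Elasticsearch's time-based auto IDs:
    # a millisecond timestamp prefix followed by a counter (15 bytes total,
    # which base64-encodes to 20 characters, like the real auto IDs).
    raw = int(time.time() * 1000).to_bytes(6, "big") + seq.to_bytes(9, "big")
    return base64.urlsafe_b64encode(raw).decode()

auto_ids = "".join(sorted(flake_like_id(i) for i in range(100_000)))
sha1_ids = "".join(sorted(
    base64.b64encode(hashlib.sha1(os.urandom(16)).digest()).decode()
    for _ in range(100_000)
))

print("auto-style IDs compress to:", len(zlib.compress(auto_ids.encode())))
print("SHA1 IDs compress to:      ", len(zlib.compress(sha1_ids.encode())))
```

Roughly the same effect applies inside the index: the _id values are stored sorted in the terms dictionary with shared-prefix compression, so time-ordered auto IDs take much less space than uniformly random hashes.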


Thanks for the feedback. I'm still a bit perplexed because, while I can see a small saving from that, I wouldn't expect it to outweigh the elimination of almost 2% of the documents in our index. It definitely makes a very strong case for using the Elastic-generated ID rather than our own.
