Quick Summary: Looking for feedback on our data deduplication approach for time-based event data: using self-generated fingerprint document IDs vs. letting Elasticsearch auto-generate document IDs.
Data Overview: We are working with an event dataset (we ingest 20-50 million events/day; each event is small, up to ~20 fields). Events arrive asynchronously even though everything is time-based; for example, data for August 1 can arrive anywhere between Aug 2 and Aug 15. We write daily indices (~10-20GB/day), so we need to keep each index open for 2 weeks before we can close it, make it read-only, and force_merge it to optimize storage/memory usage.
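The end-of-life steps for a daily index described above can be sketched as Elasticsearch API calls (the index name is hypothetical, and this shows one possible ordering, since a closed index can't be force-merged):

```
PUT events-2020.08.01/_block/write

POST events-2020.08.01/_forcemerge?max_num_segments=1

POST events-2020.08.01/_close
```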
Challenge: The crux of our problem is that we want to efficiently store and deduplicate events. Roughly 0.5-2% of our events are duplicates, so for 30M records on a given day, roughly 150K-600K are duplicates. Our priorities are (1) deduplicating the dataset, (2) efficient storage of our indices, (3) ingest time. Based on our research, we chose to generate a fingerprint (base64-encoded SHA1 hash) in Logstash and use it as the document ID. We've started testing different fingerprint methods (SHA1, MURMUR3, MD5) to evaluate the impact on storage and compression. As a side note, we've already optimized storage via https://www.elastic.co/guide/en/elasticsearch/reference/7.10/tune-for-disk-usage.html and several other articles: we set `"index.codec": "best_compression"` and we force_merge each index down to a single segment (using only 1 shard).
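For illustration, the fingerprint ID can be reproduced outside Logstash. This is a minimal Python sketch: the field names and the field-joining scheme here are assumptions, and the Logstash fingerprint filter has its own concatenation rules.

```python
import base64
import hashlib

def fingerprint(event: dict, fields: list[str]) -> str:
    """Compute a base64-encoded SHA1 fingerprint over selected event fields.

    NOTE: the field list and the "|" join are illustrative; the Logstash
    fingerprint filter serializes its source fields differently.
    """
    payload = "|".join(str(event[f]) for f in fields)
    digest = hashlib.sha1(payload.encode("utf-8")).digest()  # 20 raw bytes
    return base64.b64encode(digest).decode("ascii")          # 28-char string

doc_id = fingerprint(
    {"ts": "2020-08-01T00:00:00Z", "user": "u1", "value": 42},
    ["ts", "user", "value"],
)
print(len(doc_id))  # 28 characters, matching the SHA-1 example ID below
```

Because the same event always hashes to the same ID, a duplicate arriving later overwrites (rather than duplicates) the original document.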
Unexplained Case: In testing, we found that when we let Elasticsearch generate the document ID, the index was roughly 10% smaller. This is very perplexing. We've read multiple blogs and other posts about efficient deduplication and storage, but we can't understand why the index with Elastic-generated IDs is 10% smaller than the deduplicated index with fingerprint-generated IDs, even though the deduplicated index has 1.8% fewer documents. Here are the details (all stats are post force_merge to 1 segment, 1 shard):
Elastic generated ID
- documents: 35,971,639
- storage size: 10,685MB
- index time: 88min
SHA1 base64-encoded fingerprint document ID
- documents: 35,351,999
- storage size: 12,105MB
- index time: 134min
Why does the SHA1 hash dedup method, with 1.8% fewer documents stored, take up 1,420MB (~12%) more storage? Everything else is identical apart from the document IDs and our fingerprint-based approach eliminating duplicates. What are we missing?
MURMUR3 generated ID
- documents: 35,206,655
- storage: 10,663MB (slightly smaller, but again we're missing records)
- we can't use 32-bit MURMUR3 because we'd be losing data to hash collisions: 145,344 events were lost (the fingerprint plugin would need to implement a 128-bit MURMUR3 algorithm)
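That loss is consistent with the birthday bound for a 32-bit hash. A quick back-of-the-envelope check, using only the document counts already quoted above:

```python
# Birthday approximation: among n unique keys hashed into m buckets,
# the expected number of keys lost to collisions is roughly n^2 / (2m).
n = 35_351_999        # unique events (the SHA1-deduplicated document count)
m = 2 ** 32           # output space of a 32-bit MURMUR3 hash
expected_lost = n * n / (2 * m)
print(f"{expected_lost:,.0f}")  # ~145k, close to the 145,344 events observed lost
```

So the observed loss is almost exactly what a 32-bit hash predicts at this volume; a 128-bit hash would make the expected loss negligible.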
Example document IDs
- Elastic-generated: 4os1nHUB9e1e2Zb3Ep6N
- SHA-1: nM+bkeQYucLeRAK1HzZJutt1SVM=
- MD5: 3UVPog8tgsjEFsXFuoDqPw==
- MURMUR3: 2583342839
Since each SHA-1 document ID has 4 more bytes than the Elastic-generated ID, I can see the index being ~137MB larger (4 bytes × 36M documents), but that assumes the 619,640 duplicates in the index take up no space. That's what I'm most perplexed about: I would expect eliminating roughly 2% of the documents to make the index smaller.
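Measuring the example IDs above directly (string length and decoded raw-byte length; note this only measures the ID strings themselves, and how Lucene actually encodes `_id` terms on disk may differ):

```python
import base64

es_auto_id = "4os1nHUB9e1e2Zb3Ep6N"        # Elastic-generated example from above
sha1_id = "nM+bkeQYucLeRAK1HzZJutt1SVM="   # SHA-1 fingerprint example from above

# As base64 strings: 20 vs 28 characters.
print(len(es_auto_id), len(sha1_id))
# Decoded to raw bytes: 15 (ES auto-IDs encode a 15-byte time-based ID)
# vs 20 (a full SHA-1 digest).
print(len(base64.b64decode(es_auto_id)), len(base64.b64decode(sha1_id)))
```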
- What approach for document ID should we use to optimize storage and eliminate duplicates?
- We've considered deduplicating events in Databricks before sending them to Logstash, so that we're guaranteed no duplicates and can use ES-generated IDs. Are there other suggestions we should consider?