Preserving uniqueness while supporting a large amount of data

Hi community,

I have a few TB of data (a few billion documents) and I want to store them in Elasticsearch in a way that scales, while preserving the uniqueness of each document. I achieve uniqueness by generating the _id of each document myself, by computing a hash of certain document fields.
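For context, this is roughly how I build the _id and ingest the documents (a minimal sketch with the Python client; the endpoint, index name, and field names are placeholders, not my real setup):

```python
import hashlib

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint


def doc_id(doc, key_fields=("field_a", "field_b")):
    """Deterministic _id: a hash of the fields that define uniqueness."""
    raw = "|".join(str(doc[f]) for f in key_fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def actions(docs, index="my-index"):
    for doc in docs:
        yield {
            "_op_type": "create",   # reject the document if this _id already exists
            "_index": index,
            "_id": doc_id(doc),
            "_source": doc,
        }


sample_docs = [
    {"field_a": "x", "field_b": 1, "payload": "..."},
    {"field_a": "x", "field_b": 1, "payload": "same key, should be rejected"},
]

# raise_on_error=False so duplicate-_id (409) failures don't abort the whole run
helpers.bulk(es, actions(sample_docs), raise_on_error=False)
```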

I have tried a few approaches in Elasticsearch, but each has issues:

  1. A single index with a large number of primary shards (I tried 48) preserves uniqueness, since all documents live in the same index. However, after pushing around 1 TB of data, extracting and manipulating the data takes too much time. (The index settings for this approach are sketched after this list.)
  2. ILM with data streams handles the growing data volume, but does not preserve uniqueness, because documents are spread across different backing indices and _id is only unique within a single index.
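For reference, the single-index setup from option 1 is essentially the following (again a sketch, assuming the 8.x Python client; the index name and replica count are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# One big index: a single uniqueness scope, since _id is unique per index,
# but every query has to fan out to all 48 shards as the index grows.
es.indices.create(
    index="my-index",  # placeholder name
    settings={
        "number_of_shards": 48,
        "number_of_replicas": 1,
    },
)
```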

I would love your help, thanks in advance!