Querying differences between indexes

Hi. I have a main ES data set (spanning a number of indices) of approximately 5 billion entries. For the purpose of this question, let's say the data consists of username, email, first name, last name (u, e, fn, ln).

I now have another very large dataset (2 billion entries) that I want to compare against my original set to find out how many are duplicates and how many are original.

The comparison is ONLY looking at username and email to see if there is a match. The other fields are ignored. As of now, I am literally going line by line through the new data, checking whether there is already a match in my existing data set.

My first approach was to do this via the command line, reading X rows at a time and feeding them to ES via the Multi Search API. That isn't working very well.
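For readers following along, here is a minimal sketch of what such a batched lookup could look like. It only builds the NDJSON request body for the `_msearch` endpoint and does not send it. The index pattern `main-*`, the field names `username` and `email`, and the helper names are illustrative assumptions, as is the reading that a "match" means both fields are equal and are mapped as exact-value (keyword) fields.

```python
import json
from itertools import islice


def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_msearch_body(rows, index="main-*"):
    """Build an NDJSON body for _msearch: one header line and one query line per row.

    Assumes `rows` holds (username, email) pairs and that both fields
    are keyword-mapped so that `term` queries match exact values.
    """
    lines = []
    for username, email in rows:
        # Header line: which index (or pattern) this search targets.
        lines.append(json.dumps({"index": index}))
        # Query line: an existence check only, so no documents are returned.
        lines.append(json.dumps({
            "size": 0,
            "terminate_after": 1,
            "query": {"bool": {"filter": [
                {"term": {"username": username}},
                {"term": {"email": email}},
            ]}},
        }))
    # The body of an _msearch request must end with a newline.
    return "\n".join(lines) + "\n"
```

Each chunk becomes one request, and a response whose total hit count is greater than zero marks that row as a duplicate. Even with batching this is still one search per new row, which is likely why it scales poorly at 2 billion rows.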

Instead, what if all of the new data (2 billion entries) were loaded into a few temporary indices? Is there a way to query the new data against the old data and somehow flag the entries that are unique (or duplicate)?


Elasticsearch cannot do this natively. For better resiliency and scalability, I suggest that you implement this operation with Apache Spark using the es-hadoop connector.
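A minimal PySpark sketch of that idea, under several assumptions: the es-hadoop connector jar is on Spark's classpath, the cluster is reachable at `localhost:9200`, the old data lives under `main-*` and the new data has been loaded under `incoming-*`, and the fields are named `username` and `email`. The index patterns, field names, and helper names are all hypothetical. The comparison is expressed as a left anti join on the two key fields.

```python
def es_read_options(nodes="localhost", port="9200"):
    """Connector settings; only the two key fields are read from Elasticsearch."""
    return {
        "es.nodes": nodes,
        "es.port": port,
        "es.read.field.include": "username,email",
    }


def count_new_vs_duplicate(spark, old_index="main-*", new_index="incoming-*"):
    """Count rows of the new data that do or do not already exist in the old data.

    `spark` is an existing SparkSession. The index patterns are placeholders.
    """
    keys = ["username", "email"]

    def load(index):
        # "org.elasticsearch.spark.sql" is the es-hadoop Spark SQL data source.
        reader = spark.read.format("org.elasticsearch.spark.sql")
        return reader.options(**es_read_options()).load(index).select(*keys)

    old_keys = load(old_index).distinct()
    new_rows = load(new_index)

    # A left anti join keeps only the new rows with no (username, email) match
    # in the old data. Rows with a null key never match and count as unique.
    unique = new_rows.join(old_keys, on=keys, how="left_anti")

    total = new_rows.count()
    unique_count = unique.count()
    return {"unique": unique_count, "duplicate": total - unique_count}
```

To flag rather than count, the `unique` DataFrame could be written back to a separate index or to files. Since the new data only needs to be readable by Spark, it may not have to be indexed into Elasticsearch at all before the comparison.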
