Querying differences between indexes

VinnyT · March 16, 2018, 9:34pm

Hi. I have a main ES data set (spanning a number of indices) of approximately 5 billion entries. For the purpose of this question, let's say the data consists of username, email, first name, and last name (u, e, fn, ln).

I now have another very large dataset (2 billion entries) that I want to compare against my original set to find out how many are duplicates and how many are original.

The comparison looks ONLY at username and email to decide whether there is a match. The other fields are ignored. As of now, I am literally going line by line through the new data, checking whether a match already exists in my existing data set.

My first approach was to do this via the command line, reading X rows at a time and feeding them to ES via the Multi Search API. That isn't working very well.
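For reference, a batch of this kind could be assembled roughly as follows. This is only a sketch of the approach described above, not the poster's actual script: the field names `username` and `email`, the index pattern `main-*`, and both helper functions are assumptions, and the `term` queries assume the two fields are mapped as `keyword`.

```python
import json


def build_msearch_body(rows, index="main-*"):
    """Build the NDJSON body for one _msearch request.

    rows: iterable of (username, email) pairs from the new data set.
    index: hypothetical index pattern covering the existing data.
    """
    lines = []
    for username, email in rows:
        # Header line: which index (or pattern) this search targets.
        lines.append(json.dumps({"index": index}))
        # Body line: exact match on both fields; size 0 because only
        # the hit count matters, not the documents themselves.
        lines.append(json.dumps({
            "size": 0,
            "query": {"bool": {"filter": [
                {"term": {"username": username}},
                {"term": {"email": email}},
            ]}},
        }))
    # The Multi Search API requires a trailing newline.
    return "\n".join(lines) + "\n"


def is_duplicate(response):
    """Return True if one entry of the _msearch 'responses' array has hits."""
    total = response["hits"]["total"]
    # Newer Elasticsearch versions report the total as an object.
    if isinstance(total, dict):
        total = total["value"]
    return total > 0
```

The resulting string would be POSTed to `/_msearch` with the `Content-Type: application/x-ndjson` header, and each element of the returned `responses` array checked with `is_duplicate`. At 2 billion rows this still means 2 billion individual searches, which is likely why the approach scales poorly.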

Instead, what if all of the new data (2 billion entries) were loaded into a few temporary indices? Is there a way to query the new data against the old data and somehow flag the entries that are unique (or duplicate)?

Thanks!

thiago · March 20, 2018, 3:04am

Elasticsearch cannot do this natively. For better resiliency and scalability, I suggest implementing this operation with Apache Spark using the es-hadoop connector.
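A minimal PySpark sketch of that suggestion might look like the following. It is an illustration under stated assumptions rather than a tested job: the index patterns `main-*` and `incoming-*`, the node address, the field names, and the helper functions are all hypothetical, and the es-hadoop jar matching your Spark and Elasticsearch versions must be on the Spark classpath.

```python
KEYS = ["username", "email"]  # assumed names of the comparison fields


def read_keys(spark, index_pattern, es_nodes="localhost:9200"):
    """Load only the comparison fields from Elasticsearch via es-hadoop."""
    return (
        spark.read.format("org.elasticsearch.spark.sql")
        .option("es.nodes", es_nodes)
        # Ask the connector to fetch just the fields we compare on.
        .option("es.read.field.include", ",".join(KEYS))
        .load(index_pattern)
        .select(*KEYS)
    )


def split_new_entries(new_df, old_df):
    """Return (unique, duplicate) rows of new_df relative to old_df."""
    old_keys = old_df.select(*KEYS).dropDuplicates()
    # left_anti keeps new rows with no matching key in the old data.
    unique = new_df.join(old_keys, on=KEYS, how="left_anti")
    # left_semi keeps new rows whose key does exist in the old data.
    duplicate = new_df.join(old_keys, on=KEYS, how="left_semi")
    return unique, duplicate


def main():
    # Imported here so the helpers above can be loaded without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("index-diff").getOrCreate()
    old_df = read_keys(spark, "main-*")      # hypothetical existing indices
    new_df = read_keys(spark, "incoming-*")  # hypothetical temporary indices
    unique, duplicate = split_new_entries(new_df, old_df)
    print("unique:", unique.count(), "duplicate:", duplicate.count())
```

Calling `main()` from a script submitted with `spark-submit` would print the two counts. The join runs as a distributed shuffle across the cluster instead of billions of individual lookups against Elasticsearch. Note that rows with a null username or email never match in an equality join, so they would be counted as unique.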

system · April 17, 2018, 3:05am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
