Deduplicating data in ElasticSearch

Rafay_Khan · August 8, 2017, 1:24pm

I am facing a challenge where I need to reduce my indexed data by clustering similar items into one, which is essentially deduplicating the data. For example:
    #  | Product Name        | Category   | Product Group       | Attributes
    ---+---------------------+------------+---------------------+---------------------------
    1  | Board Marker - Blue | Stationary | White Board Markers | Color = Blue, Type = Board
    2  | Blue Board Marker   | Stationary | White Board Markers | Color = Blue, Type = Board

In the new index, I need to store the data as:
    #  | Product Name        | Category   | Product Group       |
    ---+---------------------+------------+---------------------+----
    1  | Board Marker - Blue | Stationary | White Board Markers | ...

I did see @nfantone's similar question, "Clustering data on Elasticsearch index". One of the replies there mentioned:

There is a proposal for a fingerprinting ingest processor here: https://github.com/elastic/elasticsearch/issues/1693811 that you might find interesting as well.

Secondly, @Mark_Harwood explained:

Automate some batch data de-duplication process with minimal human intervention.

Requires more rigour and emphasises precision over recall. Each merge operation has to be something that can be trusted for the algorithm to iterate on without human intervention. A person cannot be allowed to over-link and become his brother then their father etc through a steady accumulation of weakly linked properties.
Large-scale iterative entity resolution can be achieved but uses non-fuzzy keys and lots of different ways of composing the keys. I demoed this 37 minutes into this presentation at elasticon: https://www.elastic.co/elasticon/conf/2016/sf/graph-capabilities-in-the-elastic-stack3

Can anyone explain how to go about either of these: using the fingerprint feature, which seems to have matured since that question was posted, or implementing Mark_Harwood's approach? (I have seen the linked presentation.)
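
For what it's worth, my rough understanding of the "non-fuzzy keys" idea is something like the sketch below, in plain Python with no Elasticsearch involved. The field names (`name`, `category`, `group`), the normalization rules, and the third product are my own assumptions for illustration. This is not the behaviour of the proposed ingest processor or of the method shown in the presentation.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(product):
    """Build an exact-match key from normalized fields (hypothetical scheme)."""
    # Lowercase the name, keep alphanumeric tokens, then dedupe and sort them
    # so that word order and punctuation no longer matter.
    tokens = sorted(set(re.findall(r"[a-z0-9]+", product["name"].lower())))
    parts = [
        " ".join(tokens),
        product["category"].strip().lower(),
        product["group"].strip().lower(),
    ]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def group_duplicates(products):
    """Group products whose fingerprints are identical."""
    groups = defaultdict(list)
    for product in products:
        groups[fingerprint(product)].append(product)
    return list(groups.values())


products = [
    {"name": "Board Marker - Blue", "category": "Stationary", "group": "White Board Markers"},
    {"name": "Blue Board Marker", "category": "Stationary", "group": "White Board Markers"},
    # Hypothetical extra row, not from my real data, to show a non-duplicate.
    {"name": "Red Board Marker", "category": "Stationary", "group": "White Board Markers"},
]

groups = group_duplicates(products)
for members in groups:
    print([p["name"] for p in members])
```

With keys like this, the two blue markers end up in one group because their sorted token sets are identical, while any spelling difference would split them. I assume that is why several different key compositions are needed in practice.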

TL;DR: Find items that are at least 90% similar on properties such as Product Name, Category, Product Group, and (some) Attributes, and generate a new index to store them. I am new to Elasticsearch and a noob at this.
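
To make "90% similar" concrete, here is a minimal sketch of one possible definition: the average token-level Jaccard similarity across the fields, compared against a 0.9 threshold. The choice of Jaccard, the equal field weights, the field names, and the third product are all assumptions on my part. It compares every pair of items, so on a real index it would presumably need some pre-grouping (for example by Category) first.

```python
import re
from itertools import combinations

# Hypothetical field names for the columns in my example table.
FIELDS = ("name", "category", "group", "attributes")


def tokens(text):
    """Lowercased alphanumeric tokens as a set (order and punctuation ignored)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two strings' token sets, between 0.0 and 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity(p, q, fields=FIELDS):
    """Unweighted mean of the per-field Jaccard similarities."""
    return sum(jaccard(p[f], q[f]) for f in fields) / len(fields)


def similar_pairs(products, threshold=0.9):
    """Index pairs (i, j) whose similarity is at least the threshold."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(products), 2)
        if similarity(p, q) >= threshold
    ]


products = [
    {"name": "Board Marker - Blue", "category": "Stationary",
     "group": "White Board Markers", "attributes": "Color = Blue, Type = Board"},
    {"name": "Blue Board Marker", "category": "Stationary",
     "group": "White Board Markers", "attributes": "Color = Blue, Type = Board"},
    # Hypothetical extra row, not from my real data, to show a non-match.
    {"name": "Permanent Marker - Black", "category": "Stationary",
     "group": "Permanent Markers", "attributes": "Color = Black, Type = Permanent"},
]

print(similar_pairs(products))
```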

Rafay_Khan · August 15, 2017, 1:52pm

Any suggestions?

system · September 12, 2017, 1:53pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
