Deduplicating data in Elasticsearch


(Rafay Khan) #1

I am facing a challenge where I need to reduce my indexed data by clustering similar items into one; in effect, deduplicating the data. For example:

    #  | Product Name        | Category   | Product Group       | Attributes
    ---+---------------------+------------+---------------------+---------------------------
    1  | Board Marker - Blue | Stationary | White Board Markers | Color = Blue, Type = Board
    2  | Blue Board Marker   | Stationary | White Board Markers | Color = Blue, Type = Board

In the new index, I need to store the data as:

    #  | Product Name        | Category   | Product Group       |
    ---+---------------------+------------+---------------------+----
    1  | Board Marker - Blue | Stationary | White Board Markers | ...

I did see @nfantone's similar question, "Clustering data on Elasticsearch index". Some of the replies there mentioned the following:

There is a proposal for a fingerprinting ingest processor here: https://github.com/elastic/elasticsearch/issues/16938 that you might find interesting as well.

Secondly, @Mark_Harwood explained:

Automate some batch data de-duplication process with minimal human intervention.

Requires more rigour and emphasises precision over recall. Each merge operation has to be something that can be trusted for the algorithm to iterate on without human intervention. A person cannot be allowed to over-link and become his brother then their father etc through a steady accumulation of weakly linked properties.
Large-scale iterative entity resolution can be achieved but uses non-fuzzy keys and lots of different ways of composing the keys. I demoed this 37 minutes into this presentation at elasticon: https://www.elastic.co/elasticon/conf/2016/sf/graph-capabilities-in-the-elastic-stack
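
To make the "non-fuzzy keys composed in different ways" idea concrete, here is a minimal Python sketch. It is not code from the presentation; the field names (`name`, `category`, `group`), the two key recipes, and the union-find merge are all assumptions chosen for illustration. Each record produces several exact keys, and records that share any one key are linked into the same group.

```python
import re
from collections import defaultdict

def tokens(text):
    # Lowercase, keep alphanumeric tokens, sort so word order does not matter.
    return sorted(re.findall(r"[a-z0-9]+", text.lower()))

def keys_for(rec):
    # Hypothetical key recipes: each one is an exact (non-fuzzy) composite key.
    name = " ".join(tokens(rec["name"]))
    return {
        ("name+category", name, rec["category"].lower()),
        ("name+group", name, rec["group"].lower()),
    }

def link(records):
    # Union-find over record positions: sharing any key merges two records.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}
    for i, rec in enumerate(records):
        for key in keys_for(rec):
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])
            else:
                first_seen[key] = i

    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())

records = [
    {"name": "Board Marker - Blue", "category": "Stationary", "group": "White Board Markers"},
    {"name": "Blue Board Marker", "category": "Stationary", "group": "White Board Markers"},
    {"name": "Red Board Marker", "category": "Stationary", "group": "White Board Markers"},
]
print(link(records))
```

With the sample rows, the two blue markers end up in one group and the red marker stays on its own. Because every key is an exact match, a merge only happens when a whole composite key agrees, which is what keeps the linking conservative.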

Can anyone explain how to go about either of these? The fingerprint feature seems to have matured since that question was posted, and I would also like to implement Mark_Harwood's approach (I have watched the linked presentation).
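
For reference, this is roughly what I understand by "fingerprinting", written as a client-side Python sketch rather than with any Elasticsearch feature. The field names and the normalization rules (lowercase, strip punctuation, de-duplicate and sort tokens, then hash) are my assumptions, not something taken from the linked issue.

```python
import hashlib
import re

def normalize(text):
    # Lowercase, drop punctuation, de-duplicate and sort the tokens.
    toks = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(set(toks)))

def fingerprint(doc, fields=("name", "category", "group")):
    # Join the normalized fields and hash them into a stable key.
    joined = "|".join(normalize(doc[f]) for f in fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

def dedupe(docs):
    # Keep the first document seen for each fingerprint.
    unique = {}
    for doc in docs:
        unique.setdefault(fingerprint(doc), doc)
    return unique

docs = [
    {"name": "Board Marker - Blue", "category": "Stationary", "group": "White Board Markers"},
    {"name": "Blue Board Marker", "category": "Stationary", "group": "White Board Markers"},
]
for fp, doc in dedupe(docs).items():
    print(fp, doc["name"])
```

If the fingerprint were used as the document `_id` when writing to the new index, documents with the same fingerprint would land on the same `_id`, so only one copy would survive. That only catches items that are identical after normalization, not ones that are merely similar.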

TL;DR: Find items that are at least 90% similar on properties such as Product Name, Category, Product Group, and (some) Attributes, and generate a new index to store them. I am new to Elasticsearch and a beginner at this.
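
To pin down what I mean by "90% similar", here is a small Python sketch of one possible definition. The measure (token-set Jaccard similarity averaged over the fields), the field names, and the 0.9 threshold are all assumptions; other measures would be just as valid.

```python
import re
from itertools import combinations

FIELDS = ("name", "category", "group")  # hypothetical field names

def token_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    # Size of the token intersection divided by size of the token union.
    sa, sb = token_set(a), token_set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def similarity(x, y, fields=FIELDS):
    # Unweighted average of the per-field Jaccard scores.
    return sum(jaccard(x[f], y[f]) for f in fields) / len(fields)

def similar_pairs(docs, threshold=0.9):
    # Compares every pair, so this is quadratic in the number of documents.
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(docs), 2)
        if similarity(a, b) >= threshold
    ]

docs = [
    {"name": "Board Marker - Blue", "category": "Stationary", "group": "White Board Markers"},
    {"name": "Blue Board Marker", "category": "Stationary", "group": "White Board Markers"},
    {"name": "Red Board Marker", "category": "Stationary", "group": "White Board Markers"},
]
print(similar_pairs(docs))
```

Here the two blue markers score 1.0 and pair up, while the red marker scores about 0.83 against either of them and falls below the threshold. Comparing all pairs will not scale to a large index, so in practice the candidates would first need to be narrowed down, for example by only comparing items within the same Category.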


(Rafay Khan) #2

Any suggestions?

