Probabilistic Record Linkage using Elastic Search


(SAURAV PAUL) #1

Hi,

I would like to know how to use Elasticsearch to implement probabilistic record linkage across multiple data sets.
Do I need to pre-process the data from the multiple sources first and then push it into ES, or should I load the data directly into ES?


(Jörg Prante) #2

You can load the data sets and then you can use fuzzy matching and/or implement function score scripts. See for example https://github.com/YannBrrd/elasticsearch-entity-resolution
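
As a rough sketch of the fuzzy-matching approach (this is not taken from the plugin above; the field names `name`, `city`, `phone` and the helper `build_linkage_query` are hypothetical), a candidate-lookup query body for one record could be built like this in Python:

```python
import json


def build_linkage_query(record, fields, min_should_match=1):
    """Build a bool query that fuzzily matches each non-empty field of a record.

    `record` is a dict from one data set; `fields` are the attributes
    to compare against documents indexed from the other data set.
    """
    should = []
    for field in fields:
        value = record.get(field)
        if not value:
            continue  # skip missing or empty attributes
        should.append({
            "match": {
                field: {
                    "query": value,
                    "fuzziness": "AUTO",  # edit distance chosen from term length
                }
            }
        })
    return {
        "query": {
            "bool": {
                "should": should,
                "minimum_should_match": min_should_match,
            }
        }
    }


# Hypothetical record from the second data set.
record = {"name": "Jon Smith", "city": "Berlin", "phone": ""}
query = build_linkage_query(record, ["name", "city", "phone"])
print(json.dumps(query, indent=2))
```

The resulting body can be sent to the `_search` endpoint of the index that holds the other data set. Note that the `_score` of each hit is a relevance score, not a calibrated match probability, so a threshold or a function score script would still be needed to decide what counts as a link.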


(SAURAV PAUL) #3

That plugin, which internally uses Duke, doesn't support 1.7.1. It only supports versions up to 1.4.1.


(Jörg Prante) #4

Did you try it?

It compiles against Elasticsearch 1.7.1 without any changes, so it will work.


(SAURAV PAUL) #5

Yup, I tried. It compiles against 1.7.1 only if we skip the test cases.


(SAURAV PAUL) #6

I took the source code and tried to build it after only upgrading the ES version. If I use mvn clean install -DskipTests=true, it compiles the code and generates the jar. However, if I use plain mvn clean install, the build fails with a NoClassDef exception thrown from a test class.


(Jörg Prante) #7

The tests must not fail. You have to update to Lucene 4.10.4 for ES 1.7.1. Edit pom.xml to set the versions:

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <elasticsearch.version>1.7.1</elasticsearch.version>
        <lucene.version>4.10.4</lucene.version>
        <duke.version>1.2</duke.version>
    </properties>

(SAURAV PAUL) #8

Thank you. I already did that and it works now. Now the question is, how should I approach the problem in ES? I have two data sets and I have to perform record linkage across them. Do I need to index both data sets? I am confused about how this entity-resolution plugin works with multiple data sets.

