Probabilistic Record Linkage using Elasticsearch

Hi,

How do I use Elasticsearch to implement probabilistic record linkage across multiple data sets?
Do I need to pre-process the data from the multiple sources first and then push it into ES, or should I load the data directly into ES?

You can load the data sets and then you can use fuzzy matching and/or implement function score scripts. See for example https://github.com/YannBrrd/elasticsearch-entity-resolution
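As a rough illustration of the fuzzy matching part, here is a minimal Python sketch that builds a search body for one record. It assumes one data set is already indexed and that each record of the other data set is turned into a query against it. The function name build_linkage_query and the field names (name, city, phone) are made up for the example; only the bool and match queries with the fuzziness option are standard Elasticsearch query DSL.

```python
import json


def build_linkage_query(record, fields, fuzziness="AUTO", min_should_match=1):
    """Build a bool query that fuzzily matches each non-empty field of a record.

    Hypothetical helper for illustration; not part of any plugin.
    """
    should = []
    for field in fields:
        value = record.get(field)
        if not value:
            # Skip missing or empty fields so they do not produce empty clauses.
            continue
        should.append({
            "match": {
                field: {"query": value, "fuzziness": fuzziness}
            }
        })
    return {
        "query": {
            "bool": {
                "should": should,
                "minimum_should_match": min_should_match,
            }
        }
    }


# Example record from the second data set (made-up values).
record = {"name": "Jon Smith", "city": "Oslo", "phone": ""}
query = build_linkage_query(record, ["name", "city", "phone"])
print(json.dumps(query, indent=2))
```

The resulting body would be sent to the _search endpoint of the index holding the other data set, and the top hits treated as candidate links. Note that _score is a relevance score, not a calibrated match probability, so a threshold would have to be chosen empirically.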

That plugin, which internally uses Duke, doesn't support 1.7.1. It only supports Elasticsearch up to 1.4.1.

Did you try it?

It compiles against Elasticsearch 1.7.1 without any changes, so it will work.

Yes, I tried. It compiles against 1.7.1 only if we skip the test cases.

I took the source code and tried to build it after upgrading only the ES version. With mvn clean install -DskipTests=true the code compiles and the jar is generated. With plain mvn clean install the build fails: a test class throws a NoClassDef exception.

The tests must not fail. You have to update to Lucene 4.10.4 for ES 1.7.1. Edit pom.xml to set the versions:

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <elasticsearch.version>1.7.1</elasticsearch.version>
        <lucene.version>4.10.4</lucene.version>
        <duke.version>1.2</duke.version>
    </properties>

Thank you. I had already done that, and it works now. The question now is: how should I approach the problem in ES? I have two data sets and need to perform record linkage across them. Do I need to index both data sets? I am confused about how this entity-resolution plugin works with multiple data sets.