Probabilistic Record Linkage using Elasticsearch



How do I use Elasticsearch to implement probabilistic record linkage across multiple data sets?
Do I need to pre-process the data from the multiple sources before pushing it into ES, or should I load the data into ES directly?

(Jörg Prante) #2

You can load the data sets and then you can use fuzzy matching and/or implement function score scripts. See for example
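To illustrate the fuzzy-matching part of this suggestion, here is a minimal sketch (not taken from the plugin) that builds a search request body for finding candidate matches for one record. It assumes one data set is already indexed and that the field names `first_name`, `last_name`, and `city` exist in the mapping; those names, the helper `build_candidate_query`, and the result size are hypothetical. It only uses the standard `bool` and `match` queries with the `fuzziness` option, and it builds the JSON body without sending it anywhere.

```python
import json


def build_candidate_query(record, fields, fuzziness="AUTO", size=5):
    """Build a search body with one fuzzy match clause per non-empty field.

    Field names are assumptions for illustration; adapt them to your mapping.
    """
    should = [
        {"match": {field: {"query": record[field], "fuzziness": fuzziness}}}
        for field in fields
        if record.get(field)  # skip missing or empty values
    ]
    return {
        "query": {
            "bool": {
                "should": should,
                # at least one field has to match for a hit to be returned
                "minimum_should_match": 1,
            }
        },
        "size": size,
    }


# A record from the second (not yet linked) data set; values are made up.
record = {"first_name": "Jon", "last_name": "Smyth", "city": ""}
body = build_candidate_query(record, ["first_name", "last_name", "city"])
print(json.dumps(body, indent=2))
```

The body could then be sent to the `_search` endpoint of the index that holds the other data set. The hits returned this way are only candidates; deciding which of them count as links (for example by a score threshold or a function score script) is a separate step that this sketch does not cover.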


The plugin, which internally uses Duke, doesn't support ES 1.7.1. It only supports versions up to 1.4.1.

(Jörg Prante) #4

Did you try it?

It compiles against Elasticsearch 1.7.1 without any changes, so it will work.


Yup, I tried it. It compiles against 1.7.1 only if we skip the test cases.


I took the source code and tried to build it, changing only the ES version. With mvn clean install -DskipTests=true the code compiles and the jar is generated. With plain mvn clean install the build fails with a NoClassDef exception thrown from a test class.

(Jörg Prante) #7

The tests must not fail. You have to update to Lucene 4.10.4 for ES 1.7.1. Edit pom.xml to set the versions:



Thank you. I already did that, and it works now. The question now is how I should approach the problem in ES. I have two data sets and have to perform record linkage across them. Do I need to index both data sets? I am confused about how this entity-resolution plug-in works with multiple data sets.
