Creating an MDM

We are using Elasticsearch and have around 500,000,000 person records. I'm looking at creating an MDM on persons. I'm thinking Java, but I'm worried about performance. I'm happy to get my head around Spark or use MapReduce. Any tips, or has anyone else done a similar project? Appreciate your help.

What's MDM?

A Master Data Management system. Basically doing entity resolution on a
person. E.g. if the name is John Smith, is it the same John Smith as another
one in the corpus? Based on DOB, passport, address, or some other metadata.
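To make that concrete, here's a minimal sketch (in Python, with made-up field names and rules) of the kind of decision entity resolution has to make: two "John Smith" records only merge when a strong identifier backs up the name match.

```python
# Hypothetical matching rule: a shared passport is conclusive;
# otherwise require the same name AND the same DOB.
def same_person(a: dict, b: dict) -> bool:
    """Return True if two person records likely refer to the same entity."""
    if a.get("passport") and a.get("passport") == b.get("passport"):
        return True
    return (a.get("name") == b.get("name")
            and a.get("dob") is not None
            and a.get("dob") == b.get("dob"))

rec1 = {"name": "John Smith", "dob": "1980-03-14", "passport": "P123"}
rec2 = {"name": "John Smith", "dob": "1975-07-01", "passport": "P999"}
rec3 = {"name": "John Smith", "dob": "1980-03-14", "passport": None}
```

Same name throughout, but only `rec1` and `rec3` resolve to the same person (matching DOB); `rec2` is a different John Smith.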


Ahh ok. Graph can totally do that for you!

Check out or
There are also a bunch of great blog posts that explain it.

Champion, I'll have a look.

Hi David,

I'm looking to implement something similar. I'd be grateful for your experience on this project and any pitfalls to avoid.


Hi Ryan
We have just started using Graph. The issue is scale, because we then need a second pass using MapReduce and string comparators. We tried Spark with Scala, but the recursive nature of the comparison killed us.
I'll let you know more in a couple of weeks. Appreciate any feedback on your findings
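For anyone following along, the "string comparators" in that second pass could look something like this (a sketch using Python's stdlib `difflib`; real pipelines typically use Jaro-Winkler or similar, and the 0.85 threshold is an assumption you'd tune):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pairs that survive a cheap blocking pass (e.g. same DOB or zipcode)
# get this fuzzier second-pass check rather than an all-pairs compare.
def likely_same_name(a: str, b: str, threshold: float = 0.85) -> bool:
    return name_similarity(a, b) >= threshold
```

The key to making this scale is blocking first, so the expensive comparator only runs on candidate pairs instead of all 500M x 500M combinations.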

Hi David,

Have seen similar things on other threads. One of the things that has been recommended to me is to use a dedicated graph database, e.g. Neo4j, and essentially do an enrichment between Elasticsearch and Neo4j to give the final score and threshold for the MDM key. It requires two stores of the same data and a custom interface between the two, but it seems to get around the scale issue.
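A rough sketch of what that enrichment step might look like, assuming the search side and the graph side each produce a normalised evidence score (the weights and cut-off here are invented, not from any product):

```python
# Hypothetical two-store enrichment: Elasticsearch scores the textual
# match, the graph store scores how connected the two records are, and
# a weighted blend is thresholded to decide on a shared MDM key.
def combined_score(text_score: float, graph_score: float,
                   w_text: float = 0.4, w_graph: float = 0.6) -> float:
    """Weighted blend of search-side and graph-side evidence, in [0, 1]."""
    return w_text * text_score + w_graph * graph_score

MDM_THRESHOLD = 0.75  # assumed cut-off, would be tuned per data set

def same_mdm_key(text_score: float, graph_score: float) -> bool:
    return combined_score(text_score, graph_score) >= MDM_THRESHOLD
```

The custom interface between the two stores then just has to ferry record IDs across and fetch the two scores per candidate pair.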

Curious - do you need pre-fused entities or can you assemble them as-you-go?

It's the difference between working with pre-fused entities like this:

   {
     "fused_entity_id": 435223,
     "keys": ["passport1", "handle1", "handle2", ...]
   }

(which take a lot of maintenance using any tech stack) and ...

   { "doc_id": 1, "keys": ["passport1", "handle1"] }
   { "doc_id": 2, "keys": ["passport1", "handle2"] }
   { "doc_id": 3, "keys": ["unrelated_passport", "unrelated handle"] }

... where a single entity can be assembled on the fly, using the graph API to walk the chain of identifiers in your entity-sightings data. The advantage of the latter approach is that a user can experiment with fuzzier keys that would over-link if you tried using them in batch linking rules that maintain all entities at index time.
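Walking that chain of identifiers is essentially a breadth-first search over docs that share keys. Here's a self-contained sketch (plain Python rather than the graph API) using the three docs above: docs 1 and 2 share `passport1`, so they assemble into one entity, while doc 3 stands alone.

```python
from collections import defaultdict, deque

docs = {
    1: ["passport1", "handle1"],
    2: ["passport1", "handle2"],
    3: ["unrelated_passport", "unrelated handle"],
}

def assemble_entity(seed_doc_id: int, docs: dict) -> set:
    """Walk shared keys breadth-first; return every reachable doc id."""
    key_to_docs = defaultdict(set)  # inverted index: key -> doc ids
    for doc_id, keys in docs.items():
        for key in keys:
            key_to_docs[key].add(doc_id)
    seen = {seed_doc_id}
    frontier = deque([seed_doc_id])
    while frontier:
        current = frontier.popleft()
        for key in docs[current]:
            for neighbour in key_to_docs[key] - seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen
```

In the real system the inverted index lookup would be a query against Elasticsearch rather than an in-memory dict, but the walk is the same.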

On the fly is perfect.
I'd love to see an example on GitHub to get me started.

Hey Mark
Just to let you know it has finally clicked. I knew graph theory was nice,
but it's REALLY nice embedded in Elasticsearch. Very excited about what can
be achieved with Graph and Elasticsearch.
Building some demos now.


Good to know. The trick is to ensure that each entity sighting has its own dedicated document - I call these "entity references". Original business documents like an insurance claim may contain references to several different person entities (driver, third party, witness...). With multi-entity docs like this, each of these people should have their keys broken out into separate person_reference child docs to avoid muddling the different identities. Each person_reference doc contains:

  • several different keys that can be used to identify that person (name+DOB+zipcode, SSN, email address...)
  • the id of the source business document e.g. our insurance claim number

This gives you all you need to crawl the graph of keys that make up the aliases for a person entity which you can then visually group in the Graph UI as a single vertex (I'm thinking of adding a button to help automate this UI grouping). From each person entity you can also crawl out via the IDs of original business docs to discover connections to other entities (people, cars, addresses).
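A quick sketch of that break-out step (the claim structure and field names here are made up for illustration): one multi-person business doc becomes one person_reference child doc per person, each carrying its own keys plus the source doc id.

```python
# Hypothetical "entity reference" pattern: one insurance claim mentions
# several people; each person's keys go into a separate child doc so the
# identities never get muddled.
claim = {
    "claim_id": "CLM-1001",
    "people": [
        {"role": "driver",  "keys": ["ssn:111-22-3333", "email:jo@example.com"]},
        {"role": "witness", "keys": ["name_dob_zip:John|1980-03-14|90210"]},
    ],
}

def to_person_references(claim: dict) -> list:
    """Emit one person_reference child doc per person in the claim."""
    return [
        {
            "doc_type": "person_reference",
            "source_doc_id": claim["claim_id"],  # crawl back to the claim
            "keys": person["keys"],              # crawl out to aliases
        }
        for person in claim["people"]
    ]
```

Each child doc then serves both crawls described above: the keys link out to a person's other sightings, and the source_doc_id links back to the original business document and on to the other entities it mentions.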

I plan to cover these techniques in more depth in a future blog or talk.