Creating an MDM

David_she · June 25, 2016, 12:30am

Hi,
We are using Elasticsearch and have around 500,000,000 person records. I'm looking at creating an MDM on persons. I'm thinking Java but am worried about performance. I'm happy to get my head around Spark or use MapReduce. Any tips, or has anyone else done a similar project? Appreciate your help.

warkolm · June 25, 2016, 8:50pm

What's MDM?

David_she · June 26, 2016, 3:55am

A Master Data Management system. Basically doing entity resolution on a
person. E.g. if the name is John Smith is it the same John Smith as another
one in the corpus? Based on DOB or passport, address or some other metadata
field.

Thanks

warkolm · June 26, 2016, 6:48am

Ahh ok. Graph can totally do that for you!

Check out https://www.elastic.co/products/graph or https://www.elastic.co/guide/en/graph/current/index.html
There are also a bunch of great blog posts that explain it.

David_she · June 26, 2016, 9:05am

Champion, I'll have a look.

ryan_3_thomas · February 2, 2017, 10:38am

Hi David,

I'm looking to implement something similar. I'd be grateful for your experience on this project and any pitfalls to avoid.

Thanks,
Ryan

David_she · February 2, 2017, 7:45pm

Hi Ryan
We have just started using Graph. The issue is scale, because we then need a second pass using MapReduce and string comparators. We tried Spark with Scala, but the recursive nature of the comparison killed us.
I'll let you know more in a couple of weeks. Appreciate any feedback on your findings.
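
For readers unfamiliar with the "string comparators" pass mentioned above, here is a minimal sketch of the idea, assuming the candidates have already been narrowed to a small group (for example by a shared key). It uses Python's standard-library `difflib`; the function name `similar_pairs` and the 0.85 threshold are illustrative assumptions, not what was actually run in this project.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(names, threshold=0.85):
    """Return pairs of names whose similarity ratio meets the threshold.

    Comparison is pairwise, so it is only practical inside a small
    candidate group, never across the whole corpus.
    """
    pairs = []
    for a, b in combinations(names, 2):
        # Compare case-insensitively; ratio() is 1.0 for identical strings.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs

candidates = ["John Smith", "Jon Smith", "Jane Doe"]
matches = similar_pairs(candidates)
```

With these sample names only the "John Smith" / "Jon Smith" pair clears the threshold. The pairwise cost grows quadratically with group size, which is one reason the candidate groups need to stay small at this scale.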

ryan_3_thomas · February 3, 2017, 3:41pm

Hi David,

I've seen similar things on other threads. One thing that has been recommended to me is to have a dedicated graph database, e.g. Neo4j, and essentially do an enrichment between Elasticsearch and Neo4j to give the final score and threshold for the MDM key. It requires two stores of the same data and a custom interface between the two, but seems to get around the scale issue.

Mark_Harwood · February 3, 2017, 5:28pm

Curious - do you need pre-fused entities or can you assemble them as-you-go?

It's the difference between working with pre-fused entities like this:

{
   "fused_entity_id": 435223,
   "keys": ["passport1", "handle1", "handle2" ...]
}

(which take a lot of maintenance using any tech stack) and ...

{
   "doc_id": 1,
   "keys": ["passport1", "handle1"]
}
{
   "doc_id": 2,
   "keys": ["passport1", "handle2"]
}
{
   "doc_id": 3,
   "keys": ["unrelated_passport", "unrelated handle"]
}

... where a single entity can be assembled on-the-fly using the graph API to walk the chain of identifiers in your entity sightings data. The advantage of the latter approach is a user can experiment with fuzzier keys that would over-link if you tried using them in your batch linking rules that maintain all entities at index-time.
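
As a rough illustration of the as-you-go idea, the sketch below walks the chain of shared keys over the three sample docs above. It is plain Python over an in-memory list, not the Graph API, and the function name `assemble_entity` is hypothetical.

```python
from collections import defaultdict, deque

# The three entity sightings from the example above.
docs = [
    {"doc_id": 1, "keys": ["passport1", "handle1"]},
    {"doc_id": 2, "keys": ["passport1", "handle2"]},
    {"doc_id": 3, "keys": ["unrelated_passport", "unrelated handle"]},
]

def assemble_entity(docs, seed_key):
    """Breadth-first walk from one key to every key and doc linked to it."""
    key_to_docs = defaultdict(set)
    doc_to_keys = {}
    for doc in docs:
        doc_to_keys[doc["doc_id"]] = set(doc["keys"])
        for key in doc["keys"]:
            key_to_docs[key].add(doc["doc_id"])

    seen_keys = {seed_key}
    seen_docs = set()
    queue = deque([seed_key])
    while queue:
        key = queue.popleft()
        # Hop from a key to the docs that mention it...
        for doc_id in key_to_docs.get(key, ()):
            if doc_id in seen_docs:
                continue
            seen_docs.add(doc_id)
            # ...and from each doc to the other keys it contains.
            for other in doc_to_keys[doc_id]:
                if other not in seen_keys:
                    seen_keys.add(other)
                    queue.append(other)
    return seen_keys, seen_docs

keys, doc_ids = assemble_entity(docs, "handle1")
```

Starting from "handle1", the walk reaches doc 1, then "passport1", then doc 2 and "handle2", and never touches doc 3. Adding a fuzzier key to the docs changes what gets linked without rebuilding any stored entity, which is the flexibility described above.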

David_she · February 4, 2017, 8:53am

On the fly is perfect.
Love to see an example on github to get me started

David_she · February 4, 2017, 9:46am

Hey Mark
Just to let you know it has finally clicked. I knew graph theory was nice,
but it is REALLY nice embedded in Elasticsearch. Very excited about what can be
achieved with Graph and Elasticsearch.
Building some demos now.

Mark_Harwood · February 6, 2017, 8:58am

Good to know. The trick is to ensure that each entity sighting has its own dedicated document - I call these an "entity reference". Original business documents like an insurance claim may contain references to several different people entities (driver, third party, witness...). With multi-entity docs like this each of these people should have their keys broken out into different person_reference child docs to avoid muddling the different identities. Each person_reference doc contains:

- several different keys that can be used to identify that person (name+DOB+zipcode, SSN, email address...)
- the id of the source business document, e.g. our insurance claim number

This gives you all you need to crawl the graph of keys that make up the aliases for a person entity which you can then visually group in the Graph UI as a single vertex (I'm thinking of adding a button to help automate this UI grouping). From each person entity you can also crawl out via the IDs of original business docs to discover connections to other entities (people, cars, addresses).
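
A minimal sketch of breaking a multi-entity business document into person_reference docs, assuming a made-up claim layout: the field names (`claim_id`, `people`, `role`, `source_doc_id`) and the key prefixes are illustrative assumptions, not a schema from this thread.

```python
def to_person_references(claim):
    """Emit one person_reference doc per person mentioned in a claim."""
    refs = []
    for person in claim["people"]:
        keys = []
        # Composite key only when all three parts are present.
        if all(person.get(f) for f in ("name", "dob", "zipcode")):
            composite = "|".join(
                [person["name"].lower(), person["dob"], person["zipcode"]]
            )
            keys.append("name_dob_zip:" + composite)
        # Single-field keys, prefixed so different key types never collide.
        for field in ("ssn", "email"):
            if person.get(field):
                keys.append(field + ":" + person[field].lower())
        refs.append({
            "source_doc_id": claim["claim_id"],  # link back to the business doc
            "role": person["role"],
            "keys": keys,
        })
    return refs

claim = {
    "claim_id": "C-1001",
    "people": [
        {"role": "driver", "name": "John Smith", "dob": "1980-01-01",
         "zipcode": "90210", "email": "JS@example.com"},
        {"role": "witness", "name": "Ann Lee", "ssn": "123-45-6789"},
    ],
}
refs = to_person_references(claim)
```

Each person gets a separate doc, so the driver's email is never mixed with the witness's SSN, while the shared `source_doc_id` still lets a crawl hop from one person to the others on the same claim.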

I plan to cover these techniques in more depth in a future blog or talk.

Cheers
Mark
