I have two relatively large sets of people data (first_name, last_name, birth_date, etc.), and I need to 'match' them: some people are present in both sets, and I need to find and mark them by adding their ID from the other set. I've come up with this solution:
- Load set A into an Elasticsearch index (6M entries)
- During the load of set B, for each entry, make an update-by-query request which looks for people with the same first_name (Text), last_name (Text) & birth_date (Date) and adds a `b_id` field through a simple Painless script
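The stored script itself isn't shown here; a minimal version might look like this (assuming it only sets the field from a parameter named `value`, as in the request body below):

```json
PUT _scripts/id-field-adding-script
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.b_id = params.value"
  }
}
```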
So, for example, index A might roughly look like this:

| _id | first_name | last_name | birth_date | b_id |
|---|---|---|---|---|
| yJzdWIiOiJ | demo1 | qwerty | 2001.11.04 | 4d5323bd91c2 |
| MKPj0ILgq1 | demo2 | demo2 | 1995.11.11 | null |
| oueUg3sBO | demo512 | demo512 | 2000.05.16 | null |
Here, the entry with id `yJzdWIiOiJ` got matched with the entry with id `4d5323bd91c2` from index B.
An example body for such an update-by-query request:

```json
{
  "script": {
    "id": "id-field-adding-script",
    "params": {
      "value": 9150456064
    }
  },
  "size": 1000,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "first_name": {
              "query": "Ryan",
              "operator": "AND",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1.0
            }
          }
        },
        {
          "match": {
            "last_name": {
              "query": "Jewel",
              "operator": "AND",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1.0
            }
          }
        },
        {
          "match": {
            "birth_date": {
              "query": "1992-09-11",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1.0
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}
```
This request is sent for every entry (via a Java Spring backend). It is not as slow as I expected, but I still want to improve it. In my tests, set A has 6M entries and set B has ~250K. It takes up to 2 hours to both match set B against set A and import set B afterwards (the job actually takes 10k entries at a time, matches them, imports them into ES, and repeats).
The question is: how do I improve this and make it faster? Or maybe you can suggest another approach entirely.