Match data from two sets and mark entries

I have two relatively large sets of people data (first_name, last_name, birth_date, etc.), and I need to 'match' them: some people are present in both sets, and I need to find and mark them by adding their ID from the other set. I've come up with this solution:

  1. Load set A into an Elasticsearch index (6M entries)
  2. During the load of set B, for each entry, issue an update-by-query request that looks for people with the same first_name (text), last_name (text) and birth_date (date) and adds a b_id field through a simple Painless script

So for example, A index roughly might look like this:

| _id | first_name | last_name | birth_date | b_id |
|---|---|---|---|---|
| yJzdWIiOiJ | demo1 | qwerty | 2001.11.04 | 4d5323bd91c2 |
| MKPj0ILgq1 | demo2 | demo2 | 1995.11.11 | null |
| oueUg3sBO | demo512 | demo512 | 2000.05.16 | null |

Here, an entry with id yJzdWIiOiJ got matched with an entry with id 4d5323bd91c2 from index B.

An example request for such flow:

{
    "script":{
        "id":"id-field-adding-script",
        "params":{
            "value":9150456064
        }
    },
    "size":1000,
    "query":{
        "bool":{
            "must":[
                {
                    "match":{
                        "first_name":{
                            "query":"Ryan",
                            "operator":"AND",
                            "prefix_length":0,
                            "max_expansions":50,
                            "fuzzy_transpositions":true,
                            "lenient":false,
                            "zero_terms_query":"NONE",
                            "auto_generate_synonyms_phrase_query":true,
                            "boost":1.0
                        }
                    }
                },
                {
                    "match":{
                        "last_name":{
                            "query":"Jewel",
                            "operator":"AND",
                            "prefix_length":0,
                            "max_expansions":50,
                            "fuzzy_transpositions":true,
                            "lenient":false,
                            "zero_terms_query":"NONE",
                            "auto_generate_synonyms_phrase_query":true,
                            "boost":1.0
                        }
                    }
                },
                {
                    "match":{
                        "birth_date":{
                            "query":"1992-09-11",
                            "operator":"OR",
                            "prefix_length":0,
                            "max_expansions":50,
                            "fuzzy_transpositions":true,
                            "lenient":false,
                            "zero_terms_query":"NONE",
                            "auto_generate_synonyms_phrase_query":true,
                            "boost":1.0
                        }
                    }
                }
            ],
            "adjust_pure_negative":true,
            "boost":1.0
        }
    }
}
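
The stored script referenced by `id-field-adding-script` isn't shown in the thread; assuming it does nothing more than copy the parameter into the `b_id` field, it could be as simple as this (a sketch, not the actual script):

```json
PUT _scripts/id-field-adding-script
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.b_id = params.value"
  }
}
```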

This request is sent for every entry (via a Java Spring backend). It is not as slow as I expected, but I still want to improve it. In my tests, set A has 6M entries and set B has ~250K. Matching set B against set A and then importing set B takes up to 2 hours (the process takes 10k entries at a time, matches them, imports them into ES, and repeats).

The question is: how do I improve this and make it faster? Or maybe you can suggest another approach.

There is no built-in function to solve this very concrete use-case.

You could possibly do this with two single searches… albeit longer ones :slight_smile:

How about running two scroll/PIT searches, one against index A, one against index B? Sort both by last_name, first_name, birth_date, then iterate through the index with fewer documents and compare each entry against the index with more documents until you find a matching document.

The idea would be something like leap-frogging/zig-zag joining the data and thus finding the matches. Whenever you find a match, run an update query (and potentially bulk those updates to save some more resources).
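
To illustrate the idea in pseudocode-style Python (the in-memory lists stand in for the two sorted scroll/PIT streams, and the field names are taken from the example above):

```python
# Zig-zag/merge join over two streams that are BOTH already sorted by
# (last_name, first_name, birth_date). Whichever cursor is "behind"
# advances; when the keys are equal, we have a match.
# Simplification: duplicate keys are paired one-to-one only.

def zigzag_join(stream_a, stream_b):
    """stream_a/stream_b: iterables of (last_name, first_name, birth_date, id),
    sorted by the first three fields. Yields (a_id, b_id) pairs."""
    it_a, it_b = iter(stream_a), iter(stream_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        key_a, key_b = a[:3], b[:3]
        if key_a < key_b:
            a = next(it_a, None)              # A is behind: advance A
        elif key_a > key_b:
            b = next(it_b, None)              # B is behind: advance B
        else:
            yield a[3], b[3]                  # keys match: record the pair
            a, b = next(it_a, None), next(it_b, None)

a_docs = [("demo2", "demo2", "1995.11.11", "MKPj0ILgq1"),
          ("qwerty", "demo1", "2001.11.04", "yJzdWIiOiJ")]
b_docs = [("qwerty", "demo1", "2001.11.04", "4d5323bd91c2")]
print(list(zigzag_join(a_docs, b_docs)))  # [('yJzdWIiOiJ', '4d5323bd91c2')]
```

Each yielded pair would then become one item of a bulk update request against index A.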

Does that make sense?

I'm relatively new to ES, not sure I get the idea. Could you please show some example?

When I hear "match data" in Elasticsearch, the percolator automatically comes to my mind...
I can't dig into it further right now, but maybe index set B into a percolation index (with term queries), and percolate documents from set A into it, in order to retrieve the matching IDs?
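
For illustration, such a percolation index might look like this (field names are assumed from the example above; a sketch, not tested). Each entry of set B is stored as a query, with the B document's `_id` as the query document's ID; percolating a document from set A then returns the matching B IDs directly:

```json
PUT b-percolator
{
  "mappings": {
    "properties": {
      "query":      { "type": "percolator" },
      "first_name": { "type": "keyword" },
      "last_name":  { "type": "keyword" },
      "birth_date": { "type": "date" }
    }
  }
}

PUT b-percolator/_doc/4d5323bd91c2
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "first_name": "demo1" } },
        { "term": { "last_name": "qwerty" } },
        { "term": { "birth_date": "2001-11-04" } }
      ]
    }
  }
}

GET b-percolator/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "first_name": "demo1",
        "last_name": "qwerty",
        "birth_date": "2001-11-04"
      }
    }
  }
}
```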