I have two relatively large sets of people data (first_name, last_name, birth_date, etc.), and I need to 'match' them: some people are present in both sets, and I need to find and mark them by adding their ID from the other set. I've come up with this solution:
- Load set A into an Elasticsearch index (6M entries)
- During the load of set B, for each entry, make an update-by-query request which looks for people with the same first_name (Text), last_name (Text) & birth_date (Date) and adds a `b_id` field through a simple Painless script
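The stored script itself isn't shown here; a minimal version might look like this (assuming it only sets the field from a parameter named `value`, as in the request body below):

```json
PUT _scripts/id-field-adding-script
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.b_id = params.value"
  }
}
```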
So, for example, index A might roughly look like this:

| _id | first_name | last_name | birth_date | b_id |
|---|---|---|---|---|
| yJzdWIiOiJ | demo1 | qwerty | 2001.11.04 | 4d5323bd91c2 |
| MKPj0ILgq1 | demo2 | demo2 | 1995.11.11 | null |
| oueUg3sBO | demo512 | demo512 | 2000.05.16 | null |
Here, the entry with id `yJzdWIiOiJ` got matched with the entry with id `4d5323bd91c2` from index B.
An example body for such an update-by-query request:

```json
{
  "script": {
    "id": "id-field-adding-script",
    "params": {
      "value": 9150456064
    }
  },
  "size": 1000,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "first_name": {
              "query": "Ryan",
              "operator": "AND",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1.0
            }
          }
        },
        {
          "match": {
            "last_name": {
              "query": "Jewel",
              "operator": "AND",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1.0
            }
          }
        },
        {
          "match": {
            "birth_date": {
              "query": "1992-09-11",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1.0
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}
```
This request is sent for every entry (via a Java Spring backend). It is not as slow as I expected, but I still want to improve it. In my tests, set A has 6M entries and set B has ~250K. It takes up to 2 hours to both match set B against set A and import set B afterwards (the job actually takes 10k entries at a time, matches them, imports them into ES, and repeats).
The question is: how do I improve this and make it faster? Or maybe you can suggest another approach entirely.