I'm using elastic for storing documents with multiple attributes that can go on and off (like active or inactive).
They can go up to 1-2 million per index.
On a daily basis I need to synchronize those documents who can become OFF (or back ON, or insert new ones) with some data feeds.
Based on data feeds I know what IDs have status="on". Nothing else.
My approach now is to preprocess and make a difference between all docs that had status="on" before synchronization, and after synch is finished.
Which gets me a list of ID's that will need to be updated with "off"
Compared to total number those is usually 0.5-1%
I run 2 queries: A. update_by_query where I set status = "off" for docs in this list of ID's
and B. one where I set status ="on" for docs not in the list of deleted ID's
{
"query": {
"bool": {
"must_not": {
"terms": {
"_id": ["id1", "id2", "id3", ... , "idN"]
}
}
}
}
}
From time to time this list gets to a few hundred thousands.
My challenges:
- when this list gets really huge (once a month it happens), I can not make the query (it's MB in size)
- I can't split it in batches because I wouldn't know what to update
My question is how else could I do it?