Best approach to update a huge number of documents (millions) with a single query

I'm using Elasticsearch to store documents with multiple attributes that can toggle on and off (like active or inactive).
There can be up to 1-2 million documents per index.
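
For illustration, a document might look roughly like this (the index name my_index and the name field are made up for the example; only the status field matters here):

PUT my_index/_doc/id1
{
  "status": "on",
  "name": "example attribute payload"
}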

On a daily basis I need to synchronize these documents with some data feeds: documents can turn OFF, come back ON, or be inserted as new.

From the data feeds I only know which IDs have status="on". Nothing else.
My current approach is to preprocess: compute the difference between the set of docs that had status="on" before synchronization and the set after the sync has finished.

This gives me a list of IDs that need to be updated to "off".
Compared to the total number of documents, this is usually 0.5-1%.
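
For the "before" half of that diff, a minimal sketch of how the currently-"on" IDs could be collected (the index name my_index is an assumption, and in practice the results would be paged with search_after or the scroll API rather than a single request):

GET my_index/_search
{
  "query": {
    "term": {
      "status": "on"
    }
  },
  "_source": false,
  "size": 10000,
  "sort": ["_doc"]
}

With "_source": false the hits carry only metadata, so each hit's _id is all that comes back over the wire.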

I run two _update_by_query requests: (A) one that sets status = "off" for docs in this list of IDs,
and (B) one that sets status = "on" for docs not in the list of deleted IDs, using a query like this:

{
  "query": {
    "bool": {
      "must_not": {
        "terms": {
          "_id": ["id1", "id2", "id3", ... , "idN"]
        }
      }
    }
  }
}
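
Written out in full, request A might look like the sketch below (the index name my_index is an assumption; request B would be the same call with "on" in the script and the must_not query above as its query):

POST my_index/_update_by_query?conflicts=proceed
{
  "script": {
    "source": "ctx._source.status = 'off'",
    "lang": "painless"
  },
  "query": {
    "terms": {
      "_id": ["id1", "id2", "id3"]
    }
  }
}

conflicts=proceed keeps the update running past version conflicts instead of aborting, which is usually fine for an idempotent status flip like this.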

From time to time this list grows to a few hundred thousand IDs.

My challenges:

  • when this list gets really huge (it happens about once a month), I cannot run the query at all, because the request body grows to several MB
  • I can't split the list into batches, because a must_not over a partial batch would match (and wrongly flip to "on") the IDs that belong to the other batches

My question is: how else could I do this?
