I am looking for an efficient way to update large numbers of entries
in my index. Let's say my Elasticsearch index contains documents with
tags. Tags are expressed as arrays of strings. So one document might
have tags ['foo', 'bar'] and some other document might have tags ['bar', 'baz'].
Typically I need to add a certain tag to a number of documents (I
know their IDs) and remove this tag from all other documents. The index
holds about half a million documents and is growing, and each update
touches tens, maybe hundreds of thousands of them.
The operation doesn't need to be atomic. Concurrent writes are not a
concern: I can be fairly sure there are no other write operations on
that index while I'm updating.
My goal is for this update to be done as quickly as possible.
Let's imagine tag 'foo' should end up on documents 123, 456 and 555 only.
Should I remove tag 'foo' from all documents in the
index, and then add it to 123, 456 and 555? Or should I first get the
list of documents that need to have that tag removed, then remove the tag
only from those documents and add it only to the documents that need it?
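
To make the second option concrete, here is a rough Python sketch of what I have in mind. It only computes which documents need to change and builds newline-delimited request bodies for the _bulk endpoint; it does not talk to the cluster. The index name "docs", the helper names, the batch size and both Painless scripts are my own assumptions for illustration and are untested against a real cluster. I also assume the IDs that currently carry the tag have already been fetched, for example with a term query on tags.

```python
import json

# Painless scripts: assumptions for illustration, not verified on a live cluster.
ADD_TAG = (
    "if (ctx._source.tags == null) { ctx._source.tags = [params.tag]; } "
    "else if (!ctx._source.tags.contains(params.tag)) "
    "{ ctx._source.tags.add(params.tag); } "
    "else { ctx.op = 'noop'; }"
)
REMOVE_TAG = "ctx._source.tags.removeIf(t -> t == params.tag)"


def plan_changes(currently_tagged, should_be_tagged):
    """Return (ids_to_add, ids_to_remove); documents already correct are skipped."""
    current = set(currently_tagged)
    target = set(should_be_tagged)
    return sorted(target - current), sorted(current - target)


def bulk_lines(index, tag, ids_to_add, ids_to_remove):
    """Yield two NDJSON lines per document: the action line, then the script line."""
    work = [(i, ADD_TAG) for i in ids_to_add]
    work += [(i, REMOVE_TAG) for i in ids_to_remove]
    for doc_id, source in work:
        yield json.dumps({"update": {"_index": index, "_id": doc_id}})
        yield json.dumps(
            {"script": {"source": source, "lang": "painless",
                        "params": {"tag": tag}}}
        )


def bulk_bodies(lines, docs_per_request=1000):
    """Group NDJSON lines into request bodies; each body ends with a newline."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == 2 * docs_per_request:  # two lines per document
            yield "\n".join(batch) + "\n"
            batch = []
    if batch:
        yield "\n".join(batch) + "\n"


if __name__ == "__main__":
    # Hypothetical state: 'foo' is currently on 123, 777 and 888.
    to_add, to_remove = plan_changes(["123", "777", "888"], ["123", "456", "555"])
    for body in bulk_bodies(bulk_lines("docs", "foo", to_add, to_remove)):
        print(body, end="")
```

Each body would then be sent as a POST to /_bulk with Content-Type application/x-ndjson. With this approach document 123 is never touched because it already has the tag, which is the saving I am hoping for compared to rewriting every document in the index.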
Are there any other ways of solving this problem?