Reindexing with zero downtime - update document

Hello,

I am looking for a solution for zero-downtime reindexing that supports reads, writes, deletes and updates without any interruption or data loss.

In our case, we encrypt the data in each index using encryption keys, and we rotate the keys every 6 months.

Initially we create index1 with separate aliases, so that read, write and delete operations each go through their own alias in the application.

index1

  • read_alias
  • write_alias
  • delete_alias
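The initial setup above can be expressed as a single `_aliases` call. A minimal sketch (index and alias names taken from this post; the body follows the Elasticsearch `_aliases` API format):

```python
def initial_alias_actions(index="index1"):
    """Build the POST /_aliases request body that attaches the three
    operation-specific aliases to the initial index."""
    return {
        "actions": [
            {"add": {"index": index, "alias": alias}}
            for alias in ("read_alias", "write_alias", "delete_alias")
        ]
    }

# Send with: POST /_aliases  (body = initial_alias_actions())
body = initial_alias_actions()
```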

So when we want to reindex, we create index2 as the destination for reindexing from index1 (re-encrypting with the new encryption key) to perform the key rotation.
We also create index3 for writing new documents (with the new encryption key).

Before the reindex starts, the aliases on each index are as follows:

1. index1 (original index)

  • read_alias
  • delete_alias

2. index2 (dest index for reindex)

  • delete_alias

3. index3 (new write index)

  • read_alias
  • write_alias
  • delete_alias (points to the new index so new writes can be deleted)
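Reaching this pre-reindex state from the initial setup can be done atomically in one `_aliases` call, so there is no window where writes have no target. A sketch, assuming the index/alias names described above:

```python
def pre_reindex_actions():
    """Build the POST /_aliases body that moves writes to index3 and
    fans out delete_alias to index2 and index3. All actions in one
    _aliases request are applied atomically."""
    return {
        "actions": [
            # move writes from the original index to the new write index
            {"remove": {"index": "index1", "alias": "write_alias"}},
            {"add": {"index": "index3", "alias": "write_alias"}},
            # new documents must also be readable and deletable
            {"add": {"index": "index3", "alias": "read_alias"}},
            {"add": {"index": "index3", "alias": "delete_alias"}},
            # deletes must also reach the reindex destination
            {"add": {"index": "index2", "alias": "delete_alias"}},
        ]
    }
```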

This is what we designed earlier to achieve zero-downtime reads, writes and deletes for another application that did not have update operations. Now we need updates as well in this application.

After the reindex, we switch the aliases as follows:

1. index1 (original index)

  • all aliases will be removed

2. index2 (dest index for reindex)

  • read_alias (added)
  • delete_alias

3. index3 (new write index)

  • read_alias
  • write_alias
  • delete_alias
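The post-reindex switch can likewise be a single atomic `_aliases` call: index2 takes over reads while index1 loses all of its aliases. A sketch under the same naming assumptions:

```python
def post_reindex_actions():
    """Build the POST /_aliases body that cuts reads over to index2
    and detaches index1 entirely, in one atomic step."""
    return {
        "actions": [
            # reads now served by the reindexed (re-encrypted) copy
            {"add": {"index": "index2", "alias": "read_alias"}},
            # strip the remaining aliases from the original index
            {"remove": {"index": "index1", "alias": "read_alias"}},
            {"remove": {"index": "index1", "alias": "delete_alias"}},
        ]
    }
```

After this call index1 holds no aliases and can be deleted once it is no longer needed.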

So we also want to handle updates, via aliases, on both the existing index (index1) and index2.

Limitations we have faced:

  • When an alias points to multiple indices, we can't perform write operations through it
  • Get-by-id and delete-by-id can't be done through an alias that points to multiple indices (so we are considering using _search and delete_by_query with the _id field instead)
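The delete_by_query workaround mentioned above can use an `ids` query, which does work against an alias spanning multiple indices. A minimal sketch of the request body (the endpoint is `POST /<alias>/_delete_by_query`):

```python
def delete_by_id_body(doc_id):
    """Build the _delete_by_query body that deletes one document by _id,
    usable against a multi-index alias where DELETE /<alias>/_doc/<id>
    is rejected."""
    return {"query": {"ids": {"values": [doc_id]}}}

# POST /delete_alias/_delete_by_query  with this body:
body = delete_by_id_body("my-doc-id")
```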

I have searched for how to reindex from index1 to index2 while handling updates on both index1 and index2 during the reindex, to avoid data loss.

I couldn't find any clean solution for updating existing documents. The solutions I found are either:

  • Block updates on index1 while reindexing to index2
  • Or write to each index by checking which index contains the document and updating it by index name

Any help/pointers would be appreciated.

I do not understand the point of these aliases. Can you please elaborate?

It sounds to me like you are updating encrypted data within the documents. If you had a field in each document specifying the encryption key version used, would you not be able to avoid juggling all the indices and just update the data in place? That would resolve the problem, wouldn't it?
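A minimal sketch of that idea (the `key_version` field name and the key registry are hypothetical, not from your setup): each document records which key encrypted it, so rotation becomes an in-place update rather than a reindex.

```python
# Hypothetical per-document key-version scheme: the key is chosen by a
# field in the document, not by the index the document lives in.
KEY_REGISTRY = {"v1": "old-key-handle", "v2": "new-key-handle"}  # hypothetical

def needs_rotation(doc, current_version="v2"):
    """True if this document is still encrypted with an old key version
    and should be re-encrypted in place via an update."""
    return doc.get("key_version") != current_version
```

Rotation would then be a sweep that updates stale documents where they are, with no second index involved.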

We have used different aliases for different operations:

  • read_alias will be used for _search
  • delete_alias will be used for delete operations (delete_by_query)
  • write_alias for creating new documents

We map each encryption key to an index name; we do not maintain any key field in the documents. Every field value is encrypted with the key mapped to that specific index. When we want to rotate the key for existing data, we create another index, map a new key to it, and re-encrypt the data with the new key.

If you selected the encryption key based on a field in the data instead of the index, I suspect the issue would go away, and it would likely be easier to manage. If you need to transition between different indices while handling deletes and updates without data loss, I do not see any way of doing this that does not require downtime.

Maybe others have some ideas or suggestions.

Our product fully depends on keys tied to index names, since for security reasons we manage a single key per index. So we are looking for a way to achieve updates while reindexing.

Any ideas?

If you do not get any suggestions, it is often a sign that it is not possible, or that it requires very specialised knowledge. The only way I can see this possibly working (and I am not sure even this is 100% possible) is if you put an application layer in front of Elasticsearch that keeps track of where data resides and ensures consistency across the indices. I have worked with Elasticsearch for a long time and do not see any way of handling this within Elasticsearch itself. Not every problem has a nice, neat and simple solution.
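A minimal sketch of what such an application layer might look like (all names here are hypothetical, and this omits the hard part, keeping the tracking table consistent with in-flight reindex copies): the application records which index holds each document, so updates and deletes target a concrete index instead of a multi-index alias.

```python
class DocRouter:
    """Hypothetical application-side routing table: maps document ids
    to the index that currently holds them, so update/delete requests
    can bypass the multi-index alias limitation."""

    def __init__(self, default_index="index3"):
        self._location = {}          # doc_id -> index name
        self._default = default_index  # current write index

    def record_write(self, doc_id, index):
        # called whenever a document is written or moved by reindex
        self._location[doc_id] = index

    def index_for(self, doc_id):
        # unknown ids fall back to the current write index
        return self._location.get(doc_id, self._default)
```

During a reindex, every copied document would need its entry updated here, and concurrent updates reconciled, which is exactly where the consistency risk lies.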