Strategy for keeping multiple indices in the same cluster synchronized

I have four indices in the same cluster (running ES v7.17, in case that matters) with different mappings and settings. For reasons that will soon become clear, we'll call them v1, v2-a, v2-b, and v2-c.

v1 contains three different categories of documents, a, b, and c. It is updated from our application using the bulk index API. Each document includes an indexed_at date field with the timestamp of when the document was last indexed.
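For illustration, each bulk request looks roughly like this (the fields other than category and indexed_at are just placeholders):

POST /v1/_bulk
{ "index": { "_id": "doc-1" } }
{ "category": "a", "indexed_at": "2024-05-01T12:00:00Z", "title": "..." }
{ "index": { "_id": "doc-2" } }
{ "category": "b", "indexed_at": "2024-05-01T12:00:00Z", "title": "..." }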

For v2, we split the three categories into three separate indices. They are kept up to date with the Reindex API. First, we get the most recent index time from the target index:

POST /v2-a/_search
{
  "_source": ["indexed_at"],
  "query": {
    "match_all": {}
  },
  "size": 1,
  "sort": [
    {
      "indexed_at": {
        "order": "desc"
      }
    }
  ]
}

and then use that timestamp in the source query of a reindex request:

POST _reindex
{
  "source": {
    "index": "v1",
    "query": {
      "bool": {
        "must": [
          {
            "term": {
              "category": {
                "value": "a"
              }
            }
          },
          {
            "range": {
              "indexed_at": {
                "gt": "[most recent indexed_at timestamp in v2-a from above query]"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "v2-a",
    "pipeline": "v1-to-v2-a"
  }
}

This part is working great. The problem is deletes. I can't figure out a reasonably efficient way to delete documents from v2-a/b/c that no longer exist in v1. I've tried using an aggregation to find all IDs that exist in only one index and then using _delete_by_query to remove those IDs from the corresponding v2 index. This was working in small-scale tests, but it starts failing (by deleting far too many documents) once we get past about 50,000 documents.
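The delete step itself is simple enough; it's roughly this (the list of IDs is just a placeholder for whatever the aggregation returns):

POST /v2-a/_delete_by_query
{
  "query": {
    "ids": {
      "values": ["<IDs found in v2-a but not in v1>"]
    }
  }
}

The hard part is producing an accurate list of IDs to feed it once the indices get large.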

The only thing I can think to try is to change our v1 indexing strategy so that deleted documents aren't actually deleted, but replaced with a { "category": "DELETED", "indexed_at": [deletion time] } tombstone document, which could then be handled in the reindex pipelines with a drop processor. But that seems a bit messy. Is there a better way? (I just realized the drop processor idea won't work anyway: it wouldn't delete the document that already exists in v2-a; it would only keep the tombstone from being reindexed.)

Welcome to our community! :smiley:

That would work if you reindexed into a new index and then used aliases to manage how your application talks to the data.

e.g. reindex into a new index, point the alias at the new index, then delete the old index.
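Something along these lines, with the index names purely as an example (here v2-a is the alias your application queries, and the suffixed names are the physical indices):

POST /_aliases
{
  "actions": [
    { "add":    { "index": "v2-a-000002", "alias": "v2-a" } },
    { "remove": { "index": "v2-a-000001", "alias": "v2-a" } }
  ]
}

DELETE /v2-a-000001

Because each sync rebuilds the whole index from v1, documents that were deleted in v1 simply never make it into the new index, so there is nothing to clean up afterwards.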
