Strategy for keeping multiple indices in the same cluster synchronized

I have four indices in the same cluster (running ES v7.17, in case that matters) with different mappings and settings. For reasons that will soon become clear, we'll call them v1, v2-a, v2-b, and v2-c.

v1 contains three different categories of documents, a, b, and c. It is updated from our application using the bulk index API. Each document includes an indexed_at date field with the timestamp of when the document was last indexed.
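For illustration, each bulk request looks roughly like this (the fields other than category and indexed_at are just placeholders):

POST /v1/_bulk
{ "index": { "_id": "doc-1" } }
{ "category": "a", "indexed_at": "2024-05-01T12:00:00Z", "title": "..." }
{ "index": { "_id": "doc-2" } }
{ "category": "b", "indexed_at": "2024-05-01T12:00:00Z", "title": "..." }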

For v2, we split the three categories into three separate indices. They are kept up to date with the Reindex API. First, we get the most recent index time from the target index:

POST /v2-a/_search
{
  "_source": ["indexed_at"],
  "query": {
    "match_all": {}
  },
  "size": 1,
  "sort": [
    {
      "indexed_at": {
        "order": "desc"
      }
    }
  ]
}

and then use that timestamp in the source query of a reindex request:

POST _reindex
{
  "source": {
    "index": "v1",
    "query": {
      "bool": {
        "must": [
          {
            "term": {
              "category": {
                "value": "a"
              }
            }
          },
          {
            "range": {
              "indexed_at": {
                "gt": "[most recent indexed_at timestamp in v2-a from above query]"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "v2-a",
    "pipeline": "v1-to-v2-a"
  }
}

This part is working great. The problem is deletes. I can't figure out a reasonably efficient way to delete documents from v2-a/b/c that no longer exist in v1. I've tried using an aggregation to find all IDs that exist in only one index and then using _delete_by_query to remove those IDs from the corresponding v2 index. This was working in small-scale tests, but it starts failing (by deleting far too many documents) once we get past about 50,000 documents.
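The delete step itself is simple enough; it's roughly this (the list of IDs is just a placeholder for whatever the aggregation returns):

POST /v2-a/_delete_by_query
{
  "query": {
    "ids": {
      "values": ["<IDs found in v2-a but not in v1>"]
    }
  }
}

The hard part is producing an accurate list of IDs to feed it once the indices get large.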

The only thing I can think to try is to change our v1 indexing strategy so that deleted documents aren't actually deleted, but replaced with a { "category": "DELETED", "indexed_at": [deletion time] } tombstone document, which could then be handled in the reindex pipelines with a drop processor. But that seems a bit messy. Is there a better way? (I just realized the drop processor idea won't work anyway: it wouldn't delete the document that already exists in v2-a; it would only keep the tombstone from being reindexed.)

Welcome to our community! :smiley:

That would work if you reindexed into a new index and then used aliases to manage how your application talks to the data.

e.g. reindex into a new index, point the alias at the new index, then delete the old index.
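Something along these lines, with the index names purely as an example (here v2-a is the alias your application queries, and the suffixed names are the physical indices):

POST /_aliases
{
  "actions": [
    { "add":    { "index": "v2-a-000002", "alias": "v2-a" } },
    { "remove": { "index": "v2-a-000001", "alias": "v2-a" } }
  ]
}

DELETE /v2-a-000001

Because each sync rebuilds the whole index from v1, documents that were deleted in v1 simply never make it into the new index, so there is nothing to clean up afterwards.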
