I have four indices in the same cluster (running ES v7.17, in case that matters) with different mappings and settings. For reasons that will soon become clear, we can call them v1, v2-a, v2-b, and v2-c.
v1 contains three different categories of documents: a, b, and c. It is updated from our application using the bulk index API. Each document includes an indexed_at date field with the timestamp of when the document was last indexed.
For v2, we split the three categories into three different indices. They are updated using the Reindex API, first by getting the most recent index time from the target index:
POST /v2-a/_search
{
  "_source": ["indexed_at"],
  "query": {
    "match_all": {}
  },
  "size": 1,
  "sort": [
    {
      "indexed_at": {
        "order": "desc"
      }
    }
  ]
}
and then using that in a reindex-by-query:
POST _reindex
{
  "source": {
    "index": "v1",
    "query": {
      "bool": {
        "must": [
          {
            "term": {
              "category": {
                "value": "a"
              }
            }
          },
          {
            "range": {
              "indexed_at": {
                "gt": "[most recent indexed_at timestamp in v2-a from above query]"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "v2-a",
    "pipeline": "v1-to-v2-a"
  }
}
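As a rough illustration, the two steps above could be glued together from a script along these lines. This is only a sketch: the helper names (latest_indexed_at_body, extract_latest_indexed_at, reindex_body) are hypothetical, the code only builds and parses the JSON bodies rather than sending them, and dropping the range clause when the target index is empty is an assumption, not something described above.

```python
from typing import Any, Dict, Optional


def latest_indexed_at_body() -> Dict[str, Any]:
    """Body for POST /<target>/_search: fetch the newest indexed_at value."""
    return {
        "_source": ["indexed_at"],
        "query": {"match_all": {}},
        "size": 1,
        "sort": [{"indexed_at": {"order": "desc"}}],
    }


def extract_latest_indexed_at(search_response: Dict[str, Any]) -> Optional[str]:
    """Pull indexed_at out of the search response, or None if there are no hits."""
    hits = search_response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return hits[0]["_source"]["indexed_at"]


def reindex_body(
    category: str, since: Optional[str], dest_index: str, pipeline: str
) -> Dict[str, Any]:
    """Body for POST _reindex: copy one category from v1, newer than `since`."""
    must = [{"term": {"category": {"value": category}}}]
    if since is not None:
        # Assumption: with an empty target index, copy the whole category.
        must.append({"range": {"indexed_at": {"gt": since}}})
    return {
        "source": {"index": "v1", "query": {"bool": {"must": must}}},
        "dest": {"index": dest_index, "pipeline": pipeline},
    }
```

For v2-a this would be called as reindex_body("a", since, "v2-a", "v1-to-v2-a"), where since comes from the search response for v2-a.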
This part is working great. The problem is deletes. I can't figure out a reasonably efficient way to delete documents from v2-a/b/c that no longer exist in v1. I've tried using an aggregation to find all IDs that only exist in one index and then using _delete_by_query to delete those IDs from the corresponding v2 index. This was working for small-scale tests, but starts failing (by deleting far too many documents) once we get past about 50,000 documents.
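To make the intended comparison concrete, here is a minimal client-side sketch of the same idea: a set difference of document IDs, turned into batched _delete_by_query bodies that use an ids query. It assumes the ID lists have already been fetched (for example by paging through each index), that the v1 list is restricted to the matching category, and that both lists fit in memory. The function names and the batch size of 1000 are hypothetical, and this does not reproduce the aggregation-based attempt.

```python
from typing import Any, Dict, Iterable, Iterator, List


def stale_ids(v1_ids: Iterable[str], v2_ids: Iterable[str]) -> List[str]:
    """Return IDs present in the v2 index but no longer present in v1."""
    live = set(v1_ids)
    return sorted(doc_id for doc_id in set(v2_ids) if doc_id not in live)


def delete_bodies(ids: List[str], batch_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield bodies for POST /<v2-index>/_delete_by_query, one batch of IDs each."""
    for start in range(0, len(ids), batch_size):
        yield {"query": {"ids": {"values": ids[start:start + batch_size]}}}
```

Each yielded body targets only IDs that were explicitly listed, so a mistake in the comparison step cannot widen the delete beyond that list.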
The only thing I can think to try is to change our v1 indexing strategy so that deleted documents aren't actually deleted, but replaced with a { "category": "DELETED", "indexed_at": [deletion time] } document, which can then be handled in the reindex pipelines using a drop processor. But that seems a bit messy. Is there a better way?

I just realized this won't work, because the drop processor won't actually delete the document; it'll just drop it from the reindexing process.