Can a single _reindex request read "index_a" from a remote cluster and "index_a" from the current cluster, and reindex both into "index_b"?

I have "index_a" on a remote Elasticsearch cluster that looks like this:

{
   _index: "index_a",
   _type: "_doc",
   _id: "1",
   _score: 1,
   _source: {
      customer_id: "1234",
      customer_name: "spider",
      message: "does what ever"
   }
}, 
{
   _index: "index_a",
   _type: "_doc",
   _id: "2",
   _score: 1,
   _source: {
      customer_id: "3333",
      customer_name: "pig",
      message: "spider-pid does"
   }
}

I also have "index_a" (yes, it's the same name!) on the current Elasticsearch cluster, the one I'm running the _reindex on, and it looks like this:

{
   _index: "index_a",
   _type: "_doc",
   _id: "2",
   _score: 1,
   _source: {
      customer_id: "3333",
      customer_name: "pig",
      message: "spider-pid does"
   }
},
{
   _index: "index_a",
   _type: "_doc",
   _id: "3",
   _score: 1,
   _source: {
      customer_id: "9876",
      customer_name: "coronavirus",
      message: "stay safe and at home"
   }
}

As you can see, some docs duplicate those in the first "index_a" above, but there is also new data there that I want to keep!

Eventually, what I want to end up with in my current Elasticsearch cluster is this index_b:

{
   _index: "index_b",
   _type: "_doc",
   _id: "1",
   _score: 1,
   _source: {
      customer_id: "1234",
      customer_name: "spider",
      message: "does what ever"
   }
}, 
{
   _index: "index_b",
   _type: "_doc",
   _id: "2",
   _score: 1,
   _source: {
      customer_id: "3333",
      customer_name: "pig",
      message: "spider-pid does"
   }
},
{
   _index: "index_b",
   _type: "_doc",
   _id: "3",
   _score: 1,
   _source: {
      customer_id: "9876",
      customer_name: "coronavirus",
      message: "stay safe and at home"
   }
}

I know for a fact that I can reach this result with two separate _reindex requests: the first from index_a on the remote cluster to index_b on the current cluster, and the second from index_a on the current cluster to index_b on the current cluster.
But running those two _reindex requests is VERY wasteful with big data, because each request basically runs over every doc ID one by one and writes/overwrites it.
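For reference, the two-request approach described above could be sketched like this in Python. This is only a minimal sketch that builds the two request bodies and prints them; `reindex_body` is a hypothetical helper name, the host is the same placeholder used in this post (a real `host` value also needs a port), and actually sending the bodies to `POST _reindex` is left out.

```python
import json


def reindex_body(source_index, dest_index, remote_host=None):
    """Build a _reindex request body (hypothetical helper for illustration).

    The "remote" block is added only when remote_host is given; without it
    the source index is read from the cluster that receives the request.
    """
    source = {"index": source_index}
    if remote_host is not None:
        source["remote"] = {"host": remote_host}
    return {"source": source, "dest": {"index": dest_index}}


# Request 1: index_a on the remote cluster -> index_b on the current cluster.
# "http://remote_cluster/" is the placeholder host from this post.
remote_pass = reindex_body("index_a", "index_b", remote_host="http://remote_cluster/")

# Request 2: index_a on the current cluster -> index_b on the current cluster.
local_pass = reindex_body("index_a", "index_b")

for body in (remote_pass, local_pass):
    print(json.dumps(body, indent=2))
```

Each body would be posted to the current cluster's _reindex endpoint in turn. Docs that share an _id are written twice, which is the overhead in question.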

When trying to do this in a single _reindex request, I tried this:

POST http://current_cluster/_reindex

{
  "source": {
    "remote": {
      "host": "http://remote_cluster/"
    },
    "index": ["index_a-from-remote", "index_a-of-current"] // renamed here to make clear which is which
  },
  "dest": {
    "index": "index_b"
  }
}

The response says there is no "index_a-of-current" in the remote cluster, which makes sense: with "remote" set in the source, this type of _reindex request only reads indices from the remote Elasticsearch cluster.

So my question is:

Is there a way to perform a single _reindex request that takes both "index_a" from the remote cluster and "index_a" from the current cluster, and reindexes them both into "index_b" on the current cluster?

I would be happy if anyone could shed some light on this, as I've tried a bunch of other things in the request and read the Reindex API documentation, and haven't found an answer yet.
Thanks for any help!

If you need to reindex everything into one index, running the two _reindex requests is the way to go (the main question is whether that is really needed, or whether you can query across the two indices instead). If you do need a single index to query, you could use the script functionality in the Reindex API to avoid indexing documents that have already been indexed (whether those get overwritten anyway depends on your setup; see the docs at https://www.elastic.co/guide/en/elasticsearch/reference/7.6/docs-reindex.html).
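One way to avoid rewriting docs that are already in index_b is sketched below. Note that this swaps in a different mechanism from the script approach mentioned above: it relies on the Reindex API's documented `"op_type": "create"` on the destination together with `"conflicts": "proceed"`, so docs whose _id already exists are skipped rather than overwritten. The function names are hypothetical, and `simulate_create_only` is only a toy in-memory model of that behaviour using the sample docs from the question, not a call to Elasticsearch.

```python
import json


def create_only_reindex_body(source_index, dest_index):
    """Body for a second pass that only creates docs missing from dest."""
    return {
        # Keep going when a doc already exists instead of aborting.
        "conflicts": "proceed",
        "source": {"index": source_index},
        # Only create docs whose _id is not yet in the destination.
        "dest": {"index": dest_index, "op_type": "create"},
    }


def simulate_create_only(dest_docs, source_docs):
    """Toy model: copy only the ids that dest does not already contain."""
    merged = dict(dest_docs)
    skipped = 0
    for doc_id, doc in source_docs.items():
        if doc_id in merged:
            skipped += 1  # would be reported as a version conflict
        else:
            merged[doc_id] = doc
    return merged, skipped


# index_b after the first pass (docs copied from the remote index_a).
index_b = {
    "1": {"customer_id": "1234", "customer_name": "spider"},
    "2": {"customer_id": "3333", "customer_name": "pig"},
}
# index_a on the current cluster.
local_index_a = {
    "2": {"customer_id": "3333", "customer_name": "pig"},
    "3": {"customer_id": "9876", "customer_name": "coronavirus"},
}

merged, skipped = simulate_create_only(index_b, local_index_a)
print(json.dumps(create_only_reindex_body("index_a", "index_b"), indent=2))
print(sorted(merged), skipped)  # ['1', '2', '3'] 1
```

With create-only semantics the copy written first wins, so the order of the two passes matters whenever the duplicated docs are not identical.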

