Reindex, creating multiple destination documents from each source document

I want to build a new index from an existing one, but split each source document into a set of destination documents. Is it possible to do this with reindex, or is there some other way to do it server-side?

On a whim, I tried doing this in the script: ctx._source = [[:],[:]]; ... and then setting properties,
but unsurprisingly, it errors out with "java.util.ArrayList cannot be cast to java.util.Map".

Hey,

Take this as an example: the script uses different logic on each run to pick a field and create a distinct _id in the destination index.

DELETE test,output

PUT test/_doc/my_doc
{
  "first" : "first",
  "second" : "second",
  "third" : "third"
}

POST _reindex
{
  "source" : { "index":  "test"},
  "dest" : { "index" : "output" },
  "script" : {
    "lang": "painless",
    "source": """
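// Set run to 1, 2, or 3 for each pass; every pass copies a different field
// into "key" and writes it under its own _id suffix.
def run = 3;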
if (run == 1) {
  ctx._source.key = ctx._source.first;
  ctx._id = ctx._id + "_1";
} else if (run == 2) {
  ctx._source.key = ctx._source.second;
  ctx._id = ctx._id + "_2";
} else {
  ctx._source.key = ctx._source.third;
  ctx._id = ctx._id + "_3";
}
ctx._source.remove("first");
ctx._source.remove("second");
ctx._source.remove("third");
    """
  }
}

GET output/_search

This also means you have to run the reindex API n times (once per variation of the script, changing run each time) rather than once.

--Alex

Sorry, I guess I wasn't clear. The number of destination documents depends on the data in the source document. We have some data in an array of arbitrary length. I want to transform it into a new index, with each element of the array becoming a new row (and containing some other data from the document). I guess I could find the max n, run the reindex once for each position, and in each run filter out the documents whose array is shorter than that position.

But is there no way to directly transform a document into 0 or more docs with a script in a single pass?
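
For reference, the multi-pass workaround described above could be sketched roughly as follows. This is only an illustration under assumptions: the array field name ("items"), the output field name ("item"), and the helper build_reindex_body are hypothetical and not from this thread. It builds one _reindex request body per array position and uses ctx.op = 'noop' to skip documents whose array is too short.

```python
def build_reindex_body(source_index, dest_index, array_field, position):
    """Build a _reindex body that extracts one array position per pass.

    Hypothetical helper: "item" is an assumed name for the output field.
    """
    script = """
def items = ctx._source[params.field];
if (items == null || params.i >= items.size()) {
  // Array too short for this pass: write nothing for this document.
  ctx.op = 'noop';
} else {
  ctx._source.item = items[params.i];
  ctx._source.remove(params.field);
  ctx._id = ctx._id + '_' + params.i;
}
"""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {
            "lang": "painless",
            "source": script,
            "params": {"field": array_field, "i": position},
        },
    }


# One body per position, 0 .. max_n - 1; each would be sent as POST _reindex.
max_n = 3  # assumed to have been determined beforehand
bodies = [build_reindex_body("test", "output", "items", i) for i in range(max_n)]
```

It still means max n passes over the source index, so it only makes sense when the arrays are short.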

Indeed, there is no way; reindex is basically a one-to-one action. A small Python script that runs a scroll search on one side and a bulk index on the other sounds like the way to go here, from my perspective.
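
A minimal sketch of such a script, assuming the official elasticsearch Python client and its helpers.scan / helpers.bulk functions. The field names ("items", "item"), the _id scheme, and the function names split_hit and reindex_split are made up for illustration, so adjust them to the real mapping.

```python
def split_hit(hit, array_field, dest_index):
    """Yield one bulk action per element of the array in a search hit."""
    source = hit["_source"]
    # Fields copied into every destination document (everything but the array).
    shared = {k: v for k, v in source.items() if k != array_field}
    for position, element in enumerate(source.get(array_field) or []):
        yield {
            "_op_type": "index",
            "_index": dest_index,
            "_id": f"{hit['_id']}_{position}",  # assumed _id scheme
            "_source": {**shared, "item": element},  # "item" is an assumed name
        }


def reindex_split(es_url, source_index, dest_index, array_field):
    """Scroll over source_index and bulk-index the split documents."""
    # Imported here so split_hit can be used without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(es_url)
    # helpers.scan wraps the scroll API and yields raw hits.
    hits = helpers.scan(es, index=source_index,
                        query={"query": {"match_all": {}}})
    actions = (action
               for hit in hits
               for action in split_hit(hit, array_field, dest_index))
    # helpers.bulk sends the actions in chunks via the bulk API.
    return helpers.bulk(es, actions)
```

A source document with an empty or missing array simply produces no actions, which covers the "0 or more documents" case.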

Re: scroll search: Yes, that's the road I was starting to go down. However, it moves the work from an in-server process which (as I understand it) gets a snapshot of the data to a client process, with all the additional communication overhead, and presumably without a snapshot. I think it would be useful if Elasticsearch could provide reindex-like semantics but allow 0 or more documents to be created (ctx._docs instead of ctx._source and ctx._id, or maybe ctxs?).

A scroll search is a point-in-time snapshot, taken at the moment you start it. Changes made after that will not be taken into account.
