Reindex, creating multiple destination documents from each source document

mconner · June 6, 2019, 4:52pm

I want to build a new index from an existing one, but split each source document into a set of destination documents. Is it possible to do this with reindex, or is there some other way to do it server-side?

On a whim, I tried doing this in the script: ctx._source = [[:],[:]]; ... and then setting properties,
but unsurprisingly it errors out with "java.util.ArrayList cannot be cast to java.util.Map".

spinscale · June 7, 2019, 12:42pm

Hey,

take this as an example: the script uses different logic depending on the value of run, and creates a different _id in the destination index for each value.

DELETE test,output

PUT test/_doc/my_doc
{
  "first" : "first",
  "second" : "second",
  "third" : "third"
}

POST _reindex
{
  "source" : { "index":  "test"},
  "dest" : { "index" : "output" },
  "script" : {
    "lang": "painless",
    "source": """
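// Change run to 1, 2, then 3 on successive reindex calls; each pass writes one output document per source document.
def run = 3;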
if (run == 1) {
  ctx._source.key = ctx._source.first;
  ctx._id = ctx._id + "_1";
} else if (run == 2) {
  ctx._source.key = ctx._source.second;
  ctx._id = ctx._id + "_2";
} else {
  ctx._source.key = ctx._source.third;
  ctx._id = ctx._id + "_3";
}
ctx._source.remove("first");
ctx._source.remove("second");
ctx._source.remove("third");
    """
  }
}

GET output/_search

This also means you have to run the reindex API n times (once for each destination document you want per source document), not once.

--Alex

mconner · June 7, 2019, 1:06pm

Sorry, I guess I wasn't clear. The number of documents depends on the data in the source document. We have some data in an array of arbitrary length. I want to transform it into a new index, with each element of the array becoming a new row (and containing some other data from the document). I guess I could find the max n, run the reindex once for each position up to n, and in each run filter out the documents whose array is too short for that position.
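A rough sketch of that multi-pass idea, written as a small Python helper that only builds the n reindex request bodies (it does not send them). The array field name `items`, the output field `item`, and the index names are placeholders, not taken from the real data. The script relies on reindex honouring `ctx.op = "noop"` to skip source documents whose array is too short for the current pass.

```python
import json

# Painless script shared by every pass; params.i is the array position for this pass.
# "item" is a hypothetical name for the field that receives the single element.
SCRIPT = """
def items = ctx._source[params.field];
if (items == null || params.i >= items.size()) {
  ctx.op = "noop";
} else {
  ctx._source.item = items[params.i];
  ctx._source.remove(params.field);
  ctx._id = ctx._id + "_" + params.i;
}
"""


def reindex_bodies(source_index, dest_index, field, max_n):
    """Build one _reindex request body per array position 0..max_n-1."""
    return [
        {
            "source": {"index": source_index},
            "dest": {"index": dest_index},
            "script": {
                "lang": "painless",
                "source": SCRIPT,
                "params": {"field": field, "i": i},
            },
        }
        for i in range(max_n)
    ]


if __name__ == "__main__":
    # Print the bodies so each can be sent as POST _reindex, one pass at a time.
    for body in reindex_bodies("test", "output", "items", 3):
        print(json.dumps(body))
```

Every pass still reads the whole source index, so the cost grows with the longest array.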

But is there no way to directly transform a document into 0 or more docs with a script in a single pass?

spinscale · June 7, 2019, 2:09pm

Indeed, there is no way; reindex is basically a one-to-one action. A small Python script that does a scroll search on one side and a bulk index on the other sounds like the way to go here, from my perspective.
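A minimal sketch of such a script, assuming the official Python client (elasticsearch-py) and its `helpers.scan` and `helpers.bulk` functions. The array field name and the output field `item` are hypothetical; adjust them to the real mapping. The splitting logic is kept in a plain function so it can be checked without a cluster.

```python
def split_hit(hit, array_field, dest_index):
    """Yield one bulk action per element of the array in a single search hit.

    A document with a missing or empty array yields nothing, so it produces
    zero destination documents.
    """
    source = hit["_source"]
    # Fields other than the array are copied onto every output document.
    shared = {k: v for k, v in source.items() if k != array_field}
    for i, element in enumerate(source.get(array_field) or []):
        yield {
            "_op_type": "index",
            "_index": dest_index,
            "_id": f"{hit['_id']}_{i}",  # same _id scheme as the reindex example
            "_source": {**shared, "item": element},  # "item" is a placeholder name
        }


def explode_index(es, source_index, dest_index, array_field):
    """Scroll over source_index and bulk-index the split documents."""
    # Imported here so split_hit can be used without the client installed.
    from elasticsearch import helpers

    hits = helpers.scan(es, index=source_index, query={"query": {"match_all": {}}})
    actions = (
        action
        for hit in hits
        for action in split_hit(hit, array_field, dest_index)
    )
    return helpers.bulk(es, actions)
```

Usage would be along the lines of `explode_index(Elasticsearch("http://localhost:9200"), "test", "output", "items")`, where the URL and names are again only examples.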

mconner · June 7, 2019, 2:39pm

Re: scroll search: Yes, that's the road I was starting to go down. However, it now goes from an in-server process which (as I understand it) gets a snapshot of the data, to a client process, with all that additional communication overhead, and presumably not a snapshot. I think it would be useful if Elasticsearch could provide reindex-like semantics, but allow for 0 or more documents to be created (ctx._docs, instead of ctx._source, ctx._id, or maybe ctxs?)

spinscale · June 10, 2019, 1:12pm

A scroll search is a point-in-time snapshot, taken at the moment you start it. Changes made after that will not be taken into account.

system · July 8, 2019, 1:12pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
