Reindexing 20TB document tips

zatom · June 29, 2019, 2:54pm

Hi,

Every time I try to reindex this index, which holds 20 million documents (20 TB), with the reindex API, it stops halfway. I even tried sliced scroll (slicing) to break it into multiple jobs and parallelize the reindexing, but it still stops after some time. Sometimes a node goes down or the cluster's overall health goes bad while this is running. Is there a more efficient way to reindex this much data without knocking over my cluster?

Thank you

Christian_Dahlqvist · June 29, 2019, 4:27pm

What does your reindex job look like?

zatom · June 29, 2019, 5:53pm

I am running it through the Kibana console. The following request worked until about halfway through, then knocked down a node and stopped completely. My request looks like this:

POST _reindex?slices=20&timeout=60m&scroll=60m
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "destination_index"
  }
}

Christian_Dahlqvist · June 29, 2019, 5:54pm

Is this a single index? How many shards does it have?

zatom · June 29, 2019, 5:55pm

I had 20 shards. Yes, it's going to a single index.

Christian_Dahlqvist · June 29, 2019, 5:56pm

So 1TB per shard??? What is the specification of the cluster?
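For reference, the arithmetic behind that figure can be sketched in a few lines of Python. The 50 GB target below is an assumption (a commonly quoted rule-of-thumb upper bound for shard size), not something measured on this cluster, and the function names are made up for illustration.

```python
import math


def shard_size_tb(total_tb: float, primary_shards: int) -> float:
    """Average primary shard size in TB for an evenly distributed index."""
    return total_tb / primary_shards


def shards_for_target(total_tb: float, target_gb: float) -> int:
    """Primary shards needed to keep each shard at or below target_gb.

    Uses decimal units (1 TB = 1000 GB) to keep the estimate simple.
    """
    total_gb = total_tb * 1000
    return math.ceil(total_gb / target_gb)


# Figures from this thread: 20 TB across 20 primary shards.
print(shard_size_tb(20, 20))      # 1.0 TB per shard
# Assumed 50 GB per-shard target (rule of thumb, not from the thread).
print(shards_for_target(20, 50))  # 400 shards
```

The same calculation for the 5 TB index with 20 shards gives 0.25 TB per shard, which is still well above that assumed target.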

Christian_Dahlqvist · June 29, 2019, 5:58pm

What changes are you making for the destination index?

zatom · June 29, 2019, 5:59pm

Yes. Are more shards better? With 20 shards it was faster while it lasted. Later I tried with fewer shards; it was slower and eventually stopped.

zatom · June 29, 2019, 6:07pm

@Christian_Dahlqvist
I have another index with a similar number of documents but only 5 TB in size. I am trying to reindex that one as well. I tried it with 20 shards, and that failed too.

Christian_Dahlqvist · June 29, 2019, 6:08pm

I have never dealt with shards that large, so I will have to leave it to someone else, I am afraid.

iukea · June 29, 2019, 9:10pm

Hold on, @zatom.

Are you reindexing 20 TB of data on one node?

Do you have a master node and a couple of data nodes?

iukea · June 29, 2019, 9:21pm

I mean, if you are doing it on one node, I've got mad respect for you.

To increase reindex speed (this will max out the server or servers at 100% utilization for a bit), I would do:

POST _reindex
{
  "size": 10000,
  "source": {
    "index": "oldBigAssIndex"
  },
  "dest": {
    "index": "newBigassIndex"
  }
}

By default, if the size is not set, it only does 1,000 documents at a time.
The max size you can use is 10,000 without editing some properties.

Reference
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

DavidTurner · June 30, 2019, 5:53am

I'm not aware of any regular tests of creating such large shards, so you're a little off the beaten track here. I'd like to know more about how it is failing. Why is the node going down? Does it log anything about its failure?

What are the mapping and settings for this index?

zatom · July 1, 2019, 2:17pm

I have 1 master node and 3 data nodes.

zatom · July 1, 2019, 2:21pm

Putting size outside the source limits the reindex to only that number of documents. I think you mean putting size inside the source, where it sets the batch size. I will try that too.
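To make the placement explicit, here is a minimal Python sketch that builds the request body with size nested under source, where it acts as the scroll batch size. The helper name reindex_body and the batch size of 5000 are illustrative assumptions; the index names are the placeholders used earlier in the thread.

```python
import json


def reindex_body(source_index: str, dest_index: str, batch_size: int = 1000) -> dict:
    """Build a _reindex request body (hypothetical helper, for illustration).

    "size" is placed inside "source", so it controls how many documents
    each scroll batch fetches. It is deliberately not set at the top
    level, where it would cap the total number of documents copied.
    """
    return {
        "source": {"index": source_index, "size": batch_size},
        "dest": {"index": dest_index},
    }


body = reindex_body("source_index", "destination_index", batch_size=5000)
# Paste the printed JSON under "POST _reindex" in the Kibana console.
print(json.dumps(body, indent=2))
```

Larger batches mean fewer round trips but more memory per request, so with documents averaging around 1 MB it may be worth starting below the default rather than above it.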

system · July 29, 2019, 2:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
