Copying huge index to create parent-child structure

I have two sets of indexes: A_* and B_*. I need to create a parent-child structure in B_*. B_* already contains the parent documents, and A_* contains the child documents. Essentially, I need to copy the child documents from A_* into B_*, with some logic in the middle that matches each child document to its parent based on several fields that together serve as a unique key.

A_* contains about 40 indexes with document counts ranging from 100 to 250 million; each index is between 100 and 500 GB. B_* contains 16 indexes, each with 15 million documents and about 20 GB in size.

I have tried to do this with a Python script whose main logic is the following:

from collections import deque
from elasticsearch import helpers

doc_chunk = helpers.scan(self.es, index=some_index_from_A, size=4000, scroll='5m')
actions = self.doc_iterator(doc_chunk)
# maxlen=0 drains the generator without buffering every result in memory
deque(helpers.parallel_bulk(self.es, actions, chunk_size=1000, thread_count=4), maxlen=0)

The doc_iterator function walks the iterator returned by helpers.scan and, based on the values of certain fields in a given child document, determines the ID of that document's parent. For each child document, it yields an indexing action that indexes it under the appropriate parent in B_*.
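For context, here is a hedged sketch of what such a generator could look like. The key fields (field_a, field_b), the parent-ID scheme, and the target index/type names are placeholders rather than the actual schema; in ES 5.x, the bulk helpers accept a _parent metadata key that routes a child to its parent's shard:

# Hypothetical sketch only: field names and the parent-id derivation are
# placeholders for whatever the real unique key is.
def doc_iterator(self, hits):
    for hit in hits:
        src = hit['_source']
        parent_id = '{0}:{1}'.format(src['field_a'], src['field_b'])
        yield {
            '_op_type': 'index',
            '_index': 'some_index_from_B',   # target index in B_*
            '_type': 'child_type',
            '_id': hit['_id'],
            '_parent': parent_id,            # ES 5.x parent-child routing
            '_source': src,
        }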

I've tried several different approaches to create this parent-child index, but nothing seems to work:

  • Running the script in parallel using xargs raises BulkIndexError exceptions and leads to, at most, only 1/3 of the corpus being indexed (see the error-handling sketch after this list). If this worked, it would be the ideal approach, as it would cut the whole process down to 2-4 days.
  • Running the Python script in a single process doesn't raise BulkIndexError, but it only indexes about 22-28 million documents before a read timeout occurs and the whole process hangs indefinitely. This is the less ideal approach, as even in the best case it would take 7-8 days to finish. During one of these runs I was monitoring the cluster in Kibana and saw the search rate spike to 30,000/second, then immediately plummet to 0 and never recover; indexing tapered off at that point.
  • I have tried different values for scan size, chunk size, and thread count. I get the fastest performance in a single process with a scan size of 6000, a chunk size of 1000, and a thread count of 6, but I also saw the aforementioned search spike with these settings, so it seems I may be reading too much. Dropping to a scan size of 4000 still resulted in read timeouts (I was unable to monitor the search rate at that setting).
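For the BulkIndexError case, here is a hedged sketch of one way to surface failed actions instead of letting them abort a run. raise_on_error=False is a real parallel_bulk option; the retry handling itself is only illustrative:

# Inspect each bulk result instead of draining blindly, so failed actions
# can be collected for retry rather than killing the whole run.
failed = []
for ok, item in helpers.parallel_bulk(self.es, actions, chunk_size=1000,
                                      thread_count=4, raise_on_error=False):
    if not ok:
        failed.append(item)  # e.g. rejected executions under load
if failed:
    print('{0} actions failed; retry or log them'.format(len(failed)))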

Some more details:

  • ES version: 5.2.1
  • Nodes: 6
  • Primary shards: 956
  • Replicas: 76
  • I am currently required to run the script from a different server from the one where ES is running.

Any tips that could help with finishing the parent-child index in as few days as possible would be greatly appreciated. Thank you in advance!

I don't understand what you mean by "Replicas: 76". Do only 76 of your primary shards have a single replica, and none of the other primaries? Or does each primary shard have 76 replicas?

Either way, it sounds like you have a lot of shards for the number of nodes you have, and I would highly recommend reducing this ratio.
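As an illustration only (the index pattern and host are placeholders), replica counts can be lowered through the index settings API, e.g. from Python:

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://es-host:9200'])  # placeholder host
# Drop every matching index to a single replica; pick the count that
# fits your redundancy needs.
es.indices.put_settings(index='b_*',
                        body={'index': {'number_of_replicas': 1}})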

Things are timing out while you're reindexing. If you look in your logs I bet you'll see lots of long GC messages.

First, I'd recommend upgrading: 5.4 has a number of improvements that reduce the amount of GC needed. Then I'd also break your request down into smaller chunks and use filters to index (e.g.) IDs 1..1000000, then 1000001..2000000, etc.

Not least of all, if something goes wrong, you can just restart from the latest chunk instead of starting from the beginning again.
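A hedged sketch of that chunking idea, assuming a numeric doc_num field and a 40-million-document ID space purely for illustration:

from collections import deque
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['http://es-host:9200'])  # placeholder host
CHUNK = 1000000

for start in range(0, 40000000, CHUNK):  # assumed id range
    # Restrict each scan to one id range so a crashed run can resume here.
    query = {'query': {'range': {'doc_num': {'gte': start, 'lt': start + CHUNK}}}}
    hits = helpers.scan(es, index='some_index_from_A', query=query,
                        size=4000, scroll='5m')
    actions = doc_iterator(hits)  # same transform as in the question
    deque(helpers.parallel_bulk(es, actions, chunk_size=1000, thread_count=4),
          maxlen=0)
    print('finished chunk starting at', start)  # checkpoint for restarts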
