I have 2 sets of indexes, indexes A_* and indexes B_*. I need to create a parent-child structure in B_*. B_* already contains the parent documents, and A_* contains the child documents. So, essentially, I need to copy the child documents from A_* into B_* with some logic in the middle that matches child documents to parent documents based on matching on several fields that serve as a unique key.
A_* contains about 40 indexes with document counts ranging between 100-250 million. Each index is between 100-500 GB. B_* contains 16 indexes with 15 million documents each and of size 20 GB each.
I have tried to do this via a python script, with the main logic being the following:
doc_chunk = helpers.scan(self.es, index=some_index_from_A, size=4000, scroll='5m')
actions = self.doc_iterator(doc_chunk)
deque(helpers.parallel_bulk(self.es, actions, chunk_size=1000, thread_count=4))
The function doc_iterator
scrolls through the iterator returned by helpers.scan
and, based on values of certain fields in a given child document, determines the id of that document's parent. For each document, it yields indexing actions that index the child documents under the appropriate parent in B_*.
I've tried several different approaches to create this parent-child index, but nothing seems to work:
- Running the script in parallel using xargs results in BulkIndexingErrors and leads to, at most, only 1/3 of the corpus being indexed. If this worked, it would be the ideal approach as it would cut down this whole process to 2-4 days.
- Running the python script in 1 process doesn't result in BulkIndexingErrors, but it only indexes about 22-28 million documents, at which point a read timeout occurs, and the whole process just hangs indefinitely. This is the less ideal approach as in the best case it would take 7-8 days to finish. During one of my attempts to run it this way, I was monitoring the cluster in Kibana and noticed that searches had spiked to 30,000 documents/second, after which they immediately plummeted to 0 and never picked up afterwards. Indexing tapered off at that point.
- I have tried different values for scan size, chunk size, and thread count. I get the fastest performance for 1 process with scan size of 6000, chunk size of 1000, and thread count of 6, but I also noticed the aforementioned read spike with this setting, so it seems like I may be reading too much. Taking it down to a scan size of 4000 still resulted in the read timeouts (I was unable to monitor the search rate at that setting).
Some more details:
- ES version: 5.2.1
- Nodes: 6
- Primary shards: 956
- Replicas: 76
- I am currently required to run the script from a different server from the one where ES is running.
Any tips that could help with finishing the parent-child index in as few days as possible would be greatly appreciated. Thank you in advance!