Every day I create an index of approximately 10 GB. Once the index is complete, I need to reindex a certain subset of its documents into a new, filtered index. To filter the documents I'm using a terms query with the terms I want to allow into the new index; my reindex query currently uses around one million terms. From the documentation I read that the maximum number of terms defaults to 65,536, which is considerably lower than the number of terms I need to use.
As a workaround, I'm using the Python API and reindexing the documents in "batches": I split the list of one million terms into batches of 50,000 terms and run one reindex per batch. To keep the reindex operations from running simultaneously, I set the wait_for_completion parameter to True. As far as I can tell, my script is reindexing the data correctly. However, the reindexing is taking far too long. I've read comments about considerably larger indices being reindexed in a day, while mine has taken about 6 hours for just 13 of the 35 batches.
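To make the setup concrete, here is a minimal sketch of the batching loop I'm describing. The index names ("source-index", "filtered-index") and the field name ("my_field") are placeholders, and `es` is assumed to be an `Elasticsearch` client from the elasticsearch-py package:

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def reindex_in_batches(es, terms, batch_size=50_000):
    # One reindex call per batch of allowed terms; wait_for_completion=True
    # makes each call block, so the batches run one after the other.
    for batch in chunked(terms, batch_size):
        es.reindex(
            body={
                "source": {
                    "index": "source-index",  # placeholder name
                    "query": {"terms": {"my_field": batch}},
                },
                "dest": {"index": "filtered-index"},  # placeholder name
            },
            wait_for_completion=True,
        )
```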
I suspect this is because I'm running the 35 reindex operations sequentially. I have two main questions:
Is it possible to approach this problem differently? Can I reindex directly from Elasticsearch with that number of terms? Any ideas are very welcome.
If my current solution is "the only" way to proceed, could you give me some insight into batch size? Is it more efficient to have many batches with few terms each, or just a few batches with many terms?
Many thanks in advance,