Hi,
I have a case where I need to fetch a (very) large number of documents.
Example:
I have a list of 15,000 entity IDs that I need to export data for.
My docs have an entity_id field.
What I do so far is partition this input list of IDs, and then for every partition use a terms plus time-range query to fetch the data with scroll. My partition size is 100.
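A sketch of roughly what I'm doing (Python with the elasticsearch-py scan helper, which drives the scroll API underneath; the endpoint, index name, and the xxx/yyy time bounds are placeholders, not my real setup):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

entity_ids = list(range(1, 15001))  # stand-in for my real list of 15,000 IDs
PARTITION_SIZE = 100

def partitions(ids, size):
    """Yield consecutive slices of `ids`, each with at most `size` elements."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

for batch in partitions(entity_ids, PARTITION_SIZE):
    query = {
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"entity_id": batch}},
                    {"range": {"timestamp": {"gte": "xxx", "lte": "yyy"}}},
                ]
            }
        }
    }
    # scan() pages through every match for this partition via scroll
    for hit in scan(es, index="data", query=query):
        print(hit["_source"])  # placeholder for the actual export step
```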
What you describe sounds like a reasonable way to do this.
Does this mean you're only retrieving 100 documents in each batch? If so, I would expect a larger batch size to be faster. You'll need to experiment to find the best value for your system.
Also, if you're only retrieving 100 documents each time, then you don't need to scroll.
No,
It means that in one batch I am fetching all the data for 100 entities (out of 15,000), which can be hundreds of thousands of documents.
So I do need scroll.
In the SQL world this would be, for each partition: select * from data where entity_id in (1, 2, 3, ..., 100) and timestamp between xxx and yyy
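For reference, the per-partition query body is roughly the direct translation of that SQL: the terms clause plays the role of "in (...)" and the range clause plays the role of "between" (a sketch; xxx/yyy are the same placeholders as above):

```python
# Per-partition query body: terms ~ SQL "in (...)", range ~ SQL "between"
query = {
    "query": {
        "bool": {
            "filter": [
                {"terms": {"entity_id": [1, 2, 3]}},  # ...up to 100 IDs
                {"range": {"timestamp": {"gte": "xxx", "lte": "yyy"}}},
            ]
        }
    }
}
```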