I have a requirement to fetch ~12M documents (and the count could grow beyond that). I tried using the scroll API with:
GET {index}/{type}/_search?scroll=1m
{
    "size": 10000,
    "sort": ["_doc"],
    "_source": ["field_1", "field_2"],
    "query": {...}
}
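To keep paging, each subsequent request goes to the scroll endpoint with the _scroll_id returned by the previous response (the id below is just a placeholder):

POST _search/scroll
{
    "scroll": "1m",
    "scroll_id": "<_scroll_id from the previous response>"
}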
It takes ~170 ms to fetch the first page of 10,000 documents with the required fields in the response, so retrieving 12M documents in pages of 10,000 would take ~5 min. However, I would like to get these 12M docs as fast as possible. I am thinking of bumping "index.max_result_window" up to 100,000, maybe? Does it even make sense to raise it above the default? Am I completely off? What should I do to make ES return this many (~12M) documents within a couple of minutes, or even faster? The cluster has 20 data nodes, 3 master nodes, 20 primary shards, and 1 replica.
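One thing I am considering is sliced scroll, so several workers can pull independent slices of the same result set in parallel. A minimal sketch of one worker's request, where the slice count of 20 is just my assumption to match the 20 primary shards:

GET {index}/{type}/_search?scroll=1m
{
    "slice": { "id": 0, "max": 20 },
    "size": 10000,
    "sort": ["_doc"],
    "_source": ["field_1", "field_2"],
    "query": {...}
}

Each worker would use a different "id" from 0 to 19 and drive its own scroll cursor to completion.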
Let me correct myself: the ~170 ms is the query time only ("took": 170 in the response). I bumped "index.max_result_window" up to 20,000, added more logging, and for each scroll request I see something like:
page number: 173 -> Total Hit: 20000 -> Time took in ms: 74 -> page time(network+query): 461ms
So across all 600 pages (12M / 20,000), the query time alone adds up to ~56 sec.
No, I do not have X-Pack monitoring. My ES cluster is on AWS; let me check the logs/metrics there. But I can see the bottleneck here is the network path (serialization + deserialization + transmit time): 600 pages × ~461 ms is ~4.6 min of wall-clock time, of which only ~56 sec is query time, so the network adds another ~4 min for 12M docs.
So you are right: bumping up "index.max_result_window" is not going to help here.
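Since the remaining cost is mostly response serialization and transfer, I will also try shrinking the payload with the filter_path response filter (keeping _scroll_id so the scroll can continue) and enabling HTTP compression between client and cluster. A sketch:

GET {index}/{type}/_search?scroll=1m&filter_path=_scroll_id,hits.hits._source
{
    "size": 10000,
    "sort": ["_doc"],
    "_source": ["field_1", "field_2"],
    "query": {...}
}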