How to fetch ~12M documents (maybe even more) quickly from ES using the scroll API?


I have a requirement to fetch ~12M documents (possibly more). I tried using the scroll API with:
GET {index}/{type}/_search?scroll=1m
{
  "size": 10000,
  "sort": ["_doc"],
  "_source": ["field_1", "field_2"],
  "stored_fields": "_none",
  "query": {...}
}
It takes ~170ms to fetch the first page of 10,000 documents with the required fields in the response. So retrieving 12M documents with pagination (10,000 docs per page) would take ~5min. However, I would like to get these 12M docs as fast as possible. I am thinking of bumping "index.max_result_window" up to 100,000. Does it even make sense to raise it above the default value, or am I completely off? What should I do to make ES return this many documents (~12M) within a couple of minutes, or even faster? I have 20 data nodes, 3 master nodes, 20 primary shards and 1 replica.
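For reference, the scroll pagination described above can be sketched as a small loop. This is a minimal sketch, not the poster's actual client code: `send` is a hypothetical transport function (method, path, JSON body in; parsed JSON response out), and `scroll_all` is an illustrative name. The only Elasticsearch endpoints assumed are the initial `_search?scroll=...` request and the follow-up `_search/scroll` calls that pass the `scroll_id` back.

```python
from typing import Callable, Iterator

# Hypothetical transport: (HTTP method, path, JSON body) -> parsed JSON response.
Send = Callable[[str, str, dict], dict]


def scroll_all(send: Send, index: str, body: dict, keep_alive: str = "1m") -> Iterator[dict]:
    """Yield every hit, following the scroll cursor until a page comes back empty."""
    # First request opens the scroll context and returns the first page.
    page = send("POST", f"/{index}/_search?scroll={keep_alive}", body)
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            yield from page["hits"]["hits"]
            # Each follow-up request renews the keep-alive and returns the next page.
            page = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the scroll context on the cluster once we are done.
        if scroll_id:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

Each iteration is one network round trip, so the total time is roughly the number of pages multiplied by the per-page time, which is why the page size and the per-page overhead both matter.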

I'm not sure that fetching 100,000 docs per page will speed things up.

170ms to get 10,000 documents does not seem like much to me. But maybe I'm wrong.

Anyway, do you have X-Pack monitoring? Is there anything interesting to see in it?

BTW, you did not mention how fast the query is by itself.
Can you show it?

Hi @dadoonet,

Let me correct myself. The ~170ms is the query time: I see "took": 170 in the response. I bumped "index.max_result_window" up to 20,000, added more logging, and I see
page number: 173 -> Total Hit: 20000 -> Time took in ms: 74 -> page time(network+query): 461ms
for each scroll request. So the total query time is 56sec across all pagination requests.

No, I do not have X-Pack monitoring. My ES cluster is on AWS, so let me check the logs/metrics there. But I see that the bottleneck here is the network (serialization + deserialization + transmit time), which adds another ~4 mins for 12M docs.
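The breakdown can be checked with a quick back-of-the-envelope calculation. This is a rough sketch that assumes the single logged page (74 ms "took", 461 ms total per request, 20,000 hits) is representative of every page; the measured total query time in the thread was 56 sec, so real per-page numbers clearly vary. The function name `estimate` is illustrative only.

```python
import math


def estimate(total_docs: int, page_size: int, took_ms: float, page_ms: float):
    """Return (pages, query seconds, total seconds, non-query overhead seconds)."""
    pages = math.ceil(total_docs / page_size)
    query_s = pages * took_ms / 1000   # time Elasticsearch reports in "took"
    total_s = pages * page_ms / 1000   # wall-clock time per request (network + query)
    return pages, query_s, total_s, total_s - query_s


pages, query_s, total_s, overhead_s = estimate(12_000_000, 20_000, took_ms=74, page_ms=461)
print(f"pages={pages}, query={query_s:.1f}s, total={total_s / 60:.1f} min, "
      f"overhead={overhead_s / 60:.1f} min")
```

Under that assumption, 600 pages cost about 4.6 minutes in total, of which roughly 3.9 minutes is spent outside the query itself, which lines up with the ~4 mins attributed to serialization and transfer.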

So you are right, bumping up "index.max_result_window" is not going to help here.

Hi @dadoonet,

Is there a way I can reduce the network latency, by gzipping the response or through any other settings?
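On the gzip part of the question: Elasticsearch can compress HTTP responses when the `http.compression` setting is enabled on the node and the client sends an `Accept-Encoding: gzip` header. Whether that setting is enabled (or configurable) on the AWS-hosted cluster in this thread is not known, so the following is only a sketch of the client side, using Python's standard library. The helper names `build_scroll_request` and `decode_body` and the base URL are illustrative assumptions.

```python
import gzip
import json
import urllib.request
from typing import Optional


def build_scroll_request(base_url: str, scroll_id: str,
                         keep_alive: str = "1m") -> urllib.request.Request:
    """Build a follow-up scroll request that asks the server for a gzip response."""
    body = json.dumps({"scroll": keep_alive, "scroll_id": scroll_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_search/scroll",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Ask for a compressed response; the server may still reply uncompressed.
            "Accept-Encoding": "gzip",
        },
    )


def decode_body(raw: bytes, content_encoding: Optional[str]) -> dict:
    """Decompress the payload if the server says it is gzip-encoded, then parse JSON."""
    if content_encoding and "gzip" in content_encoding.lower():
        raw = gzip.decompress(raw)
    return json.loads(raw)
```

A caller would send the request with `urllib.request.urlopen(req)` and pass `resp.read()` together with `resp.headers.get("Content-Encoding")` to `decode_body`. Compression trades CPU for bandwidth, so it helps most when transfer time, rather than JSON serialization, dominates the per-page overhead.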

