How to fetch ~12M documents (maybe even more) quickly from ES using the scroll API?

harshil · November 30, 2017, 7:08pm

Hello,

I have a requirement to fetch ~12M documents (possibly more). I tried using the scroll API with:
GET {index}/{type}/_search?scroll=1m
{
  "size": 10000,
  "sort": ["_doc"],
  "_source": ["field_1", "field_2"],
  "stored_fields": "_none",
  "query": {...}
}
It takes ~170 ms to fetch the first page of 10,000 documents with the required fields in the response. So retrieving 12M documents in pages of 10,000 would take ~5 min. However, I would like to get these 12M docs as fast as possible. I am thinking of bumping "index.max_result_window" up to 100,000. Does it even make sense to raise it above the default value, or am I completely off? What should I do to make ES return this many documents (~12M) within a couple of minutes, or even faster? I have 20 data nodes, 3 master nodes, 20 primary shards, and 1 replica.
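One option that fits a 20-shard index is a sliced scroll, which splits a single scroll into independent slices that can be drained in parallel. The following is only a minimal sketch, assuming an Elasticsearch version that supports the "slice" parameter (5.0 or later). The base URL, the index name, and the helper names (http_post, scroll_slice, fetch_all) are hypothetical, and the {type} path segment from the request above is left out for brevity.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_post(url, body):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def scroll_slice(base_url, index, query, slice_id, max_slices, post=http_post):
    """Yield every hit belonging to one slice of a sliced scroll."""
    body = {
        "size": 10000,
        "sort": ["_doc"],
        "_source": ["field_1", "field_2"],
        # Elasticsearch requires max to be greater than 1.
        "slice": {"id": slice_id, "max": max_slices},
        "query": query,
    }
    page = post(f"{base_url}/{index}/_search?scroll=1m", body)
    # An empty page signals that this slice is exhausted.
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        page = post(
            f"{base_url}/_search/scroll",
            {"scroll": "1m", "scroll_id": page["_scroll_id"]},
        )


def fetch_all(base_url, index, query, max_slices=20, post=http_post):
    """Drain all slices concurrently and return the combined hits."""
    def drain(slice_id):
        return list(scroll_slice(base_url, index, query, slice_id, max_slices, post))

    with ThreadPoolExecutor(max_workers=max_slices) as pool:
        results = pool.map(drain, range(max_slices))
        return [hit for hits in results for hit in hits]
```

A call would look like fetch_all("http://localhost:9200", "my_index", {"match_all": {}}). Collecting 12M hits in one list is only for illustration; writing each slice to disk as it arrives would keep memory bounded. Whether this actually reduces wall-clock time depends on where the bottleneck is, so it is worth measuring.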

dadoonet · November 30, 2017, 7:40pm

I'm not sure that getting 100000 docs will accelerate things.

170 ms to get 10,000 documents does not seem that much to me. But maybe I'm wrong.

Anyway, do you have x-pack monitoring? Anything interesting to see on it?

BTW, you did not mention how fast the query itself is.
Can you show it?

harshil · November 30, 2017, 8:53pm

Hi @dadoonet,

Let me correct myself. The ~170 ms is the query time: I see "took": 170 in the response. I bumped "index.max_result_window" up to 20,000 and added more logging, and now I see
page number: 173 -> Total Hit: 20000 -> Time took in ms: 74 -> page time(network+query): 461ms
for each scroll request. So the total query time is 56 sec across all pagination requests.

No, I do not have X-Pack monitoring. My ES cluster is on AWS, so let me check the logs/metrics there. But the bottleneck I see here is the network (serialization + deserialization + transmit time), which adds another ~4 min for 12M docs.

So you are right: bumping up "index.max_result_window" is not going to help here.
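As a back-of-the-envelope check on these figures, the arithmetic can be sketched as follows. It treats the single logged page (74 ms "took", 461 ms end to end, 20,000 hits) as representative of every page, which is an assumption, so the totals only approximate the summed figures reported above. The function name estimate_seconds is made up for this example.

```python
import math


def estimate_seconds(total_docs, page_size, ms_per_page):
    """Estimate wall-clock seconds for fetching all pages sequentially."""
    pages = math.ceil(total_docs / page_size)
    return pages * ms_per_page / 1000.0


TOTAL_DOCS = 12_000_000
PAGE_SIZE = 20_000

query_s = estimate_seconds(TOTAL_DOCS, PAGE_SIZE, 74)   # server-side "took" only
total_s = estimate_seconds(TOTAL_DOCS, PAGE_SIZE, 461)  # network + query

print(f"pages: {math.ceil(TOTAL_DOCS / PAGE_SIZE)}")
print(f"query time: {query_s:.1f} s")
print(f"end to end: {total_s:.1f} s ({total_s / 60:.1f} min)")
```

With these inputs, 600 sequential pages give roughly 44 s of query time out of about 277 s (4.6 min) end to end, so most of the time is spent outside the query itself. A larger page size only lowers the number of round trips, not the number of bytes that have to be serialized and sent.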

harshil · November 30, 2017, 11:48pm

Hi @dadoonet,

Is there a way I can reduce the network latency, e.g. by gzipping the response or through any other settings?
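For reference, requesting a gzipped response over HTTP is done with the standard Accept-Encoding header. Below is a minimal client-side sketch using only the Python standard library; the helper names (build_request, decode_body, post_compressed) are hypothetical. Whether the server actually compresses the response depends on its "http.compression" setting and on the hosting service, which this sketch assumes is enabled. Compression reduces the bytes on the wire but adds CPU work on both ends.

```python
import gzip
import json
import urllib.request


def build_request(url, body):
    """Build a JSON POST request that advertises gzip support."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept-Encoding": "gzip",  # ask the server for a compressed response
        },
        method="POST",
    )


def decode_body(raw, content_encoding):
    """Decompress the body if the server gzipped it, then parse the JSON."""
    if content_encoding and "gzip" in content_encoding.lower():
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


def post_compressed(url, body):
    """Send the request and return the parsed (possibly gzipped) response."""
    with urllib.request.urlopen(build_request(url, body)) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

Comparing the per-page time with and without the header on a few scroll requests would show whether the transfer or the (de)serialization dominates.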

system · December 28, 2017, 11:49pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
