How to improve Scroll runtime for 5 billion record retrieval?

Sheraz_Tariq · April 13, 2020, 6:18am

I want to retrieve 5 billion _ids from an index in Elasticsearch. The index itself is around 3TB, and I'm using the scroll feature to do this. However, I'm getting pretty poor runtime: it takes around 11 seconds to retrieve 100k entries, and around 0.5 seconds to retrieve 5k entries. I've tried using a smaller size (5000 instead of 100k) and sorting by _doc to get better performance, but I'm still not seeing runtime good enough to pull out 5 billion entries in anything less than 7 days of nonstop running. I want to return all the _ids for all indices in my cluster, but I can't get through even one index without a very long runtime. I'm not sure what else I can do to improve performance. Should I add more data or router nodes? Will that even help? Or is there something I can do with my scroll query?
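As a rough sanity check on the 7-day figure, the two timings quoted above can be extrapolated to the full 5 billion documents. This is only a back-of-the-envelope sketch: it assumes per-batch latency stays constant over the whole scroll, and the helper name `estimated_days` is made up for illustration.

```ruby
TOTAL_DOCS = 5_000_000_000
SECONDS_PER_DAY = 86_400.0

# Estimate total wall-clock days for a single sequential scroll,
# assuming every batch takes the same time as the measured one.
def estimated_days(total_docs, batch_size, seconds_per_batch)
  batches = (total_docs.to_f / batch_size).ceil
  batches * seconds_per_batch / SECONDS_PER_DAY
end

puts estimated_days(TOTAL_DOCS, 100_000, 11.0).round(1) # 100k entries per 11 s
puts estimated_days(TOTAL_DOCS, 5_000, 0.5).round(1)    # 5k entries per 0.5 s
```

Both batch sizes come out at roughly 6 days (about 6.4 and 5.8 respectively), which matches the "7 days of nonstop running" estimate and suggests that batch size alone makes little difference here.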

Here's what my search looks like:

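    # Initial search opens the scroll context and returns the first batch.
    result = @client.search index: indexname,
                            scroll: '1m',
                            body: {
                              size: 5000,
                              sort: ['_doc'],
                              query: { match_all: {} }
                            }

    # Each follow-up request passes the scroll_id from the previous response.
    scroll_id = result['_scroll_id']
    result_data = @client.scroll body: { scroll_id: scroll_id, scroll: '1m' }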

I'm also passing the scroll_id from the previous scroll response into each subsequent scroll call.
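For reference, the two calls above can be put together into a complete loop along these lines. This is a minimal sketch, not the code actually used in the post: the method name `each_id` is hypothetical, `client` is assumed to be an elasticsearch-ruby client like `@client` above, and `_source: false` is an addition (not in the original query) on the assumption that only the _ids are needed.

```ruby
# Yields every _id in the index, one scroll batch at a time.
def each_id(client, index, batch_size: 5000, keep_alive: '1m')
  response = client.search(
    index: index,
    scroll: keep_alive,
    body: {
      size: batch_size,
      sort: ['_doc'],          # index order, no scoring
      _source: false,          # assumption: only _ids are needed, so skip the document body
      query: { match_all: {} }
    }
  )

  loop do
    hits = response['hits']['hits']
    break if hits.empty?       # an empty batch means the scroll is exhausted

    hits.each { |hit| yield hit['_id'] }

    # Always use the scroll_id from the most recent response.
    response = client.scroll(
      body: { scroll_id: response['_scroll_id'], scroll: keep_alive }
    )
  end
ensure
  # Release the scroll context on the cluster once done (or on error).
  client.clear_scroll(body: { scroll_id: response['_scroll_id'] }) if response
end
```

It would be called as `each_id(@client, indexname) { |id| ... }`. Whether dropping `_source` makes a measurable difference on this cluster is untested.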

Christian_Dahlqvist · April 13, 2020, 7:43am

What does the latency look like if you do not sort and use the natural sorting order?

Why do you need to do this in the first place?

Sheraz_Tariq · April 13, 2020, 7:51am

This is for a project to get metrics on the number of ids that make it into ES from a database that we use. Without sort, I get 13 seconds to get 100k entries.

system · May 11, 2020, 7:52am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
