How to improve Scroll runtime for 5 billion record retrieval?

I want to retrieve 5 billion _ids from an Elasticsearch index of around 3TB, and I'm using the scroll feature to do it. However, the runtime is poor: it takes around 11 seconds to retrieve 100k entries, and around 0.5 seconds to retrieve 5k entries. I've tried a smaller size (5000 instead of 100k) and sorting by _doc to get better performance, but the runtime is still not good enough to pull out 5 billion entries in anything less than 7 days of non-stop running. Eventually I want to return all the _ids for all indices in my cluster, but I can't get through even one index without a very long runtime. I'm not sure what else I can do to improve performance. Should I add more data or router nodes? Will that even help? Or is there something I can do with my scroll query?
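To make the timings above concrete, here is a rough back-of-the-envelope estimate in Ruby. It assumes the per-batch latency stays constant for the whole run and that batches are fetched one after another; the helper name `estimated_days` is just for illustration.

```ruby
# Rough estimate of total scroll time from the per-batch timings quoted above.
# Assumes constant latency per batch and strictly sequential requests.
TOTAL_IDS = 5_000_000_000
SECONDS_PER_DAY = 86_400

def estimated_days(total, batch_size, seconds_per_batch)
  batches = total.to_f / batch_size
  batches * seconds_per_batch / SECONDS_PER_DAY
end

large = estimated_days(TOTAL_IDS, 100_000, 11.0) # 100k entries in ~11 s
small = estimated_days(TOTAL_IDS, 5_000, 0.5)    # 5k entries in ~0.5 s

puts format('size 100k: about %.1f days', large)
puts format('size 5k:   about %.1f days', small)
```

Both batch sizes work out to roughly six days of continuous scrolling, so changing the size alone barely moves the total.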

Here's what my search looks like:

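    result = @client.search index: indexname,
                            scroll: '1m',
                            body: {
                              size: 5000,
                              sort: ['_doc'],
                              query: { match_all: {} }
                            }
    # Each follow-up request passes the _scroll_id from the previous response.
    scroll_id = result['_scroll_id']
    result_data = @client.scroll body: { scroll_id: scroll_id, scroll: '1m' }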

I'm also using the scroll_id from the previous scroll response for each subsequent request.
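For reference, the full loop might look roughly like the sketch below. It is not the exact code I'm running: the method name `each_id` and its parameters are made up for illustration, and `client` is assumed to respond to `search`, `scroll` and `clear_scroll` the way the elasticsearch gem's client does in recent versions. Since only the _ids are needed, the sketch also sets `_source: false` so the documents themselves are not returned.

```ruby
# Sketch: yield every _id in an index using the scroll API.
# `client` is assumed to behave like Elasticsearch::Client (elasticsearch gem);
# `each_id` and its keyword arguments are hypothetical names.
def each_id(client, index, batch_size: 5000, keep_alive: '1m')
  response = client.search(
    index: index,
    scroll: keep_alive,
    body: {
      size: batch_size,
      sort: ['_doc'],          # index order, no scoring
      _source: false,          # only the _id is needed, skip the document body
      query: { match_all: {} }
    }
  )
  scroll_id = response['_scroll_id']

  begin
    loop do
      hits = response['hits']['hits']
      break if hits.empty?     # an empty page means the scroll is exhausted

      hits.each { |hit| yield hit['_id'] }

      # Always continue with the scroll id from the most recent response.
      response = client.scroll(body: { scroll_id: scroll_id, scroll: keep_alive })
      scroll_id = response['_scroll_id']
    end
  ensure
    # Release the search context on the cluster once done (or on error).
    client.clear_scroll(body: { scroll_id: scroll_id }) if scroll_id
  end
end
```

Usage would be something like `each_id(@client, indexname) { |id| count += 1 }`, with `count` initialised beforehand.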

What does the latency look like if you do not sort and use the natural sorting order?

Why do you need to do this in the first place?

This is for a project to get metrics on the number of ids that make it into ES from a database that we use. Without the sort, it takes 13 seconds to retrieve 100k entries.
