I want to retrieve 5 billion _ids from an index in Elasticsearch. The index itself is around 3TB, and I'm using the scroll feature to do this. However, the runtime is pretty poor: it takes around 11 seconds to retrieve 100k entries, and around 0.5 seconds to retrieve 5k entries. I've tried a smaller size (5000 instead of 100k) and sorting by _doc to get better performance, but throughput is still not good enough to pull out 5 billion entries in anything less than 7 days of non-stop running.
I want to return all the _ids for all indices in my cluster, but I can't get through even one index without a very long runtime. What else can I do to improve performance? Should I add more data or router nodes? Will that even help? Or is there something I can change in my scroll query?
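As a sanity check on that estimate, here is the back-of-the-envelope arithmetic as a small Ruby sketch. It assumes the two timings quoted above hold steady for the entire scroll, which is probably optimistic, and `days_to_scroll` is only an illustrative helper name.

```ruby
# Rough runtime estimate from the timings quoted above.
TOTAL_IDS = 5_000_000_000

# Days needed to scroll `total` ids at a fixed batch size and per-batch latency.
def days_to_scroll(total, batch_size, seconds_per_batch)
  ids_per_second = batch_size / seconds_per_batch.to_f
  total / ids_per_second / 86_400.0
end

small = days_to_scroll(TOTAL_IDS, 5_000, 0.5)    # 5k hits per 0.5 s
large = days_to_scroll(TOTAL_IDS, 100_000, 11.0) # 100k hits per 11 s

puts format('size=5000:   %.1f days', small) # prints "size=5000:   5.8 days"
puts format('size=100000: %.1f days', large) # prints "size=100000: 6.4 days"
```

Both batch sizes work out to roughly 9k to 10k ids per second, so changing size alone barely moves the total.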
Here's what my search looks like:
result = @client.search index: indexname,
                        scroll: '1m',
                        body: {
                          size: 5000,
                          sort: ['_doc'],
                          query: { match_all: {} }
                        }
result_data = @client.scroll body: {scroll_id: scroll_id, scroll: '1m'}
I'm also passing in the scroll_id returned by the previous scroll call.
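For reference, the two calls above fit together roughly as in the following loop. This is a minimal sketch, not my exact code: `each_scrolled_id` is a hypothetical helper name, `client` is assumed to respond to `search` and `scroll` the way the elasticsearch-ruby client does, and the response is assumed to be indexable with string keys like a Hash.

```ruby
# Yields every _id from one index by following the scroll cursor.
# `each_scrolled_id` is a hypothetical helper; `client` is assumed to
# behave like the elasticsearch-ruby client (#search and #scroll).
def each_scrolled_id(client, index_name, batch_size: 5000, keep_alive: '1m')
  result = client.search(
    index: index_name,
    scroll: keep_alive,
    body: { size: batch_size, sort: ['_doc'], query: { match_all: {} } }
  )

  # An empty page of hits means the scroll is exhausted.
  until result['hits']['hits'].empty?
    result['hits']['hits'].each { |hit| yield hit['_id'] }

    # Always pass the scroll_id from the most recent response.
    result = client.scroll(
      body: { scroll_id: result['_scroll_id'], scroll: keep_alive }
    )
  end
end
```

Usage would look something like `each_scrolled_id(@client, indexname) { |id| out.puts id }`, where `out` is whatever file or sink the ids are written to.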