Fetch 1 million records

Hi!

I'm trying to fetch more than 1 million records from Elasticsearch.
Every approach I've tried so far has been slow.

At the moment I'm using a sliced scroll search.

I just need a few fields from the documents, not the whole document. Is there a way I can optimize this?

Thanks!

I think this describes what you can try, but read the warnings about text fields.

This looks like a similar question

Let us know how it works out :slight_smile:

You can use source filtering to include or exclude fields from the result.

For instance, let's say I have an index my_published_stories and want to loop over all documents but fetch only the publishing date (pubdate) and processing time (processing_time). Then I'd do something like this:

GET my_published_stories/_search
{
  "_source": ["pubdate", "processing_time"]
}

Hope this answers your question.

Hi @Bernt_Rostad

Yep, I'm already doing that :smile: Thanks!

Thanks for that, @rugenl!

I'm going to give the doc values a go! I'll post the results when I have some! :smile:

Hi @rugenl!

I've started using a sliced scroll with doc_values, but I'm not seeing any major improvement. Any other ideas?

Thanks!
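For anyone following along, here is a rough sketch of what "sliced scroll with doc values" could look like as request bodies, one per slice. It is only an illustration: the helper names `sliced_scroll_bodies` and `field_values` are made up for this example, the field names are borrowed from the source-filtering post above, and it assumes those fields have doc values enabled (text fields do not).

```python
from typing import Any, Dict, List


def sliced_scroll_bodies(fields: List[str], slices: int,
                         page_size: int = 1000) -> List[Dict[str, Any]]:
    """Build one search body per slice (hypothetical helper, not a library call)."""
    if slices < 2:
        # Elasticsearch rejects a slice "max" that is not greater than 1.
        raise ValueError("slices must be at least 2")
    return [
        {
            "slice": {"id": slice_id, "max": slices},  # this worker's share of the scroll
            "size": page_size,                         # hits per scroll page (assumed value)
            "sort": ["_doc"],                          # index order, the cheapest sort for scrolling
            "_source": False,                          # do not load or return _source
            "docvalue_fields": fields,                 # return values from doc values instead
            "query": {"match_all": {}},
        }
        for slice_id in range(slices)
    ]


def field_values(hit: Dict[str, Any], name: str) -> List[Any]:
    """Doc value fields come back under hit["fields"], each as a list of values."""
    return hit.get("fields", {}).get(name, [])


bodies = sliced_scroll_bodies(["pubdate", "processing_time"], slices=4)
```

Each body would be sent by its own worker to something like `POST my_published_stories/_search?scroll=1m`, and every worker then keeps scrolling its own scroll ID until no hits come back.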

Ok, what language are you using?

I put some timing in my scripts to measure how much time I spent waiting on the scroll versus processing the data. The waiting time includes network time, but it told me whether to improve my search or my processing.
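Something along these lines is what I mean by timing the two sides separately. This is a minimal sketch, not my actual script: `timed_consume` is a name invented for this example, and `pages` stands in for whatever iterator yields your scroll pages (here a fake generator that sleeps, so it runs without a cluster).

```python
import time
from typing import Any, Callable, Iterable, Tuple


def timed_consume(pages: Iterable[Any],
                  process: Callable[[Any], None]) -> Tuple[float, float, int]:
    """Return (seconds waiting for pages, seconds processing them, page count)."""
    waiting = 0.0
    processing = 0.0
    count = 0
    iterator = iter(pages)
    while True:
        start = time.perf_counter()
        try:
            page = next(iterator)      # blocks on the search/scroll call and the network
        except StopIteration:
            waiting += time.perf_counter() - start
            break
        fetched = time.perf_counter()
        waiting += fetched - start
        process(page)                  # your own work on the page
        processing += time.perf_counter() - fetched
        count += 1
    return waiting, processing, count


def fake_pages():
    """Stand-in for a scroll: three pages, each delayed by 10 ms."""
    for i in range(3):
        time.sleep(0.01)
        yield [i]


seen = []
waiting, processing, count = timed_consume(fake_pages(), seen.append)
print(f"waited {waiting:.3f}s, processed {processing:.3f}s over {count} pages")
```

If the waiting number dominates, the search side (or the network) is the bottleneck; if the processing number dominates, tuning the query will not help much.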

PowerShell was MUCH slower than the Python Elasticsearch DSL, by orders of magnitude.
