I'm trying to fetch more than 1 million records from Elasticsearch.
So far, every approach I've tried has been slow.

At the moment I'm using a sliced scroll search.

I just need a few fields from the documents, not the whole document. Is there a way I can optimize this?


I think this describes what you can try, but read the warnings about text fields.

You can use source filtering to include or exclude fields from the result.

For instance, say I have an index my_published_stories and want to loop over all of its documents but fetch only the publishing date (pubdate) and the processing time (processing_time). Then I'd do something like this:

GET my_published_stories/_search
{
  "_source": ["pubdate", "processing_time"]
}
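Since you are already using a sliced scroll, the same filter can go into each slice's request body. Here is a minimal Python sketch of how those bodies might be built. The function name build_slice_body is hypothetical, and the choice of 4 slices is only an assumption for illustration.

```python
def build_slice_body(slice_id, max_slices, fields):
    """Build the search body for one slice of a sliced scroll.

    Only the listed fields are returned from _source, and sorting by
    _doc lets Elasticsearch skip scoring, the cheapest order for a scroll.
    """
    return {
        "slice": {"id": slice_id, "max": max_slices},
        "_source": list(fields),
        "sort": ["_doc"],
    }


# One body per slice; each could be sent by a separate worker.
fields = ["pubdate", "processing_time"]
bodies = [build_slice_body(i, 4, fields) for i in range(4)]
```

With the official Python client, each body could then be passed to something like elasticsearch.helpers.scan(client, index="my_published_stories", query=body). I have not tested that against your cluster, so treat it as a starting point.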

Hi @rugenl!

I've started using a sliced scroll with doc_values, but I see no major improvement. Any other ideas?


Ok, what language are you using?

I put some timing in my scripts to measure what time I was waiting on the scroll vs. when I was processing the data. The waiting time includes network time, but it let me know whether to improve my search or my process.
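A rough sketch of that kind of timing in Python, not my actual script. The names timed_iter and measure are made up for illustration, and hits stands for whatever iterator yields your scroll results.

```python
import time


def timed_iter(iterable, timings):
    """Yield items while adding the time spent waiting for each one
    (network plus Elasticsearch) to timings["wait"]."""
    it = iter(iterable)
    while True:
        start = time.perf_counter()
        try:
            item = next(it)
        except StopIteration:
            timings["wait"] += time.perf_counter() - start
            return
        timings["wait"] += time.perf_counter() - start
        yield item


def measure(hits, process):
    """Run process(hit) for every hit; return the hit count and timings."""
    timings = {"wait": 0.0, "process": 0.0}
    count = 0
    for hit in timed_iter(hits, timings):
        start = time.perf_counter()
        process(hit)
        timings["process"] += time.perf_counter() - start
        count += 1
    return count, timings
```

If wait dominates, look at the search side (slices, batch size, returned fields). If process dominates, the client code is the bottleneck.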

PowerShell was MUCH slower than the Python Elasticsearch DSL, by orders of magnitude.

