How to handle a big result set?

Hello Elastic experts,

Suppose a query matches a large number of records (say, a few million). What is the best way to handle the results? I want to store them on local disk. Is there a way to stream a big result set? I am concerned that the local machine's memory may not be able to hold the whole result set.

Thanks in advance,
Lin

Deep pagination can be achieved using the Scan-Scroll feature: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan
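
To make the streaming-to-disk idea concrete, here is a minimal sketch in Python using only the standard library. It assumes an Elasticsearch version that accepts the scroll ID in a JSON body on `/_search/scroll` and treats a `_doc` sort as the unsorted, scan-style order; older 1.x releases used `search_type=scan` with a slightly different request shape. The helper names (`http_post`, `iter_hits`, `dump_to_file`), the page size, and the keep-alive value are illustrative choices, not part of any official client.

```python
import json
import urllib.request


def http_post(base_url):
    """Return a function that POSTs a JSON body to base_url + path."""
    def post(path, body):
        req = urllib.request.Request(
            base_url + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return post


def iter_hits(post, index, query, page_size=1000, keep_alive="1m"):
    """Yield hits page by page so only one page is held in memory."""
    # No explicit sort on a field: "_doc" order is assumed to be the cheapest.
    body = {"size": page_size, "query": query, "sort": ["_doc"]}
    page = post("/%s/_search?scroll=%s" % (index, keep_alive), body)
    # An empty page of hits signals that the scroll is exhausted.
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit
        page = post("/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": page["_scroll_id"]})


def dump_to_file(hits, path):
    """Write each hit's _source as one JSON line; return the count written."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for hit in hits:
            out.write(json.dumps(hit["_source"]) + "\n")
            count += 1
    return count
```

A call might look like `dump_to_file(iter_hits(http_post("http://localhost:9200"), "my_index", {"match_all": {}}), "results.jsonl")`, where the host, index name, and output file are placeholders. Passing the HTTP function in as a parameter keeps the paging logic separate from the transport, so it can be swapped for an official client if one is available.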

Hi Colin,

Thanks for sharing. I looked through the document you referred to and found that sorting may have a cost. In my use case, I just need to find the top N results, and I do not care about the order of results within the top N. In this case, what is the most efficient way to write the query?

regards,
Lin

What do you mean by this? To define the top N of something, there has to be some kind of sorting. How are you defining the top N? Do you want the top N scoring documents?

Yes, Colin, you are correct: I need the top N scored documents. I mean that I do not need a strict ascending/descending sort within the top N documents, as long as the top N documents are returned. Is there an efficient way to implement this? Thanks.

regards,
Lin

Then the scan-scroll feature is what you want. Just don't set any explicit sorting in the scan request.

Thanks Colin,

I read the document (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan). Regarding the statement "Deep pagination with from and size — e.g. ?size=10&from=10000 — is very inefficient as (in this example) 100,000 sorted results have to be retrieved from each shard and resorted in order to return just 10 results.", I am confused: should it be 10,000 sorted results rather than 100,000, to match the from=10000 parameter?

Please feel free to correct me if I am wrong.
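
As a rough sketch of the arithmetic in question: with from/size paging, each shard has to build its own sorted list of the first from + size hits, and the coordinating node then merges the lists from all shards. The function names below are hypothetical, and the shard count of 5 is only an assumed example value.

```python
def docs_per_shard(from_, size):
    """Sorted hits each shard must produce for one from/size page."""
    return from_ + size


def docs_at_coordinator(from_, size, shards):
    """Hits the coordinating node must merge across all shards."""
    return shards * docs_per_shard(from_, size)


# The quoted example: ?size=10&from=10000
per_shard = docs_per_shard(10000, 10)        # 10010
merged = docs_at_coordinator(10000, 10, 5)   # 50050 with an assumed 5 shards
```

Under these assumptions the per-shard figure for the quoted request is 10,010, which is close to the 10,000 suggested above.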

BTW, another quick question: if I want to use scroll alone, without scan (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html), can I combine sorting with scroll? And why is using scroll more efficient than ordinary queries?