Handle big result set?


#1

Hello Elastic experts,

Suppose a query matches a large number of records (say, a few million). What is the best way to handle the results? I want to store them on local disk. Is there a way to stream a big result set? I am concerned that the local machine's memory may not be able to hold the whole result set.

thanks in advance,
Lin


(Colin Goodheart-Smithe) #2

Deep pagination can be achieved using the Scan-Scroll feature: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan
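
As a rough illustration of how scrolling keeps memory use bounded, here is a minimal Python sketch that writes each batch of hits to disk as soon as it arrives. The two callables `start_search` and `next_page` are hypothetical placeholders for whatever client code issues the initial search request (with a scroll timeout) and the follow-up scroll requests; they are not part of any Elasticsearch library. The sketch assumes the responses carry `_scroll_id` and `hits.hits` fields, and that with scan the first response may contain no hits.

```python
import json


def dump_scroll_to_file(start_search, next_page, out_path):
    """Write every hit's _source to out_path, one JSON document per line.

    start_search()      -> first response dict (hypothetical callable).
    next_page(scroll_id) -> next response dict (hypothetical callable).
    Returns the number of documents written.
    """
    written = 0
    first = True
    with open(out_path, "w", encoding="utf-8") as out:
        response = start_search()
        while True:
            hits = response["hits"]["hits"]
            # An empty batch ends the scroll, except on the very first
            # response, which may be empty when search_type=scan is used.
            if not hits and not first:
                break
            for hit in hits:
                out.write(json.dumps(hit["_source"]) + "\n")
            written += len(hits)
            first = False
            # Always pass along the most recent scroll id.
            response = next_page(response["_scroll_id"])
    return written
```

Only one batch is held in memory at a time, so the batch size set on the search request, not the total number of matches, determines the memory footprint.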


#3

Hi Colin,

Thanks for sharing. I looked through the document you referred to and found that sorting may have a cost. In my use case, I just need to find the top N results and do not care about the order of the results within the top N. In that case, what is the most efficient way to write the query?

regards,
Lin


(Colin Goodheart-Smithe) #4

What do you mean by this? To define the top N of something, there has to be some kind of sorting. How are you defining the top N? Do you want the top N scoring documents?


#5

Yes, Colin, you are correct: I need the top N scored documents. I mean that I do not need a strict ascending/descending sort inside the top N documents, as long as the top N documents are returned. Is there an efficient way to implement this? Thanks.

regards,
Lin


(Colin Goodheart-Smithe) #6

Then the scan-scroll feature is what you want. Just don't set any explicit sorting in the scan request.


#7

Thanks Colin,

I read the document (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan). Regarding the statement "Deep pagination with from and size — e.g. ?size=10&from=10000 — is very inefficient as (in this example) 100,000 sorted results have to be retrieved from each shard and resorted in order to return just 10 results.", I am confused: should it be 10,000 sorted results rather than 100,000, matching the from=10000 parameter?

Please feel free to correct me if I am wrong.
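
For reference, the arithmetic behind this question can be sketched as follows. With from/size paging, each shard has to produce its own top (from + size) documents, and the coordinating node then merges the candidates from all shards. The shard count of 5 below is only an assumption for illustration, and `docs_collected` is a hypothetical helper name.

```python
def docs_collected(from_, size, shards):
    """Return (documents each shard must rank, documents merged overall)."""
    per_shard = from_ + size
    return per_shard, per_shard * shards


# from=10000, size=10, assuming an index with 5 shards
per_shard, total = docs_collected(from_=10000, size=10, shards=5)
print(per_shard, total)  # 10010 50050
```

Under these assumptions, roughly 10,010 results per shard are needed to return 10 documents, which is the cost that deep pagination incurs on every page request.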

BTW, another quick question: if I want to use scroll only, without scan (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html), can I combine sorting with scroll? And why is scroll more efficient than ordinary queries?

