Could deep pagination with scroll (100 million docs) be a problem?

(Felipe Santos) #1

I have a lot of background tasks that each need to paginate through around 15 million documents on average. These tasks need to run many times, and the cluster holds around 100 million documents.

The same Elasticsearch cluster is also used for other tasks that don't need pagination.

In the documentation I saw that scroll is a very expensive operation, and there are other databases where I could do deep pagination with some techniques and it would perform well.

The question is: is Elasticsearch the right tool for this?

(David Pilato) #2

If you need to export a lot of results, scroll is the way to go.

Give it a try. Should not be an issue IMO.

(Felipe Santos) #3

Even if I need to run scroll many times concurrently with different queries?
According to the Elasticsearch manual, creating and maintaining scroll contexts seems to be very expensive.

(David Pilato) #4

It can be, because while a scroll context is open the segments it is using cannot be deleted after a merge, so they are kept around for longer.

But you have to test it yourself.

(Felipe Santos) #5

And what about this:

"Deep Paging in Distributed Systems
To understand why deep paging is problematic, let’s imagine that we are searching within a single index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the coordinating node, which then sorts all 50 results in order to select the overall top 10.

Now imagine that we ask for page 1,000—results 10,001 to 10,010. Everything works in the same way except that each shard has to produce its top 10,010 results. The coordinating node then sorts through all 50,050 results and discards 50,040 of them!

You can see that, in a distributed system, the cost of sorting results grows exponentially the deeper we page. There is a good reason that web search engines don’t return more than 1,000 results for any query."
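The arithmetic in the quoted passage can be reproduced with a few lines of Python. This is only a sketch of the simple model the quote describes (every shard returns its top `from + size` hits and the coordinating node merges them); the function name `coordinator_cost` is made up for illustration.

```python
from typing import Tuple


def coordinator_cost(shards: int, from_: int, size: int) -> Tuple[int, int]:
    """Return (hits the coordinating node receives, hits it discards)."""
    per_shard = from_ + size       # each shard must produce its top (from + size) hits
    received = shards * per_shard  # the coordinating node collects all of them
    discarded = received - size    # only `size` hits make it into the response
    return received, discarded


# First page: results 1 to 10 on five shards.
print(coordinator_cost(shards=5, from_=0, size=10))
# Deep page: results 10,001 to 10,010 on five shards.
print(coordinator_cost(shards=5, from_=10_000, size=10))
```

Under this model the number of hits the coordinating node handles grows in proportion to the offset, so every deeper page costs more than the one before it.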

Does scroll solve this problem? If so, how?

(Jörg Prante) #6

With scan/scroll search, you create a query with a cursor. The server stores the last cursor for a specified time. If you continue the scroll, the server can proceed at the last known position.

The operation is cheap when you omit sorting and relevance ranking. Elasticsearch provides the option to sort by _doc, which means the documents are returned in the order Lucene reads them from the index, without sorting or relevance ranking.
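As a rough illustration of the cursor flow described above, here is a minimal Python sketch of a scroll loop sorted by _doc. The request paths and bodies follow the documented scroll REST API, but `send` is a hypothetical transport callable (method, path, JSON body in, parsed JSON out) that you would back with your own HTTP client, and the page size and keep-alive values are arbitrary assumptions. Check the request shapes against the Elasticsearch version you run.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical transport: send(method, path, body) -> parsed JSON response.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_all(send: Send, index: str, query: Dict[str, Any],
               page_size: int = 1000, keep_alive: str = "1m") -> Iterator[Dict[str, Any]]:
    """Yield every hit matching `query`, one scroll page at a time."""
    # Sorting by _doc skips scoring and sorting, the cheapest order for a full export.
    body = {"size": page_size, "sort": ["_doc"], "query": query}
    resp = send("POST", f"/{index}/_search?scroll={keep_alive}", body)
    scroll_id: Optional[str] = resp.get("_scroll_id")
    try:
        while resp["hits"]["hits"]:
            yield from resp["hits"]["hits"]
            # Continue from the position stored on the server and renew the keep-alive.
            resp = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the scroll context as soon as we are done instead of waiting for expiry.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

Clearing the scroll explicitly matters when many exports run concurrently, since each open context holds server-side resources until it is cleared or its keep-alive expires.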

There are rare occasions when you really need to compute relevance scores for 10,000 documents just to fetch results 10,001 to 10,010. Such a search is indeed very expensive and should be avoided, for example by smart filtering of the documents to shrink the result set. The most popular purpose of relevance ranking is top-k search, that is, letting the search engine present the k most relevant documents in the result set first (and forget about the rest).

(David Pilato) #7

Yes, it does. The fastest option is to sort by _doc (the default, I think). "search_after" also helps with deep pagination, but for extracting a whole result set I'd use scroll.

The point is that scroll makes the extraction consistent, so you won't get duplicate or missing results.
That is not the case with standard pagination, as new documents might have been indexed between two calls.

Scroll takes care of this.
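For comparison, the "search after" approach mentioned above could look roughly like the Python sketch below. Each request passes the sort values of the last hit from the previous page in `search_after`, so no shard has to rebuild all the earlier pages. `send` is a hypothetical transport callable you would back with your own HTTP client, and the sort fields in the usage example (`timestamp`, `doc_id`) are assumptions; the sort needs a unique tiebreaker field so that the order is total. On its own, this approach keeps no server-side context, so it does not give the consistent view that scroll does.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical transport: send(method, path, body) -> parsed JSON response.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def search_after_all(send: Send, index: str, query: Dict[str, Any],
                     sort: List[Dict[str, str]],
                     page_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield hits page by page, resuming after the last hit's sort values."""
    body: Dict[str, Any] = {"size": page_size, "query": query, "sort": sort}
    while True:
        hits = send("POST", f"/{index}/_search", body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Each hit carries the values it was sorted by; use the last one as the cursor.
        body = {**body, "search_after": hits[-1]["sort"]}
```

A call might look like `search_after_all(send, "my-index", {"match_all": {}}, [{"timestamp": "asc"}, {"doc_id": "asc"}])`, where the index and field names are placeholders.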
