SCROLL internals


(Kasia) #1

Hi,
The question is about scrolling in ES v.5.1 cluster architecture.

In the docs, there's a nice explanation of the deep pagination problem with from/size:
you know from that what the coordinating node and each shard computes whenever a new page is being requested.

Could someone explain how scrolling works at the same level of abstraction?
For instance:

  1. Which node does the search context reside on in the cluster?
  2. What does the search context contain after the initial search? (all documents matching the query? some of them? any reference to the segments cotaining the documents matching the query? something else?)
  3. What do particular shards do when the initial search is launched?
  4. What's the responsability of the coordination node when the initial search is launched?
  5. What do particular shards do when the scroll API to get the next bunch of data is lauched?
  6. What's the responsability of the coordination node when the scroll API to get the next bunch of data is lauched?
  7. Does scroll support sorting in 5.1 (only sorting by _doc is mentioned as the most efficient option)?
  8. How does the scroll relates to query_then_fetch | dfs_query_then_fetch search types? In 2.x the search_type had to be SCAN, right?
  9. Can scrolling be combined with routing?
  10. What is the sliced scroll for? Any use case where it could be useful would be appreciated.
  11. Finally, is it for maintenance purposes exclusively?

Thanks in advance,
Kasia


(Jörg Prante) #2

For scrolling, the search context is not used. It's the scroll context. The scroll context resides on all shards involved in the index.

The scroll context contains the total hits, the max score, and last emitted doc number (of the Lucene shard), and the keep-alive time for timing out the scroll context.

Nothing special. The scroll context is opened.

The coordination node has no responsibility. It works like in a usual search operation.

The shards deliver the next documents, like in a usual search operation, continuing from the last emitted doc.

See 4. There is nothing special regarding scroll.

Yes.

SCAN is gone in 5.x.

query_then_fetch and dfs_query_then_fetch are modes for the search operation how to compute relevance scoring on the shards and how to retrieve the computed documents. This is not specific to scroll, it's a mode that can be set for all kinds of search operations.

Document routing allows to index a document on a certain shard. This has no relationship with scroll search, so it can be "combined" (indexing and search can not be combined in a single operation).

It's for concurrent scrolling, which can be executed with higher performance.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

Imagine a 10 million doc index and 5 shards on one server node. Then consider a thread pool on the server. To iterate over the index in the naive way, each shard uses one thread with a total of 5 threads. Each thread would have to traverse 2 million docs. Then imagine a sliced scroll with max = 2. Each thread on a shard would have to iterate over only 1 million documents, with a total of 10 threads. If you can run two scroll queries on client side, one for each slice, and join the result set on client side, you can finish the total scroll in approximately half of the time.

Nope. It's for retrieving large result sets, for whatever purpose.


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.