Documentation for scroll API is a bit confusing!

I asked this in the (freenode) IRC channel, and if I happen to get an answer there in the meantime, I will follow up. However, it's quite a specific question, so I figured I'd "double-post" here.

In the scroll documentation, it says:

"By default the splitting is done on the shards first and then locally on each shard using the _id field with the following formula: slice(doc) = floorMod(hashCode(doc._id), max)"
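The quoted formula can be sketched in plain Python. `hash_code` below mimics Java's `String.hashCode()` and `floor_mod` mimics `Math.floorMod()`; both are stand-ins for illustration, not Elasticsearch's actual internals:

```python
def hash_code(s: str) -> int:
    """Java-style String.hashCode(): a stand-in for how the _id might be hashed."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit int, as Java would
    return h - 0x100000000 if h >= 0x80000000 else h

def floor_mod(a: int, b: int) -> int:
    """Java's Math.floorMod(): for b > 0 this matches Python's % operator."""
    return a % b

def slice_of(doc_id: str, max_slices: int) -> int:
    # slice(doc) = floorMod(hashCode(doc._id), max)
    return floor_mod(hash_code(doc_id), max_slices)

# Every document deterministically lands in exactly one of `max_slices` buckets.
print(slice_of("1042", 5))
```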

It then goes on to word it as if this is super expensive (possibly a misunderstanding on my part) and further states:

"To avoid this cost entirely it is possible to use the doc_values of another field to do the slicing"...

And further goes on to talk about using a timestamp field in an append-only index.
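For reference, slicing on another field is expressed by adding a `field` to the `slice` object of each scroll request. A minimal sketch of the request bodies for a two-way split, built as Python dicts; `my-index` and `timestamp` are placeholder names here:

```python
import json

MAX_SLICES = 2  # run one scroll request per slice, in parallel

# Request bodies for POST /my-index/_search?scroll=1m; "my-index" and
# "timestamp" are placeholder names for illustration.
slice_bodies = [
    {
        "slice": {"field": "timestamp", "id": slice_id, "max": MAX_SLICES},
        "query": {"match_all": {}},
    }
    for slice_id in range(MAX_SLICES)
]

print(json.dumps(slice_bodies[0], indent=2))
```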

I don't have an append-only index, so using the timestamp field is out of the question right out of the gate. However, I don't understand what it's saying is actually expensive, or what cost I am avoiding by using the doc_values of another field.

Again, it goes on to say that, when using the doc_values of another field, the field needs to be numeric, single-valued, set once when the document is created and never updated, and high-cardinality.

.... All of which is true for my "_id" field. Obviously, my "_id" field is not analyzed, so if I understand doc_values correctly, it has doc_values. To be clear, my "_id" field is actually just an auto_increment from MySQL, so it wouldn't be as "perfect" as hashing it first, but I think it would be, overall, just as effective. Even if not, I'd bet we have another field that would suffice.

Anyway, I am just trying to understand what it's trying to get across; I'm kind of lost by the wording. I don't know if I am going to use splitting yet, as it would be a challenge in the language I am using. I haven't even actually seen how slow/costly it is to use the scroll API yet. I'm just trying to keep it as cheap as possible.

However, it seems to me it is telling me that the "slice(doc) = floorMod(hashCode(doc._id), max)" is somehow expensive. If it is expensive, what is expensive about it? Is it the hashing?

If I'm understanding correctly, "floorMod(doc._id, max)" without the hashing would probably work for me, being an auto-increment ID. But again, I don't understand why it's putting so much emphasis on the split key. Hashing isn't that expensive, is it? Even SHA digests are pretty cheap, no?
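To sanity-check that intuition: if the slice key really were a raw auto-increment id with no hashing, consecutive ids would round-robin evenly across slices. A quick hypothetical sketch:

```python
def slice_no_hash(auto_id: int, max_slices: int) -> int:
    # floorMod(doc._id, max) without the hashCode step; for a positive
    # divisor, Python's % behaves like Java's Math.floorMod.
    return auto_id % max_slices

# Hypothetical MySQL auto_increment ids 1..1000, split 4 ways.
buckets = [0] * 4
for doc_id in range(1, 1001):
    buckets[slice_no_hash(doc_id, 4)] += 1
print(buckets)  # → [250, 250, 250, 250]
```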

I guess my final question is: what is the point it's trying to get across here? Is it just trying to tell me I can customize the splitting for edge cases? Is it telling me that, by default, the splitting is quite expensive due to the hashing of the doc _id? Am I overthinking this?

Hi @abcarroll,

The hashing when doing a sliced scroll is not the issue.

If #slices == #shards, Elasticsearch will simply serve one shard per sliced scroll. This is cheap.

But if ES has to split a shard into multiple slices, each of the slices will individually figure out which docs belong to it based on the formula. The expensive part is fetching the _id, which is not stored as a doc value (_id is handled specially), as well as keeping a bitset of the documents that belong to each specific slice. See this GitHub issue for a discussion of this.
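As an illustrative sketch (not ES internals): when one shard must serve k slices, each sliced scroll independently scans the shard's ids to decide membership and keeps a bitset of the hits, so the work is roughly k scans of the shard plus k shard-sized bitsets. Python's `hash` on small ints stands in for `floorMod(hashCode(_id), max)`:

```python
def bitset_for_slice(shard_doc_ids, slice_id, k):
    # Each of the k sliced scrolls runs this scan over the whole shard on its
    # own: fetch the id, apply the formula, remember the matches in a bitset.
    return [hash(doc_id) % k == slice_id for doc_id in shard_doc_ids]

shard = list(range(10_000))  # stand-in doc ids (small ints hash deterministically)
k = 4
bitsets = [bitset_for_slice(shard, s, k) for s in range(k)]

# Every doc belongs to exactly one slice, but the shard was scanned k times.
assert all(sum(col) == 1 for col in zip(*bitsets))
print(sum(bitsets[0]))  # size of slice 0
```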

Using a doc value for slicing is much cheaper, and that is the point the documentation is making.