Documentation for the scroll API is a bit confusing!

I asked this in the (freenode) IRC channel, and if I happen to get an answer there in the meantime, I will follow up. However, it's quite a specific question, so I figured I'd "double-post" here.

In the scroll documentation, it says:

"By default the splitting is done on the shards first and then locally on each shard using the _id field with the following formula: slice(doc) = floorMod(hashCode(doc._id), max)"
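Conceptually, that formula just buckets each document into one of `max` slices. Here is a minimal sketch; note that `zlib.crc32` is only a deterministic stand-in, not Elasticsearch's actual hash, and Python's `%` on non-negative ints behaves like Java's `floorMod`:

```python
import zlib

def slice_for(doc_id: str, max_slices: int) -> int:
    """Stand-in for slice(doc) = floorMod(hashCode(doc._id), max).
    zlib.crc32 is NOT Elasticsearch's hash; it is only an
    illustrative, deterministic placeholder."""
    return zlib.crc32(doc_id.encode()) % max_slices

# Every document lands in exactly one of the `max` slices, so the
# slices together cover the whole result set without overlap.
ids = [f"doc-{n}" for n in range(1000)]
slices = {s: [i for i in ids if slice_for(i, 4) == s] for s in range(4)}
assert sum(len(v) for v in slices.values()) == len(ids)
```

The point of the hash is just to spread arbitrary `_id` values evenly across slices; the assignment itself is a cheap modulo.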

It then goes on to word this as if it is super expensive (possibly I'm misunderstanding) and further states:

"To avoid this cost entirely it is possible to use the doc_values of another field to do the slicing"...

It then goes on to talk about using a timestamp field in an append-only index.

I don't have an append-only index, so using the timestamp field is out the window right out of the gate. However, I don't understand what it's saying is actually expensive, or what cost I am avoiding by using the doc_values of another field.

It further goes on to say that, when using the doc_values of another field, the field needs to be numeric, single-valued, set once when the document is created and never updated, and high cardinality.

.... All of which is true for my "_id" field. Obviously, my "_id" field is not analyzed, so if I understand doc_values correctly, it has doc_values. To be clear, my "_id" field is actually just an auto-increment from MySQL, so it wouldn't be as "perfect" as hashing it first, but I think it would be, overall, just as effective. Even if not, I'd bet we have another field that would suffice.

Anyway, I am just trying to understand what it's trying to get across; I'm kind of lost by the wording. I don't know if I am going to use splitting yet; it would be a challenge in the language I am using. I haven't even actually seen how slow/costly it is to use the scroll API yet. I'm just trying to keep it as cheap as possible.

However, it seems to me it is telling me that the "slice(doc) = floorMod(hashCode(doc._id), max)" is somehow expensive. If it is expensive, what is expensive about it? Is it the hashing?

If I'm understanding correctly, "floorMod(doc._id, max)" without the hashing would probably work for me, being an auto-increment ID. But again, I don't understand why it's putting so much emphasis on the split key. Hashing isn't that expensive, is it? Even SHA hashes are pretty cheap, no?

I guess my final question is: what is the point it's trying to get across here? Is it just trying to tell me I can customize the splitting for edge cases? Is it telling me that, by default, the splitting is quite expensive due to the hashing of the doc _id? Am I overthinking this?

Hi @abcarroll,

The hashing done during a sliced scroll is not the issue.

If #slices == #shards, Elasticsearch will simply serve one shard per sliced scroll. This is cheap.

But if Elasticsearch has to split a shard into multiple slices, each slice will individually figure out which docs belong to it based on the formula. The expensive part is fetching the _id, which is not stored as a doc-value (_id is handled specially), as well as keeping a bitset of the documents that belong to each specific slice. See this GitHub issue for a discussion.

Using the doc-values of another field for slicing is much cheaper, and this is the point the documentation is trying to make.
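For what it's worth, the difference on the request side is just one extra key in the `slice` object. Here is a sketch of the two search bodies as Python dicts (the field name `my_numeric_field` is a made-up placeholder for whatever numeric, single-valued, doc_values-enabled field you pick):

```python
MAX_SLICES = 2  # run one scroll per slice id, in parallel if you like

def slice_body(slice_id, field=None):
    """Build a sliced-scroll search body. Without `field`, Elasticsearch
    slices on _id (the expensive default); with `field`, slicing uses
    that field's doc_values instead."""
    s = {"id": slice_id, "max": MAX_SLICES}
    if field is not None:
        s["field"] = field
    return {"slice": s, "query": {"match_all": {}}}

default_body = slice_body(0)                    # slices on _id
cheap_body = slice_body(0, "my_numeric_field")  # slices on doc_values
```

Each body would then be sent as a normal scroll search (one scroll context per slice id from 0 to max-1).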
