I asked this in the (freenode) IRC channel, and if I happen to get an answer there in the meantime, I will follow up. However, it's quite a specific question, so I figured I'd "double-post" here.
In the scroll documentation, it says:
"By default the splitting is done on the shards first and then locally on each shard using the _id field with the following formula: slice(doc) = floorMod(hashCode(doc._id), max)"
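For reference, this is how I read that formula, as a rough Python sketch. `zlib.crc32` is only a stand-in hash I picked so the result is deterministic; I don't know which hash Elasticsearch actually uses internally, and `slice_for` is a name I made up.

```python
import zlib

def slice_for(doc_id: str, max_slices: int) -> int:
    """Assign a document to a slice: floorMod(hash(_id), max)."""
    # Stand-in hash (assumption): crc32 is deterministic across runs,
    # unlike Python's built-in hash() for strings.
    h = zlib.crc32(doc_id.encode("utf-8"))
    # Python's % already behaves like floorMod for a positive divisor.
    return h % max_slices

# Every _id lands in exactly one of the `max` slices.
ids = [str(i) for i in range(1, 11)]
slices = {doc_id: slice_for(doc_id, 2) for doc_id in ids}
```

The arithmetic per document is tiny, which is why I'm confused about where the cost comes from.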
It then goes on to word it as if this is super expensive (I may be misunderstanding) and further states:
"To avoid this cost entirely it is possible to use the doc_values of another field to do the slicing"...
It then goes on to talk about using a timestamp field in an append-only index.
I don't have an append-only index, so using the timestamp field is ruled out from the start. However, I don't understand what it's saying is actually expensive, or what cost I would be avoiding by using the doc_values of another field.
It then goes on to say that, when using the doc_values of another field, the field needs to be numeric, single-valued, set once when the document is created and never updated, and high cardinality.
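As far as I can tell from the docs, slicing on another field just means adding a `field` key to the `slice` object in the search body. Here is a minimal sketch of the bodies I'd send, one per slice; `mysql_id` is a hypothetical field name, and `sliced_scroll_body` is my own helper, not anything from a client library.

```python
import json

def sliced_scroll_body(slice_id: int, max_slices: int, field: str) -> dict:
    """Build the search body for one slice of a sliced scroll."""
    return {
        # `field`, `id` and `max` are the keys shown in the scroll docs.
        "slice": {"field": field, "id": slice_id, "max": max_slices},
        "query": {"match_all": {}},
    }

# One body per slice; each would be sent as its own scroll request.
# "mysql_id" is a hypothetical numeric field with doc_values enabled.
bodies = [sliced_scroll_body(i, 2, "mysql_id") for i in range(2)]
print(json.dumps(bodies[0], indent=2))
```

Each body would go to the index's `_search` endpoint with a `scroll` parameter, and each slice is then scrolled independently.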
All of which is true for my "_id" field. Obviously, my "_id" field is not analyzed, so if I understand doc_values correctly, it has doc_values. To be clear, my "_id" is actually just an auto-increment from MySQL, so it wouldn't be as "perfect" as hashing it first, but I think it would be, overall, just as effective. Even if not, I'd bet we have another field that would suffice.
Anyway, I am just trying to understand what it's trying to get across; I'm kind of lost by the wording. I don't know if I am going to use splitting yet, since it would be a challenge in the language I am using. I haven't even seen how slow or costly the scroll API is yet. I'm just trying to keep it as cheap as possible.
However, it seems to me it is telling me that the "slice(doc) = floorMod(hashCode(doc._id), max)" is somehow expensive. If it is expensive, what is expensive about it? Is it the hashing?
If I'm understanding correctly, "floorMod(doc._id, max)" without the hashing would probably work for me, since it's an auto-increment ID. But again, I don't understand why it's putting so much emphasis on the split key. Hashing isn't that expensive, is it? Even the SHA family is pretty cheap, no?
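To illustrate what I mean by skipping the hash, here is a quick sketch with made-up sequential IDs. It assumes the IDs have no large gaps; lots of deleted rows could skew the split. `slice_without_hash` is just my own name for it.

```python
from collections import Counter

def slice_without_hash(doc_id: int, max_slices: int) -> int:
    """floorMod(doc._id, max) applied directly to a numeric ID."""
    return doc_id % max_slices

# Simulated auto-increment IDs 1..1000 (assumption: no gaps).
counts = Counter(slice_without_hash(i, 4) for i in range(1, 1001))
```

With gap-free IDs, each of the 4 slices gets exactly 250 documents, which seems as even as anything a hash would give me.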
I guess the final question is: what is the point it's trying to get across here? Is it just trying to tell me I can customize the splitting for edge cases? Is it telling me that, by default, the splitting is quite expensive due to the hashing of the _id? Am I overthinking this?