Improving the default routing hash function

I noticed that the default routing hash function is DJB. This function is
particularly poor at routing when the input keys are short and are mildly
different. For example, basic two digit hex based values "00" -> "FF"
produce very large hot spots on clusters of size 11, 16, and 17 and others.
By "large" I mean that on an uniform input distribution, the largest shards
is over 2x larger (sometimes up to 4x!) than the smallest shard.

I feel it is reasonable to assume from a usability standpoint that if the
routing key is an order of magnitude larger than the modulus that the
resulting document distribution in the shards to be uniform. In our case,
we had 255 distinct routing keys over 17 shards and the smallest shard is
40% the size of the largest. Furthermore, we know that the number of
documents per routing key is roughly the same.

This almost feels like a bug (maybe it is). It is certainly unexpected.
Something like FNV seems like a good alternative. I would add the the Java
string hash alternative isn't much better in a lot of cases, especially
short inputs.

Is there a particular reason for using DJB? Any chance of changing the
default or including something like FNV out-of-the box? I would also
suggest a note in the documentation about the potential for hotspot simply
due to routing key selection. Any thoughts in general?

Thanks,
Andrew White

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cff4dd98-411e-47f8-9679-6d44f5f97806%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

1 Like

Hi Andrew,

This is indeed an issue. For your information, elasticsearch will switch to
murmur3 in the next major version. For backward compatibility, old indices
will still use DJB, but newly created indices will use murmur3. There is
more background about this issue at

I don't know why DJB was chosen, but I believe that the fact that it
performs well on incremental ids (0, 1, 2, 3, ...) and the default number
of shards played a role into this choice (wild guess).

On Sun, Jan 18, 2015 at 9:22 PM, Andrew White andrew@datarank.com wrote:

I noticed that the default routing hash function is DJB. This function is
particularly poor at routing when the input keys are short and are mildly
different. For example, basic two digit hex based values "00" -> "FF"
produce very large hot spots on clusters of size 11, 16, and 17 and others.
By "large" I mean that on an uniform input distribution, the largest shards
is over 2x larger (sometimes up to 4x!) than the smallest shard.

I feel it is reasonable to assume from a usability standpoint that if the
routing key is an order of magnitude larger than the modulus that the
resulting document distribution in the shards to be uniform. In our case,
we had 255 distinct routing keys over 17 shards and the smallest shard is
40% the size of the largest. Furthermore, we know that the number of
documents per routing key is roughly the same.

This almost feels like a bug (maybe it is). It is certainly unexpected.
Something like FNV seems like a good alternative. I would add the the Java
string hash alternative isn't much better in a lot of cases, especially
short inputs.

Is there a particular reason for using DJB? Any chance of changing the
default or including something like FNV out-of-the box? I would also
suggest a note in the documentation about the potential for hotspot simply
due to routing key selection. Any thoughts in general?

Thanks,
Andrew White

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/cff4dd98-411e-47f8-9679-6d44f5f97806%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/cff4dd98-411e-47f8-9679-6d44f5f97806%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j7i5oYWVNg1Qnyt2U9-gFYyoKGabSb49tWgCOcLTW5O2g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

1 Like