We're currently optimizing the sharding setup of our Elasticsearch index to (surprise) decrease response times. The number of routing keys is equal to the number of shards, and we're looking for a setup where all documents in a shard share exactly one routing key. At the moment the distribution over the shards is very uneven; some shards are even empty.
This is how it looks at the moment and how it should look:
Current
shard:0 -> routes: bmx, cyclocrosser
shard:1 -> routes: track-bike
shard:2 -> routes: (empty)
shard:3 -> routes: downhill
Wanted
shard:0 -> routes: bmx
shard:1 -> routes: track-bike
shard:2 -> routes: cyclocrosser
shard:3 -> routes: downhill
Is there any way to make sure that each routing key is routed to exactly one shard, and that no two routing keys share a shard?
We know that the routing is based on djb2 (http://www.cse.yorku.ca/~oz/hash.html#djb2). Is there any option to influence this behavior, and can someone offer deeper insight into how the routing works internally?
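To make the question concrete, here is a minimal sketch of hash-based routing as we understand it: hash the routing key with djb2, then take the result modulo the number of shards. The function names (djb2, shard_for) and the 32-bit signed wrap-around are assumptions for illustration, not Elasticsearch's actual code, so the shard numbers it prints need not match a real cluster.

```python
def djb2(key: str) -> int:
    """Classic djb2 string hash (hash * 33 + c), starting from 5381.

    Assumption: the value wraps around like a signed 32-bit integer.
    """
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as signed.
    return h - 0x100000000 if h >= 0x80000000 else h


def shard_for(routing: str, num_shards: int) -> int:
    """Map a routing key to a shard number in [0, num_shards)."""
    # Python's % never returns a negative result for a positive modulus.
    return djb2(routing) % num_shards


if __name__ == "__main__":
    for route in ["bmx", "cyclocrosser", "track-bike", "downhill"]:
        print(route, "->", shard_for(route, 4))
```

If the hash behaved like a uniform random function, four keys would land on four distinct shards only 4!/4^4 (about 9%) of the time, so collisions and empty shards like the ones above are the expected outcome rather than a misconfiguration.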
Thanks for the reply. You're right, it's the only way to achieve this in a proper and safe way.
To summarize the outcome: It's not possible.
Why? To work for most use cases, routing is not based directly on the routing keys, because the documents would end up distributed very unevenly whenever the routing keys themselves are unevenly distributed (not in my case, but in general they might be). Hashing the routing key is meant to even this out, so that even if all documents with a certain routing key disappear, a shard does not necessarily end up empty.
You could build a workaround based on knowledge of the hashing function in use (Murmur), but it would break if the Elasticsearch team decided to change the hashing function. This has already happened, so it's not safe to rely on such a hidden feature.
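For illustration only, such a workaround could look like the sketch below: try suffixed variants of a routing key until one hashes to the wanted shard. The hash function is passed in as a parameter; djb2 is used here purely as a stand-in, and find_routing_alias is a hypothetical helper, not anything Elasticsearch provides. The aliases it finds are only valid for the exact hash function and shard count they were computed with, which is precisely why the approach is fragile.

```python
from typing import Callable, Optional


def djb2(key: str) -> int:
    """Stand-in hash for the sketch (assumption: 32-bit signed wrap-around)."""
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def find_routing_alias(
    route: str,
    target_shard: int,
    num_shards: int,
    hash_fn: Callable[[str], int] = djb2,
    max_tries: int = 10_000,
) -> Optional[str]:
    """Brute-force a routing value derived from `route` that maps to `target_shard`."""
    for i in range(max_tries):
        candidate = f"{route}-{i}"
        if hash_fn(candidate) % num_shards == target_shard:
            return candidate
    return None  # nothing found within max_tries


if __name__ == "__main__":
    wanted = {"bmx": 0, "track-bike": 1, "cyclocrosser": 2, "downhill": 3}
    for route, shard in wanted.items():
        print(route, "->", find_routing_alias(route, shard, 4), "-> shard", shard)
```

Every document and every query would then have to use the alias instead of the original routing key, and all aliases would have to be recomputed (and the data reindexed) whenever the hash function or the number of shards changes.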
The only safe way to achieve this is to create a separate index for each routing key, as pointed out by Igor_Motov.
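A minimal sketch of what that could look like on the client side: derive one index name per routing key and give each index a single primary shard. The "bikes" prefix and both helper names are made up for this example; the settings body uses the standard index.number_of_shards setting, and index names are lowercased because Elasticsearch requires lowercase index names.

```python
import re


def index_name_for(route: str, prefix: str = "bikes") -> str:
    """Build a lowercase index name for one routing key (hypothetical naming scheme)."""
    slug = re.sub(r"[^a-z0-9_-]+", "-", route.lower()).strip("-")
    return f"{prefix}-{slug}"


def index_settings(num_shards: int = 1) -> dict:
    """Request body for creating an index with the given number of primary shards."""
    return {"settings": {"index": {"number_of_shards": num_shards}}}


if __name__ == "__main__":
    for route in ["bmx", "track-bike", "cyclocrosser", "downhill"]:
        print("create index", index_name_for(route), "with", index_settings())
```

With this layout the application picks the index from the key instead of passing a routing value, so each key's documents live in their own shard regardless of which hash function Elasticsearch uses internally.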