I have searched the docs and Googled, but have not found a definitive answer.
My questions:
Can an integer field be used for routing?
If yes, does the size of the integer have much significance for the distribution (this is a hash-quality question)? (I know the recommended maximum number of shards is about 30,000, so even a short covers that more than twice over.)
Are there any considerations/design points I should be aware of?
Is djb2 or murmur3 used?
From what I can see, converting the number to a 4-character string seems to be my best option.
Yes, that's fine. Fun fact: by default we use the document ID, which could be an integer as well.
no
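For illustration, routing by an integer key might look roughly like this with the 8.x Python client (the "places" index, the document, and the routing key here are all made up; the routing value is passed as a string, so an integer just gets stringified):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

shard_key = 1337  # hypothetical integer routing key

# Index the document, routing by the integer key instead of the default _id.
es.index(
    index="places",
    id="doc-1",
    routing=str(shard_key),
    document={"what": "burgers", "where": "NYC"},
)

# Search with the same routing value so only the matching shard is queried.
es.search(
    index="places",
    routing=str(shard_key),
    query={"match": {"what": "burgers"}},
)
```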
Are you sure that you need custom routing? Custom routing is only the best approach in a very small percentage of use cases. It tends to work better the smaller the subset of data is that you route by.
The reason for this is that once you use custom routing, searches on that data will only be executed in a single thread, because we only have to search a single shard. In comparison, a simple term filter would fan the search out across N threads (the number of primaries), and the filter itself is really efficient, so it would return within a few milliseconds at most (depending on your dataset).
Routing in general also adds complexity which you may not need.
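To make the comparison concrete, the no-routing alternative is just a bool filter query that fans out across all primaries; a rough sketch with the Python client (index and field names are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# No routing: the search fans out to every primary shard, and the cheap
# term filters narrow each shard's results down to the matching subset.
es.search(
    index="places",
    query={
        "bool": {
            "filter": [
                {"term": {"where": "NYC"}},
                {"term": {"what": "burgers"}},
            ]
        }
    },
)
```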
Thanks much, Luca! This is good to know. I deduce that ES serializes the integers to binary and then hashes the binary values (as opposed to merely applying modulo to the raw values), right?
My use case has a relatively large amount of data that cannot easily be distributed evenly across shards. Think of the key data as "what" and "where". A "where" like NYC would, of course, be hot. Podunk, not so much. A "what" like burgers would be hot since there are many burger places, relatively speaking. Midget cage fighting, pretty rare.
So, what I do is hash the "what" and "where" in a special way that provides ranges. This hash is a number, hence my original question. This approach allows a pseudo-random distribution of data, thus giving me theoretically balanced shards. Or, at least close enough.
So, I hash at query time and send the value to ES; ES hashes it, applies the modulo, and knows the one shard to talk to. Makes sense to me.
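As a toy sketch of that mechanism (Elasticsearch actually hashes the _routing string, with Murmur3 in recent versions; Python's built-in hash() just stands in here, so this only shows the shape of the calculation, not real shard assignment):

```python
def shard_for(routing_value: str, num_primary_shards: int) -> int:
    # Hash the routing value and take it modulo the primary shard count,
    # which is the general shape of how a routed request picks its shard.
    return hash(routing_value) % num_primary_shards

# The same routing value maps to the same shard (within one Python process,
# since str hashes are salted per run), so a routed query hits exactly one shard.
print(shard_for("1337", 5))
print(shard_for("1337", 5))  # same result as above
```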
Now, since inverted indices' speed is determined by the dictionary size, not the document count, and since all shards would be almost identical dictionary-wise (since the data is splattered across shards), I am keeping the number of shards very low. I'm aiming at no more than 5 per node. For all I know, 1 could be the optimum.
I don't know the answer to your question about binary values offhand.
Keeping the number of shards low is generally a good idea. In a search use case like yours a good rule of thumb would be to have a shard with a size of 5-10GB. Anything above still works, but in most cases you would benefit from the concurrency that multiple shards give you.
However, using a low number of shards (1 shard?) will negate any _routing performance gains and makes me question why you still care about routing. Typically _routing works better if you have lots and lots of data and many shards.
Even if you have large amounts of data, Elasticsearch will distribute it evenly across shards for you.
Unfortunately you'd have to run performance tests to know for sure, but my suggestion is still to just forget about the routing and rely on bool => filter queries to drill down to one what/where. Depending on the filter, we either cache the clauses or just re-run them, as they are cheap and evaluated independently. So searching for where:NYC AND what:burger and then for where:NYC AND what:pizza would leave the NYC part cached, if you're using a geo query for it.
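To illustrate the caching point, a rough sketch with the Python client (index and field names are made up); the where clause is the part that can be reused between the two searches:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_places(what: str, where: str):
    return es.search(
        index="places",
        query={
            "bool": {
                "filter": [
                    {"term": {"where": where}},  # reusable clause, e.g. NYC
                    {"term": {"what": what}},
                ]
            }
        },
    )

search_places("burger", "NYC")  # first query evaluates the where:NYC filter
search_places("pizza", "NYC")   # second query can reuse the cached where:NYC filter
```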
1 (up to 5) shards per node is for the case of expanding to many nodes, in which case routing is a big deal. Sure, on one node it isn't as important. But who wants to broadcast every query to every node?
I'm only using routing to keep queries from hitting multiple nodes. So, 5 shards enable 5x node expansion before having to reindex.
Having said that, each shard's dictionary would be almost identical, so it's a lot of duplication, hence my goal of a low shard count. And, since multiple threads can hit one shard for reading purposes, I'm not sure that many shards would help in this regard.