I have searched the docs and Googled, but have not found a definitive answer.
- Can an integer field be used for routing?
- If yes, does the integer size have much significance for the distribution (this is a hash-quality question)? (I know the recommended max shard count is about 30,000, so even a 16-bit short covers this more than twice.)
- Are there any considerations/design points I should be aware of?
- Is djb2 or murmur3 used?
From what I can see, it seems that converting the number to a 4-char string is my best option.
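For example, that conversion could be as simple as a fixed-width hex encoding (a minimal sketch; `routing_key` is a hypothetical helper, and whether 16 bits is enough depends on the key space):

```python
def routing_key(n: int) -> str:
    """Pack a 16-bit integer into a fixed-width 4-char hex string.

    Elasticsearch routing values travel as strings (e.g. the
    ?routing= parameter on index and search requests), so an
    integer key has to be serialized one way or another;
    fixed-width hex is one simple, collision-free option.
    """
    if not 0 <= n <= 0xFFFF:
        raise ValueError("routing key must fit in 16 bits")
    return format(n, "04x")
```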
Thanks much, Luca! This is good to know. I deduce that ES serializes the integers to binary and then hashes the binary values (as opposed to merely applying modulo to the values), right?
My use case has a relatively large amount of data that cannot easily be distributed evenly across shards. Think of the key data as "what" and "where". A "where" like NYC would, of course, be hot. Podunk, not so much. A "what" like burgers would be hot since there are many burger places, relatively speaking. Midget cage fighting, pretty rare.
So, what I do is hash the "what" and "where" in a special way that provides ranges. This hash is a number, hence my original question. This approach allows a pseudo-random distribution of data, thus giving me theoretically balanced shards. Or, at least close enough.
So, I hash at query time, send to ES, it hashes and modulo divides and knows the one shard to speak to. Makes sense to me.
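That hash-then-modulo step can be sketched as follows. Elasticsearch routes with a Murmur3 hash of the routing value; the function below is a generic murmur3_x86_32 (seed 0) over the UTF-8 bytes, which may not match Elasticsearch's internal variant bit-for-bit, so the shard numbers are illustrative only:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Reference murmur3_x86_32; returns an unsigned 32-bit hash."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data)
    # Process the input in 4-byte little-endian blocks.
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl 15
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl 13
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Mix in the 1-3 trailing bytes, if any.
    tail = data[n - n % 4:]
    k = 0
    if len(tail) == 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization: mix in the length, then avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def shard_for(routing: str, num_primary_shards: int) -> int:
    # Roughly what Elasticsearch does with a custom routing value:
    # hash the routing string, then modulo by the primary shard count.
    return murmur3_32(routing.encode("utf-8")) % num_primary_shards
```

Note the input is the routing *string*, so an integer key gets serialized to text first, which lines up with the 4-char-string idea from the original question.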
Now, since an inverted index's speed is determined by the dictionary size, not the document count, and since all shards would be almost identical dictionary-wise (the data being splattered across shards), I am keeping the number of shards very low. I'm aiming at no more than 5 per node; for all I know, 1 could be the optimum.
Do you have any thoughts on this approach?
I don't know the answer to your question about binary values offhand.
Keeping the number of shards low is generally a good idea. In a search use case like yours, a good rule of thumb is a shard size of 5-10 GB. Anything above that still works, but in most cases you would benefit from the concurrency that multiple shards give you.
However, using a low number of shards (1 shard?) negates any _routing performance gains and makes me question why you still care about routing. Typically _routing works better if you have lots and lots of data and many shards.
Even if you have large amounts of data, Elasticsearch will distribute it evenly across shards for you.
Unfortunately you'd have to run performance tests to know for sure, but my suggestion is still to just forget about routing and rely on filter queries to drill down to one what/where. Depending on the filter, we either cache it or just re-run it, as filters are cheap, and each filter is cached independently. So searching for `where:NYC AND what:burger` and then searching for `where:NYC AND what:pizza` would leave the NYC part cached, if you're using a geo query for it.
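As a sketch of why the second search is cheap (assuming `where` and `what` are keyword fields queried with term filters; the actual mapping, and the geo query mentioned above, may differ), the two searches share a byte-identical `where` clause, which is exactly what makes it independently cacheable:

```python
def build_query(where: str, what: str) -> dict:
    """Build a bool query whose clauses run in filter context.

    Filter clauses don't contribute to scoring, so Elasticsearch
    can cache each one independently and reuse it across queries.
    """
    return {
        "bool": {
            "filter": [
                {"term": {"where": where}},  # assumed keyword field
                {"term": {"what": what}},    # assumed keyword field
            ]
        }
    }

burgers = build_query("NYC", "burger")
pizza = build_query("NYC", "pizza")

# The location clause is identical in both queries, so the second
# search can reuse the cached NYC filter and only run the what part.
assert burgers["bool"]["filter"][0] == pizza["bool"]["filter"][0]
```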
Hey Luca, thanks for your reply!
One shard (up to 5) per node is for the case of expanding to many nodes, where routing is a big deal. Sure, on one node it isn't as important. But who wants to broadcast every query to every node?
I'm only using routing to mitigate queries going to multiple nodes. So, 5 shards enables 5X node expansion before having to reindex.
Having said that, each shard's dictionary would be almost identical, so it's a lot of duplication, hence my goal of a low shard count. And, since multiple threads can hit one shard for reading purposes, I'm not sure that many shards would help in this regard.
Thanks for the size tip. That helps!