I'm experiencing a problem with routing, which I can't figure out.
I have a server setup with a certain number of shards (512).
The data I store in Elasticsearch belongs to ~200 separate units, so that I'm using the strategy of assigning a numeric id to each unit (unique, incremental, and always smaller than the number of shards), and to store the data using as routing value the id of the unit.
For example, unit "foo" has id 1, while unit "bar" has id 98. Both ids match their respective routing value.
While storing the data, incrementally (unit by unit), everything was working fine for the first ~100 units (I don't know the exact number), but at some point, ES started reusing the shards, eg. data addressed by routing 98 is stored along with data addressed by routing 1.
This internal reusing strategy appears to be some sort of hashing, since it's consistent and deterministic - data with id 98 will always go in the shard with data with id 1, never with id 2.
Does anybody has any suggestion? I didn't find anything obvious in the documentation, and it obviously causes a serious problem, since I need complete compartmentalization of the data.
Unless your cluster consists of 512 nodes, there are chances you would get better performance out of your elasticsearch cluster by having fewer shards. The optimal number is typically about the number of nodes that you have in your cluster.
Indeed routing is based on hashing. The hashing algorithm that we use in 1.x is called djb2 and is a bit naive which is the reason why it will be replaced with murmur3 for indices created on elasticsearch 2.0. I guess you are experiencing collisions because the current hash function is a bit simplistic. However I'm still a bit confused how you were hoping to distribute 200 units across 512 shards since there are more shards than units.
By the way, you don't need to provide ids to your unit for routing, you can just provide the name of your unit as a routing value.
A bit of clarification first!
- there is a single node
- the number of shards is overallocated
- the idea is to store data for a given unit into a single shard, but no more than one unit per shard
so, based on what you're saying, and a second read of the custom routing page:
- custom routing guarantees that documents with the same routing value will locate to the same shard, but not that all the documents within a shard will have the same routing value.
is that correct? if so, there's no way to directly reference a certain shard, right?
This is correct: Elasticsearch does not allow you to choose the shard, only to make sure that documents that share a common property will be on the same shard. If you need to ensure that data are isolated, then it would be better to have different indices.