I can see that it is possible to allocate shards to particular nodes, and (some) routing of data is also possible, but... how can I ensure that each bucket's data always ends up in its own shard?
Scenario:
I have 3 nodes with 2 shards on each node (one primary and one replica):
[P0 R2] [P1 R0] [P2 R1]
From the documentation I know that:
shard_num = hash(_routing) % num_primary_shards
I want to route bucket1's data... The hash of (bucket1) = 9; 9 % 3 = 0, so this data goes to P0.
Then, I want to route bucket2's data... The hash of (bucket2) = 33; 33 % 3 = 0, so this data goes to P0 too.
There's not really a way to do that in Elasticsearch. Routing only guarantees that one routing value always maps to a certain subset of shards... but it doesn't prevent other routing values from also mapping to the same set or subset of shards.
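To make the collision concrete, here is a minimal sketch of the routing formula from the question. It reuses the hash values quoted there (9 and 33) rather than computing a real hash; the helper name shard_for is made up for illustration and is not part of any Elasticsearch API.

```python
# Sketch of shard_num = hash(_routing) % num_primary_shards.
# The hash values are the ones quoted in the question, not real hashes.

def shard_for(routing_hash: int, num_primary_shards: int) -> int:
    """Return the primary shard number for an already-hashed routing value."""
    return routing_hash % num_primary_shards

NUM_PRIMARY_SHARDS = 3
hashes = {"bucket1": 9, "bucket2": 33}

assignments = {
    name: shard_for(h, NUM_PRIMARY_SHARDS) for name, h in hashes.items()
}
print(assignments)  # {'bucket1': 0, 'bucket2': 0}
```

Both buckets land on P0. With more distinct routing values than primary shards, at least two values have to share a shard, so routing alone cannot give every bucket a shard of its own.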
If you absolutely must have this level of separation, I think an index is the smallest unit of division that you can use. E.g. each customer gets their own index, so you can control how and where its shards are allocated (and prevent other customers' data from being indexed into it). I would just try to keep the indices as small as possible, since too many shards in a cluster causes performance problems. Ideally most of the indices will be a single shard.
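As a rough sketch of the index-per-customer idea, the snippet below only builds the index name and the settings body you might send when creating the index; it does not talk to a cluster. The customer-<id> naming scheme, the node names, and the helper customer_index_request are assumptions for illustration. The settings keys (index.number_of_shards, index.number_of_replicas, index.routing.allocation.include._name) are standard index settings, but check them against the docs for your version.

```python
import json

def customer_index_request(customer_id, node_names):
    """Build a hypothetical create-index request for one customer.

    node_names lists the nodes this customer's shards may live on.
    At least two nodes are needed so the replica has somewhere to go.
    """
    return {
        # Index names must be lowercase.
        "index": f"customer-{customer_id}".lower(),
        "body": {
            "settings": {
                # One primary shard keeps the index small.
                "index.number_of_shards": 1,
                "index.number_of_replicas": 1,
                # Shard allocation filtering: only these nodes may hold the index.
                "index.routing.allocation.include._name": ",".join(node_names),
            }
        },
    }

request = customer_index_request("Bucket1", ["node-1", "node-2"])
print(request["index"])
print(json.dumps(request["body"], indent=2))
```

You would pass the index name and body to whatever client you use to create indices. Because each customer has a separate index, no other customer's documents can end up in the same shard.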