If I understand correctly, the benefits of creating indices with multiple primary shards is to allow for parallelization in reads/writes as well as the ability to segregate documents to specific shards via routing.
However, if we have the ability to handle the routing of searches and indexing at the application layer and can add new indexes as needed based on capacity, why would we not just create all indexes with one primary shard?
We would then add replicas for each node in the cluster for resiliency and improved performance.
Wouldn't this approach be more flexible than determining the number of primary shards up front?
Am I missing something that a multi-primary shard index provides that can't be accomplished with replicas?
If I understand correctly, the benefits of creating indices with multiple primary shards is to allow for parallelization in reads/writes as well as the ability to segregate documents to specific shards via routing.
Yes, that's pretty much it. Customising the routing is possible but not strongly recommended because it can leave you with imbalances between the shards.
However, if we have the ability to handle the routing of searches and indexing at the application layer and can add new indexes as needed based on capacity, why would we not just create all indexes with one primary shard?
If you have the ability to route searches and indexing at the application level then you can indeed do all the work yourself. If you'd rather let Elasticsearch do that for you then you can have multiple primary shards in an index.
Prior to the introduction of the split index API you couldn't increase the number of shards in an existing index, so you had to decide at index creation whether there was any risk of exceeding the capacity of a single shard and set the number of primaries accordingly. Now that this feature has landed it may make more sense to start with a single primary shard and split it as the need arises.
Thanks, our primary reasoning behind the single primary shard approach had to do with the limitations on number of aliases. If we were to let ES handle the routing for some of our use cases such as routing users to 'their' shard, then we would be required to create aliases that filtered by user id. This approach is not scalable once the number of users crosses into the 10s of thousands.
So, given that we have to handle the routing ourselves, it makes sense to create single primary shard indexes and add indexes as capacity depends, no?
Just to clarify, the necessity for aliases for each user was not just for the filtering - we could handle that in the queries. The aliases were needed to handle the routing when querying across multiple indexes - some of which were custom routed and others that weren't.
We couldn't come up with a way to handle these types of queries without aliases. Maybe there is a way to do so that we missed?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.