In Elasticsearch, replica shards always sync their data from the primary shard. In some cases this can lead to performance issues. For example:
When the cluster handles a high query volume and has many replicas (e.g., 1 primary shard and 100 replica shards), the node hosting the primary shard can see much higher CPU and bandwidth usage than the other nodes.
When the nodes of the cluster are spread across multiple data centers, and the primary shard is in Data Center A while many replica shards are in Data Center B, all of those replicas pulling data from Data Center A can saturate the dedicated inter-data-center bandwidth.
To address these issues, it would be beneficial to let certain replica shards sync data from other replica shards instead of the primary, reducing the load on the node hosting the primary. Since replicas within a cluster always copy from the primary, this would require either a custom synchronization strategy or a different replication mechanism.
Yes, query requests are evenly distributed across all shard copies. However, when there is a large volume of write operations (especially updates), the write load on the primary shard becomes very high, and with multiple replicas the CPU and bandwidth usage of the node hosting the primary shard is significantly higher than that of the other nodes.
My current workaround is to increase the number of primary shards from 1 to several, but that raises the total shard count, which hurts query performance and adds query load.
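For reference, a minimal sketch of that workaround (the host, index name, and shard counts below are placeholders), creating an index with several primaries through the REST API from Python:

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Spread write load over several primary shards instead of one.
# The total shard count is primaries * (1 + replicas), so this also
# multiplies the number of shards each query fans out to.
body = {
    "settings": {
        "index": {
            "number_of_shards": 3,    # was 1
            "number_of_replicas": 2,  # placeholder replica count
        }
    }
}
resp = requests.put(f"{ES}/my-index-v2", json=body)
resp.raise_for_status()
print(resp.json())
```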
Are you saying that multiple indexes can be created to synchronize index data using CCR?
For example, if data centers A, B, and C each have 10 data nodes, could we create three indexes, each with 1 primary shard and 9 replica shards, and then pin each index's shards to its own data center? Is that correct?
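Assuming the data nodes are tagged with a custom attribute such as node.attr.data_center (the attribute name, host, and index names below are placeholders), the per-index pinning described above could be sketched with allocation filtering like this:

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# One index per data center: 1 primary + 9 replicas, with an allocation
# filter that keeps every copy of the index on nodes whose
# node.attr.data_center matches the given value.
for dc in ("A", "B", "C"):
    body = {
        "settings": {
            "index": {
                "number_of_shards": 1,
                "number_of_replicas": 9,
                "routing.allocation.require.data_center": dc,
            }
        }
    }
    resp = requests.put(f"{ES}/my-index-dc-{dc.lower()}", json=body)
    resp.raise_for_status()
```

Bear in mind the three indexes are independent copies, so something still has to replicate the data between them.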
However, CCR is a paid feature. Are there any other ways to achieve this?
Yes, you'd normally have one (or more) clusters in each data center, with CCR pulling the data from the central leader cluster. For really big setups you can use chained replication to further reduce load on the central cluster.
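For completeness, a rough sketch of what that looks like on one of the follower clusters (the cluster addresses, remote-cluster alias, and index names are placeholders, and CCR requires the appropriate license):

```python
import requests

FOLLOWER = "http://dc-b-es:9200"  # placeholder follower cluster address

# 1. Register the central leader cluster as a remote cluster.
remote = {
    "persistent": {
        "cluster.remote.leader_dc_a.seeds": ["dc-a-es:9300"]  # placeholder seed node
    }
}
requests.put(f"{FOLLOWER}/_cluster/settings", json=remote).raise_for_status()

# 2. Create a follower index that continuously pulls changes from the
#    leader index on the remote cluster.
follow = {
    "remote_cluster": "leader_dc_a",
    "leader_index": "my-index",
}
requests.put(
    f"{FOLLOWER}/my-index-follower/_ccr/follow?wait_for_active_shards=1",
    json=follow,
).raise_for_status()
```

With chained replication, a follower index in one data center can itself act as the leader for a follower in the next, so not every cluster has to pull from the central one.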
Not really. This kind of problem only arises when you're running a large cluster, which already costs a lot just in infrastructure, so paid features like CCR typically save more in infrastructure costs than they cost in licensing.