Optimal Shard Strategy for High Search Load with Elasticsearch Cluster

Hi,

I am working on optimizing an Elasticsearch cluster and am seeking advice on the best shard strategy. Here are the specifics of my setup:

  • Node Details: 4 data nodes running on AWS r6g.large instances, with 3 master nodes
  • Index: Single index containing 7 million documents
  • Current Shard Configuration: number_of_shards: 2, number_of_replicas: 1
  • Workload Characteristics: Predominantly search-heavy, with peak search requests reaching up to 100 req/s. During peak times, the average CPU usage on our data nodes reaches 70%.

Given these parameters, I am exploring how to adjust the number of primary and replica shards to better manage the search workload and reduce the impact on node performance.

I would appreciate any insights or recommendations on how to configure my shard setup to ensure optimal performance and scalability.

Thank you in advance for your help!

What is the size of your index? Which version of Elasticsearch are you using? What type of queries are you using?

  • index size: 3 GB
  • Elasticsearch version: 7.10
  • query type: all of our queries look like the one below (a simple query with a text match)

We use an ngram tokenizer (min_gram: 2, max_gram: 3) on the text fields because they contain Japanese words.

{
    "query": {
        "bool": {
            "filter": [
                { "term": { "some_flag": true } },
                {
                    "bool": {
                        "should": [
                            {
                                "multi_match": {
                                    "query": "some word",
                                    "fields": ["text_a", "text_b", "text_c"]
                                }
                            },
                            {
                                "match_phrase": {
                                    "text_d": "some word"
                                }
                            }
                        ]
                    }
                }
            ]
        }
    },
    "size": 100,
    "from": 0,
    "sort": {
        "updated_at": "desc"
    }
}
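For reference, an ngram setup along the lines described above might look like the sketch below. The index, tokenizer, analyzer, and field names here are placeholders, not the poster's actual mapping; only one field is shown.

```
PUT /my_index
{
    "settings": {
        "analysis": {
            "tokenizer": {
                "ja_ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "ja_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "ja_ngram_tokenizer"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "text_a": { "type": "text", "analyzer": "ja_ngram_analyzer" }
        }
    }
}
```

Note that with min_gram 2 and max_gram 3 the gram difference is 1, which is within the default index.max_ngram_diff, so no extra setting is needed for that.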

It looks like the complete index is small enough to fully fit within the page cache, assuming you do not have a lot of other data in the cluster as well.

I would therefore recommend changing to a single primary shard with 3 replica shards, so that every node holds a full copy of the index in a single shard. If the page cache is unable to hold the full shard, I would try reducing the heap size below the recommended 50% of RAM. This way each node can serve any request on its own and there is less coordination overhead.
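One way to go from 2 primaries to 1 without a full reindex is the shrink API (the index names below are placeholders). Shrinking requires the source index to be write-blocked and all of its shards to be co-located on one node first; the replica count can then be set as part of the shrink, since number_of_replicas is a dynamic setting.

```
PUT /my_index/_settings
{
    "index.blocks.write": true,
    "index.routing.allocation.require._name": "node-1"
}

POST /my_index/_shrink/my_index_1shard
{
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 3
    }
}
```

After verifying the new index, you would switch your alias or application over to my_index_1shard and remove the allocation/write-block settings as appropriate.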

There have been a lot of performance improvements in Elasticsearch 8.x, so I would recommend upgrading to the latest version.


Thank you!

If I change to a single primary shard with 3 replica shards, I think only one data node can handle indexing requests. Wouldn't this cause an imbalance in CPU load? In our cluster, while it's not very frequent, there are always processes running that index user data.

All nodes perform indexing into the shards they hold, so the node holding the primary may do a bit more work, but I would not expect a big difference. I would recommend you test it to find out.


Thanks for the info.

Does "all nodes perform indexing into the shards they hold" mean that, because replica shards replicate documents from the primary shard themselves, the CPU load does not become too unbalanced?

I'll try testing it out to see what actually happens.

Although we store the original data in an RDB, I'm also a bit concerned about the risk of the data node holding the only primary shard going down.

If the node with the primary shard goes down, one of the replica shards will be promoted to primary, so that is not an issue.
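If you want to observe this during testing, the cat shards API shows which node holds the primary and which hold replicas, so you can confirm a replica was promoted after stopping a node (index name is a placeholder):

```
GET _cat/shards/my_index?v&h=index,shard,prirep,state,node
```

The prirep column shows "p" for the primary copy and "r" for replicas.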
