How to reshard the indices to overcome latency during high traffic in elasticsearch cluster

During high traffic times, our Elasticsearch cluster is experiencing latency, and we are considering a resharding strategy to optimize performance. Below is our current index setup and the proposed resharding plan:

Current Indices:

billing-index-v0.0: 45 GB, 20 shards (over-sharded)
billing-index-v1.0: 197 GB, 20 shards (over-sharded)
billing-index-v2.0: 1.4 GB, 20 shards (over-sharded)
billing-index-v4.0: 947 GB, 20 shards (over-populated)

Proposed Resharding: (25 GB Aimed per shard)

billing-index-v0.1: 45 GB, 2 shards
billing-index-v1.1: 197 GB, 8 shards
billing-index-v2.1: 1.4 GB, 2 shards
billing-index-v4.1: 947 GB, 40 shards

Questions:

  1. Will the proposed resharding strategy help alleviate latency during high traffic times?
  2. Are there any potential drawbacks or considerations we should be aware of?
  3. Should we consider a different resharding approach based on our current setup?

Query patterns:

{
  "size": 20,
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "filter": {
              "term": {
                "accountId": "9812374287423"
              }
            },
            "must_not": {
              "exists": {
                "field": "linkedAccount"
              }
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "isDeleted": {
              "value": true
            }
          }
        }
      ],
      "must": [
        {
          "multi_match": {
            "query": "Kelly",
            "fields": [
              "firstName.autocomplete",
              "lastName.autocomplete",
              "fullName.autocomplete",
              "emailId.autocomplete",
              "phone.autocomplete",
              "companyName.autocomplete",
              "country.autocomplete",
              "state.autocomplete",
              "city.autocomplete",
              "address.autocomplete"
            ]
          }
        }
      ]
    }
  },
  "sort": [
    {
      "modifiedDate": {
        "order": "desc"
      }
    }
  ]
}

Hardware specifications: 4 Nodes, 16 Core CPUs, 32 GB Memory.

Any help is much appreciated. Thanks in advance.

How many primary and replica shards do these indices have?

That depends on what is causing the slowness. How fast is it during non-peak times and how slow is it during peak times?

It would help if you could provide some additional context around the use case, e.g.:

  • Which version of Elasticsearch are you using?
  • How many concurrent queries are you serving? How are these distributed across the indices?
  • Are you continously adding/updating/deleteing data from these indices? If so, at what rate?
  • What is the average size of your documents?

Given that you have more data than can fit in the operating system page cache, the type of storage used is very important and can very well be the bottleneck. What type of storage are you using? Local SSDs?

Have you checked how the storage is performing?

Do you have monitoring installed so you can see how high CPU and heap usage is?

Hi @Christian_Dahlqvist
Thank you for taking time to write back.

How many primary and replica shards do these indices have?

20 Primary shards each with 1 replica = 40 shards in total, per index

  • Which version of Elasticsearch are you using?
  • How many concurrent queries are you serving? How are these distributed across the indices?
  • Are you continously adding/updating/deleteing data from these indices? If so, at what rate?
  • What is the average size of your documents?

Elasticsearch version 7.9

Application server request for
High traffic - Search (/search route) 350/min , 5/s
Low traffic - Search (/search route) 160/min , 2.5/s

We are using alias name for reading from the indices billing-alias
billing-index-v0.0, Search, Update only,
Indexing Rate - Total Shards 0.18/S, Primary Shards - 0.09/s
Search Rate - 155/s (high traffic) 33/s low traffic (number of search requests being executed across primary and replica shards)

billing-index-v1.0, Search, Update Only
Indexing Rate - Total Shards 1.78/S, Primary Shards - 0.88/s
Search Rate - 152/s (high traffic) 33/s low traffic

billing-index-v2.0, Search, Update Only
Indexing Rate - Total Shards 0.04/S, Primary Shards - 0.02/s
Search Rate - 157/s (high traffic) 34/s low traffic

billing-index-v3.0, Search, Update Only
Indexing Rate - Total Shards 4.65/S, Primary Shards - 2.28/s
Search Rate - 157/s (high traffic) 40/s low traffic

billing-index-v4.0, Search, Update Only
Indexing Rate - Total Shards 19.6/s, Primary Shards - 9.58/s
Search Rate - 158/s (high traffic) 33/s low traffic

billing-index-v5.0, Write, Search, Update (Current write index where new docs getting created)
Indexing Rate - Total Shards 16.64/S, Primary Shards - 8.05/s
Search Rate - 155/s (high traffic) 33/s low traffic

Average size of the document is 900 bytes.

What type of storage are you using? Local SSDs?

Yes.

Have you checked how the storage is performing?

Disk IOPS
Write:
Max: 0.08k/s
Mean: 0.07/s
Min: 0.06/s

Read
Max: 1.99k/s
Mean: 1.19/s
Min: 0.8k/s

Do you have monitoring installed so you can see how high CPU and heap usage is?

CPU is well within 20 percent even during high traffic time and as far heap usage is concerned no memory pressure observed. No frequent GCs and always stays within 70 and cycles

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.