Elasticsearch Performance Issues Upgrading v6.2.3 to v7.13.1

We've been working on upgrading an Elasticsearch cluster from v6.2.3 to v7.13.1 and have run into a major performance roadblock.

We are running the two versions side by side and trying to cut over to the new one, but when we push more than about 10% of our overall traffic to the new Elasticsearch v7 cluster, its performance degrades drastically: query times on the v7 cluster jump to roughly 3x worse than v6.

If we drop the load back down to 10%, the query latency recovers and the v7 cluster runs roughly 3x faster than v6. This seems to indicate some kind of bottleneck in how the v7 cluster handles requests under load.

I've been reading through the documentation and looking for any settings that might cause this behavior.

I was hoping to post here to see if anyone has experienced something similar and has suggestions for what to try to remediate the issue.

Richard

How large is your cluster? What is the workload like?

@Christian_Dahlqvist I'll do my best to summarize below.

Our Elasticsearch v6.2.3 cluster has 15 nodes. The machines run Debian 9 with the Oracle JVM 1.8. Of the 15 nodes, 13 are our older hardware: 128 GB RAM, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (56 cores), and a 1 TB SSD. The other 2 are our newer hardware: 192 GB RAM, Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz (96 cores), and a 1 TB SSD. The network is a 60 Gb uplink port-channel group.

Our Elasticsearch v7.13.1 cluster has 12 nodes. The machines run Debian 10. The JVM is the one bundled with ES v7, openjdk version "16" 2021-03-16. All nodes are our newer hardware: 192 GB RAM, Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz (96 cores), and a 1 TB SSD. The network is a 160 Gb uplink port-channel group.

Both of the above clusters sit behind an HAProxy load balancer that routes requests to the nodes using least-connections balancing.

The index being hit has 2 primary shards with a primary store size of 35.5gb (v6) and 27.8gb (v7). The document count varies around 4-6 million or so depending on our daily inventory levels.

The primary index in question is heavily hit for most of the business day. We've arranged the clusters so that this index lives exclusively on roughly half of the cluster, using shard allocation filtering (routing allocation) to pin its shards to those nodes. Those nodes have the data-only role and are not master eligible. The replica count is configured so that every node the index lives on holds a copy of each shard, ensuring a read replica is available locally on every node.
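
For reference, here is a minimal sketch of the index-level settings described above; the index name, the node attribute name/value, and the replica count are placeholders rather than our exact values:

# Pin the index's shards to nodes tagged with a custom attribute
# (those nodes carry node.attr.node_group: dedicated_search in elasticsearch.yml)
# and set enough replicas for each of those nodes to hold a copy of every shard.
PUT /my-index/_settings
{
  "index.routing.allocation.require.node_group": "dedicated_search",
  "index.number_of_replicas": 8
}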

There is one exception: the new v7 cluster is split 3/9, with 3 nodes being master eligible (plus all other roles) and the other 9 nodes set to data-only, with this index assigned to them. This closely replicates the setup of the v6 cluster, where 8 of the 15 nodes are dedicated to the index in question.
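
If it's useful, the role split can be double-checked with the cat nodes API, something like:

# List each node's name, its roles, and whether it is the elected master
GET /_cat/nodes?v&h=name,node.role,master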

Traffic through HAProxy is directed to nodes based on whether or not it is a search request for this specific index. All requests for this index are routed directly to the data-only nodes, and those queries use the _local preference to optimize for performance.
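
Concretely, the searches against this index are sent along the lines of the request below; the index name and the trivial query body are placeholders (the real body is the aggregation query shown further down):

# preference=_local keeps the search on shard copies held by the node
# that receives the HAProxy-routed request.
GET /my-index/_search?preference=_local
{
  "size": 0,
  "query": { "match_all": {} }
}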

The workload itself consists of searches that primarily use aggregation queries. Structurally, most of them look like the one below.

{
  "aggs": {
    "top_hit_by_dup_key": {
      "aggs": {
        "find_max_score": {
          "max": {
            "script": {
              "source": "_score"
            }
          }
        },
        "top_hit": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "_score": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                ...
              ]
            }
          }
        }
      },
      "terms": {
        "field": "mydupkey",
        "order": [
          {
            "find_max_score": "desc"
          }
        ],
        "size": 2500
      }
    }
  },
  "from": 0,
  "query": {
    "function_score": {
      "functions": [
        ...
      ],
      "query": {
        "bool": {
          "filter": [
            ...
          ]
        }
      }
    }
  },
  "size": 0
}

We push load against the cluster as hard as we can during our runs, and at the moment the bottleneck is Elasticsearch. With v6 we can achieve between 15-65k searches/min depending on query complexity.

The queries executed against v6 and v7 are identical; we use NEST .NET nightly builds matching the respective Elasticsearch versions. This was our workaround to use both clients side by side in the same codebase.

We've measured the latency of the call versus the Elasticsearch took time, and the bottleneck seems to point to something within Elasticsearch v7 itself. We're stumped as to what setting or configuration changed in v7 that could cause such a degradation in performance under heavy load. Maybe lock contention of some kind, given the behavior.
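
If it helps with diagnosing, we can capture hot threads from the v7 data nodes while the load is above that 10% threshold; a rough sketch of the call (the sampling parameters are just examples):

# Sample the hottest threads on every node while the cluster is under load;
# heavy lock contention tends to show up as threads blocked or waiting on monitors.
GET /_nodes/hot_threads?threads=5&interval=500ms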

Any ideas or suggestions would be appreciated. Let me know if you'd like any of the above explained in more detail and I'll do my best.
