Elasticsearch 7.6 is underperforming 6.8

Hello,

I upgraded the Elastic Stack from 6.8.2 to 7.6.2 (both clusters are currently running on EKS, with Kubernetes nodes on AWS m5.4xlarge machines: 16 vCPUs, 64GB RAM, and the storage is NOT SSD) and I am experiencing degraded performance on 7.6.2.

6.8.2 setup:             | 7.6.2 setup:
                         |
nodes:                   |
  client:                |   coordinator:
    replicas: 4          |     replicas: 6
    jvm heap: 2gb        |     jvm heap: 1gb
    cpu: 1               |     cpu: 1
    mem: 4gb             |     mem: 2gb
  data:                  |   data:
    replicas: 8          |     replicas: 8
    jvm heap: 8gb        |     jvm heap: 8gb
    cpu: 1               |     cpu: 6
    mem: 16gb            |     mem: 16gb
  master:                |   master:
    replicas: 3          |     replicas: 3
    jvm heap: 1gb        |     jvm heap: 4gb
    cpu: 1               |     cpu: 2
    mem: 2gb             |     mem: 8gb
                         |
indices:                 |
  number_of_shards:      |   
    2                    |     6
  number_of_replicas:    |
    1                    |     1 
  refresh_interval:      |
    30s                  |     60s
  active_primary_shards: |
    12241                |     1591
  active_shards:         |
    24482                |     3182

6.8.2 uses daily indices, so it has far more shards.
7.6.2 does not use daily indices; instead it uses a rollover strategy with time-series indices that carry a zero-padded numeric suffix (e.g. fluentd-<k8s-namespace>-000001). The Rollover API is called every hour, rolling over hot indices once they reach 7gb and shrinking warm indices. (Note: I am not using ILM or Curator since I am using dynamic index naming.)
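
For reference, this is roughly the shape of the hourly rollover call described above (a minimal sketch; the alias name is a placeholder and the real request my job sends may differ):

POST /fluentd-<k8s-namespace>/_rollover
{
  "conditions": {
    "max_size": "7gb"
  }
}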

Everything else seems to be the same (a sketch of these settings is below):

  • write.queue_size is 200 on both
  • indices.memory.index_buffer_size is 50% on both
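
For clarity, a hypothetical elasticsearch.yml fragment for those two settings (the values are taken from this post, not from the actual config files):

# depth of the write thread pool queue before requests are rejected with 429
thread_pool.write.queue_size: 200
# share of heap used for the indexing buffer (the default is 10%)
indices.memory.index_buffer_size: 50%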

Although 6.8.2 holds far more data (7.6.2 is fairly new), it is performing better. Both clusters receive the same amount of data and both are returning a ton of 429s, yet 6.8.2 has no consumer lag whereas 7.6.2 does (according to topk(50, kafka_consumergroup_group_max_lag_seconds{group=~"logstash-es"})).

rate(elasticsearch_indices_indexing_index_total{cluster="$cluster",name=~"$name"}[$interval]) shows that 99% of the time, the data nodes are indexing!

I have tried:

  • more CPU (though CPU usage avg is low)
  • more memory (though memory usage avg is low)
  • more JVM heap (though JVM heap usage avg is low)

I have not tried:

  • more Logstash consumers (since the write queues were full, it didn't make sense)
  • write.queue_size to 1000 (as this will only take care of 429s)
  • more data nodes (I was going to try this soon)

What could I be missing?

It seems hard to compare the two if the data and usage differ quite a bit between the clusters, but why the low heap on the coordinators? That would seem important for managing queries and results.

No high CPU, high heap, or long GC on any of them? What about disk IO queues, since these are not SSDs?

Do the thread pool stats show queuing?


I can up the heap on the coordinator nodes and see if things improve, but the heap usage is pretty low: it averages between 45-80% and is currently around 55%.
EDIT: My apologies, it looks like I made a mistake in my OP. The heap for the coordinator nodes is set at 4gb with 8gb of memory. I cannot edit the OP.

No high CPU or heap. GC time shows an average of 10ms on all data nodes. Disk usage is low. I am not sure how to check disk IO.

Thread pool shows there are queues for every data node at all times.

What type and size of storage are you using? Indexing can be very I/O intensive so disk performance can quickly become a bottleneck. Check logs and cluster stats for evidence of merge throttling, which can indicate your storage is too slow and not able to keep up. You might also be able to run iostat -x on the nodes and see what that tells you about disk utilization and iowait.

Both clusters (6.8.2 and 7.6.2) are currently using AWS's Throughput Optimized HDD ("st1") volumes, specifically the aws-ebs st1 volumes. And I am seeing no problems on 6.8.2.

It looks like iostat is not installed in the official ES image (CentOS). But I was able to pull up io_stats for one of the 8 data nodes from /_nodes/stats:

"io_stats" : {
  "devices" : [
    {
      "device_name" : "nvme1n1",
      "operations" : 72648455,
      "read_operations" : 1996898,
      "write_operations" : 70651557,
      "read_kilobytes" : 173918156,
      "write_kilobytes" : 2032345696
    }
  ],
  "total" : {
    "operations" : 72648455,
    "read_operations" : 1996898,
    "write_operations" : 70651557,
    "read_kilobytes" : 173918156,
    "write_kilobytes" : 2032345696
  }
}

Nothing jumps out at me. Am I missing something? Or does this look normal?

If you are on HDDs (spinning metal things), that's likely the bottleneck, though I'd think you'd see it in both versions. You'll likely see an iostat IO queue, and merge throttling via:

GET /_nodes/stats - under merges, throttle time will increase, I assume (see total_throttled_time_in_millis); also check index.is_throttled

Can't tell much from the ES stats; I need that iostat -x output - you can install it easily with yum (it should be there, but maybe not in a super minimal install).

yum install sysstat

I usually use:
iostat -xdk 2

There may be changes in disk I/O between the versions you are comparing, but I could not put my finger on it. Given the low I/O performance of the storage you are using, I suspect even a small change may make a big difference, assuming the indexing and query loads and patterns are the same. I would recommend upgrading to faster storage.

Agreed that faster disks are ideal, though I imagine cost matters (you can also add more disks, as ES will use multiple data paths). Is the major OS version the same across your K8s nodes, e.g. CentOS 6 vs. 7? Otherwise this gets into the infinite fun of disk stacks, IO tuning, etc. that has mostly gone away in the world of SSDs, but your iostat output would still be helpful.

Wasn't there a flush or translog setting change from v6 to v7? Something that might generate more fsync() calls, which are deadly to HDDs. Via translog.durability? The defaults seem the same, but you might check them and try async just to see if this is a key bottleneck (be sure you understand the risk):
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html
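
For example, a rough sketch of flipping it dynamically on one index and later restoring the default (the index name is a placeholder; with async, acknowledged writes can be lost if a node crashes):

PUT /fluentd-<k8s-namespace>-000001/_settings
{
  "index.translog.durability": "async"
}

# and to change it back afterwards:
PUT /fluentd-<k8s-namespace>-000001/_settings
{
  "index.translog.durability": "request"
}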

For AWS, make sure the instances are EBS-optimized, if that applies to these disk types. Also, are these on the same K8s cluster or different ones, i.e. are there any cluster differences, and of course any resource sharing with other workloads? Adding RAM would seem easy to try, but the node-level IO info is the most useful, though hard to do much about, as there are not a lot of knobs.

Just looked this up for both clusters:

6.8.2:
"merges" : {
  "current" : 0,
  "current_docs" : 0,
  "current_size_in_bytes" : 0,
  "total" : 1951274,
  "total_time_in_millis" : 1914707083,
  "total_docs" : 17775217277,
  "total_size_in_bytes" : 9534469482648,
  "total_stopped_time_in_millis" : 6173311,
  "total_throttled_time_in_millis" : 391270437,
  "total_auto_throttle_in_bytes" : 204626466137
},
7.6.2:
"merges" : {
  "current" : 0,
  "current_docs" : 0,
  "current_size_in_bytes" : 0,
  "total" : 144392,
  "total_time_in_millis" : 84154676,
  "total_docs" : 1193124340,
  "total_size_in_bytes" : 656514864909,
  "total_stopped_time_in_millis" : 0,
  "total_throttled_time_in_millis" : 8862331,
  "total_auto_throttle_in_bytes" : 71011081830
}

Not sure if these numbers are meaningful since the two ES clusters have been running for a different number of days.

Also, I've tried yum install sysstat but I got:

[elasticsearch@elastic-stack-coordinator-0 ~]$ yum install sysstat
Loaded plugins: fastestmirror, ovl
ovl: Error while doing RPMdb copy-up:
[Errno 13] Permission denied: '/var/lib/rpm/.dbenv.lock'
You need to be root to perform this command.

Looks like I'll have to create a new image with sysstat...

Yes. It matters! But I can talk to the team and try to switch to SSDs.

They are both running on the same AWS EKS cluster with more than enough RAM.

Thank you all for helping!

For the first few weeks 7.6.2 outperformed 6.8.2 (and everyone was happy). Not sure what has changed since then but it is now underperforming. I looked at everything I could think of and even restarted from scratch multiple times. I will try to upgrade to SSDs but it's probably a long shot given how much money we already spend on AWS. Thanks!

I've noticed that os.mem.used_percent is at or near max for all nodes (coordinator/data/master):

"os" : {
  "timestamp" : 1597067326462,
  "cpu" : {
    "percent" : 4,
    "load_average" : {
      "1m" : 0.92,
      "5m" : 0.94,
      "15m" : 0.83
    }
  },
  "mem" : {
    "total_in_bytes" : 66008121344,
    "free_in_bytes" : 755060736,
    "used_in_bytes" : 65253060608,
    "free_percent" : 1,
    "used_percent" : 99
  },

Is this normal? I am seeing this behavior in both clusters.

Yes, that is normal as it includes the OS page cache.


Hmm, that load average of under 1 implies we are not waiting on disk very much; not what I'd expect on an IO-bound system with lots of parallel fsync() calls. I still suggest you try the async translog at some point; you can change it dynamically for an hour and see if things improve - don't forget to change it back.

You don't have root on these VMs? Without that it's kind of hard, but note iostat may be installed and just not in your path; look for it (find / -name iostat). Otherwise 'sar' might be installed, and vmstat can show IO and blocked-process info but not disk queues - BUT you can see the external disk queue in AWS; I suggest you take a look at the nodes in CloudWatch.
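
For instance, a hypothetical AWS CLI query for the EBS queue length on one of the data volumes (the volume ID and time range are placeholders):

aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeQueueLength \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2020-08-10T00:00:00Z --end-time 2020-08-10T01:00:00Z \
  --period 300 --statistics Average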

For the stats you posted, you need to take them over time; the absolute numbers aren't that important, but how much they change, say in a minute or 15 minutes, is what matters for gauging whether they are throttling at all right now. V7 shows 8M millis of throttle vs. 84M in merging, or about 10%, on 71G of size; V6 is about 20%, but you should get better data a few minutes apart. Clearly both are throttling, but I have no idea if that's a lot or not.

So if IO is good, I wonder what the problem is...

I do not have root - I think that's the default. I am using the official Elastic Helm charts (see here).

Luckily, it looks like I have vmstat (no iostat or sar):

[elasticsearch@elastic-stack-data-0 ~]$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  5      0 5130784 394744 33181804    0    0    24   287    3    2  5  1 73 21  0

Not sure what to make of it.

I was monitoring the numbers, and that counter is now at 8940977 (up from 8862331).

Gee, I gotta figure out how to do nice quoting like that on this discuss platform.

Overall, it seems there's no clear answer; all I can think of is faster disks or setting async durability. The latter is fast and easy to try; the former costs money and time.

I'm just guessing there's not much IO; it's hard to tell. And in a container, I'm not really sure how well vmstat tells us about IO; it's kind of hazy in that sense, I'd think.

For vmstat, run something like 'vmstat 2 5' to get 10 seconds' worth of data.

For the ES stats, you have to sample both clusters, say 1 minute apart for 2-3 minutes, so you can calculate the rate in throttled millis/minute and see whether they are similar - but it's not really worth the effort; you have some throttling, but it seems not huge. That said, your number went up by about 80K millis, or 80 seconds, which is a lot if that happened in only a minute, though I don't know much about this stat.
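
Something along these lines would do it (a rough sketch; the host/port and filter_path fields are assumptions on my part):

# sample the cumulative merge-throttle counter once a minute and diff
# consecutive values to get throttled millis per minute
for i in 1 2 3; do
  date
  curl -s 'http://localhost:9200/_nodes/stats/indices?filter_path=nodes.*.name,nodes.*.indices.merges.total_throttled_time_in_millis'
  echo
  sleep 60
done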

Sorry can't be more helpful.

Wow. This is incredible. Setting "index.translog.durability": "async" greatly improved indexing throughput and got rid of all of the lag. I will have to really understand the pros and cons before permanently settling on this method but this is amazing progress. Thank you so much, Steve!

The main con is that Elasticsearch no longer guarantees that your data is written to disk before acknowledging it, so an unexpected restart may result in lost data. It's up to you to decide whether that's a big deal or not. Maybe on EC2 it's not so bad, instances tend to just vanish rather than restarting with stale data.

Spinning disks really are terrible for indexing performance. Have you considered a hot/warm architecture, in which you have a few small/fast "hot" nodes doing the indexing (with SSDs) and then, after rollover, you migrate those indices to larger/slower/cheaper "warm" nodes which only have spinning disks?

There's other advice in the manual on improving indexing speed too; in particular, larger bulks targeting fewer shards should have a similar effect to using async durability, but without the weaker durability guarantees.
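
If the bulks come from Logstash, one way to get larger bulk requests is to raise the pipeline batch size; a hedged logstash.yml sketch (the values here are illustrative, not a recommendation):

# each Elasticsearch bulk request is built from one pipeline batch,
# so a larger batch means fewer, bigger bulks
pipeline.batch.size: 1000   # events per worker per batch (default 125)
pipeline.batch.delay: 50    # ms to wait when filling a batch (default 50)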

That's a pretty strong indication that your cluster is underpowered. Serving 429s is something of a last resort to avoid completely blowing up the cluster and (depending on where exactly the 429s are coming from) it can be quite expensive in its own right. I recommend working towards hitting a 429 only very rarely.


That was my understanding. I was wondering if there were any other drawbacks.

I have looked at this concept. I am hesitant about SSDs right now. AWS spending is pretty high as-is. But maybe I can even out the cost with lower-tier HDDs for the warm nodes. Definitely something to consider!

I have looked at this and we are already doing 7 out of 10! The missing 3 are "Disable swapping", "Use faster hardware", and "Use cross-cluster replication to prevent searching from stealing resources from indexing".

Oh my. I thought this was normal! Even the well-performing 6.8.2 cluster was showing a bunch of 429s. Hm... I will have to revisit this with my team. Maybe SSDs are necessary. Thank you, David!
