Severe I/O performance degradation after Ubuntu 18.04 upgrade

Hi,

We recently upgraded the OS on a couple of independently running hot/warm clusters. Each cluster has 9 nodes (3 masters, 3 hot nodes and 3 warm nodes) and runs on pretty decent hardware (32-64 GB RAM per node, terabytes of storage and Intel Xeon processors). Both clusters are running Elasticsearch 7.4. The clusters have been running great for years, and as the EOL of Ubuntu 16.04 was getting close, we had to upgrade to Ubuntu 18.04.

What's been done:

  • Upgraded from Ubuntu 16.04 to Ubuntu 18.04
  • Elasticsearch was uninstalled during the upgrade and reinstalled afterwards with the same Elasticsearch 7.4 version and the same config we had before the Ubuntu 18.04 upgrade.
  • openjdk-8-jre-headless was upgraded to openjdk-11-jre-headless (JVM check shown below)
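
Since the JDK was swapped underneath Elasticsearch, the first thing I checked was which JVM each node actually reports. Something like this (assuming the default localhost:9200 HTTP endpoint; adjust host/auth to your setup) lists the JVM version and name per node, plus whether compressed oops are still in use:

# Which JVM is each node actually running, and are compressed oops still on?
# (localhost:9200 is an assumption -- point it at any node in the cluster)
curl -s 'localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.version,nodes.*.jvm.vm_name,nodes.*.jvm.using_compressed_ordinary_object_pointers&pretty'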

It seemed to run perfectly fine for a few weeks, until we started to experience time-outs and a lot of backlog piling up: shard relocations taking 24 hours or more to move a single shard from one node to another, and force merges never completing.
It was very strange, as CPU and memory consumption were very low, but disk write speed was as low as 20 kbps per Elasticsearch thread according to htop. This affected all nodes in the cluster. Doing anything else on a node while it was in this state was also extremely slow, e.g. "apt update"/"apt install xyz".
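
While the cluster is in this state, a few standard cat APIs make the backlog visible. This is roughly what I have been running to watch it (nothing cluster-specific assumed beyond localhost:9200):

# Ongoing shard recoveries/relocations and how far along they are
curl -s 'localhost:9200/_cat/recovery?active_only=true&v&h=index,shard,stage,source_node,target_node,bytes_percent,translog_ops_percent'

# Thread pool pressure for write/force_merge on every node
curl -s 'localhost:9200/_cat/thread_pool/write,force_merge?v&h=node_name,name,active,queue,rejected'

# Recovery/merge throttling currently in effect (e.g. indices.recovery.max_bytes_per_sec)
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true' | grep -E 'recovery|merge'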

I did a write speed test on our much smaller test cluster and compared it to one of the clusters experiencing this problem:

root@node0101:~# dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1,1 GB, 1,0 GiB) copied, 1,18209 s, 908 MB/s

####

root@node0301:~# dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1,1 GB, 1,0 GiB) copied, 144,34 s, 7,4 MB/s
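
A single 1 GiB sequential dd write isn't very close to what Lucene actually does, so next time it happens I plan to run a small random-write test with fio as well. The flags below are just a sketch (adjust --directory to point at the same filesystem as the Elasticsearch data path; note it leaves a test file behind):

# 4k random writes with O_DIRECT for 60s on the data-path filesystem
fio --name=randwrite-test --directory=/var/lib/elasticsearch \
    --rw=randwrite --bs=4k --size=1g --ioengine=libaio --direct=1 \
    --numjobs=1 --runtime=60 --time_based --group_reporting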

Here is an iotop screenshot from one of the nodes in a cluster experiencing this problem:

We temporarily solved it by restarting Elasticsearch on all nodes; after restarting Elasticsearch on a node, performance went back to normal. This has happened on two of our clusters, on completely separate hardware/locations/networks. I expect the problem to show up again within a week or so.
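
Since restarting clears it, my plan for the next occurrence is to capture some diagnostics from an affected node before restarting, roughly like this (hot_threads is a standard API; iostat/pidstat come from the sysstat package):

# Hot threads on every node, to see where Elasticsearch is actually stuck
curl -s 'localhost:9200/_nodes/hot_threads?threads=10' > hot_threads.txt

# Per-device utilization and await times, 10 one-second samples
iostat -x 1 10 > iostat.txt

# Per-process disk I/O, 10 one-second samples (iotop -oPa gives a similar live view)
pidstat -d 1 10 > pidstat.txt

# Kernel hung-task warnings (tasks blocked in uninterruptible I/O wait), if any
dmesg -T | grep -i 'blocked for more than' | tail -n 20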

Any suggestions on what to do next? The Elasticsearch logs don't show anything of interest (as far as I can tell, anyway).

Sincerely,
Adrian M

Our jvm.options file is pretty much the standard one, but I believe it's the default file from version 6.4.

-Xms30g
-Xmx30g

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

# force the server VM (remove on 32-bit client JVMs)
-server

# explicitly set the stack size (reduce to 320k on 32-bit client JVMs)
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# use old-style file permissions on JDK9
-Djdk.io.permissionsUseCanonicalPath=true

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
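
For reference only (we are not running this): now that the nodes are on JDK 11, where CMS is deprecated, jvm.options does support JDK-version-conditional lines, so the GC flags could be gated by JDK version roughly like the commented G1 section in newer default jvm.options files. The values below are illustrative, not a recommendation:

## GC configuration gated by JDK version (illustrative sketch only)
# keep CMS on JDK 8/9
8-9:-XX:+UseConcMarkSweepGC
8-9:-XX:CMSInitiatingOccupancyFraction=75
8-9:-XX:+UseCMSInitiatingOccupancyOnly
# switch to G1 on JDK 10 and later
10-:-XX:+UseG1GC
10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30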
