Hello all. I'm hoping someone can help me figure out what's happening on my cluster: I'm seeing a very high load average but relatively little CPU and I/O usage. We're in the process of getting Marvel installed, but with slow restarts it's going to take a while.
Any help figuring this out would be appreciated!
Here are the details of our setup:
We run 3 dedicated master nodes (no data) and 34 rather large servers as data nodes (20 physical cores plus 20 hyperthread cores, 392G RAM, 7 SSDs for data in a ZFS raid1, and a spinning disk for the OS and logs). Each physical server hosts 2 ES nodes; we've set 'processors' to 20 for both threadpool.index and threadpool.search so that we don't get the large default pool sizes. Heap for the data nodes is rather large (72G); we know this is not recommended, and it's on our list of things to figure out. We've also configured 'zones' to give us rack awareness; as a side effect of this, we're guaranteed never to have a primary and its replica on the same physical server.
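If it helps to sanity-check that, the nodes info API reports the thread pool configuration each node actually came up with. A minimal sketch, assuming a node answers HTTP on localhost:9200 and the Python requests library is available:

    import json
    import requests

    # Nodes info API: shows the thread pool config each node booted with.
    resp = requests.get('http://localhost:9200/_nodes/thread_pool')
    resp.raise_for_status()

    for node_id, node in resp.json()['nodes'].items():
        print(node.get('name', node_id))
        for pool in ('index', 'bulk', 'search'):
            print('  %s: %s' % (pool, json.dumps(node['thread_pool'].get(pool, {}))))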
When I say high load, I mean values up in the 160 range for the 1-minute load average. During these load peaks, CPU usage per ES node (JVM) is between 100% and 140% or so; combined, that's less than three full CPU cores. Disk I/O is very low: average wait is in the sub-millisecond range, and %util is typically no more than 5-8% per SSD. I/O wait (as reported by iostat) is between 0 and 0.2 percent.
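If it would help, during the next spike I can also grab a breakdown of thread states, since the Linux load average counts uninterruptible (D-state) threads as well as runnable ones. A quick sketch that just walks /proc; nothing in it is specific to our setup:

    import collections
    import glob

    # Count threads by scheduler state across the whole box.
    # State is the field right after the (comm) in /proc/<pid>/task/<tid>/stat:
    # R = runnable, D = uninterruptible sleep, S = sleeping, etc.
    states = collections.Counter()
    for stat_path in glob.glob('/proc/[0-9]*/task/[0-9]*/stat'):
        try:
            with open(stat_path) as f:
                after_comm = f.read().rsplit(')', 1)[1].split()
        except IOError:
            continue  # thread exited while we were scanning
        states[after_comm[0]] += 1

    print(dict(states))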
During normal operation, load is in the 5-12 range and CPU is still between 100% and 140% per ES node.
Looking at the hot_threads API, nothing sticks out as an issue.
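For reference, hot_threads can be sampled per type; the 'type' parameter takes cpu, wait, or block, and given the low CPU the wait and block samples seemed worth a look too. A minimal sketch, assuming a node answering on localhost:9200:

    import requests

    # Hot threads from every node; 'type' can be cpu, wait, or block.
    for sample_type in ('cpu', 'wait', 'block'):
        resp = requests.get('http://localhost:9200/_nodes/hot_threads',
                            params={'threads': 5, 'type': sample_type})
        print('=== %s ===' % sample_type)
        print(resp.text)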
Here are what I think are the relevant cluster settings:
    indices:
      cache:
        filter:
          size: 20%
      recovery:
        concurrent_streams: 8
        max_bytes_per_sec: 200mb
        file_chunk_size: 1024kb
    threadpool:
      bulk:
        queue_size: '200'
      index:
        processors: '20'
        queue_size: '100'
      search:
        processors: '20'
        size: '60'
        type: fixed
Note that we have a pending change to reduce the search threadpool size.
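For what it's worth, on a 1.x cluster (if I remember right) that change can be applied on the fly via the cluster settings API rather than waiting for a slow rolling restart. A sketch only; the value 40 is a placeholder, not the number we've settled on:

    import requests

    # Shrink the search pool on the fly (transient; use "persistent" to survive a full restart).
    # The value 40 is a placeholder.
    resp = requests.put('http://localhost:9200/_cluster/settings',
                        json={'transient': {'threadpool.search.size': 40}})
    print(resp.json())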
Here are the relevant node settings:
    action:
      disable_delete_all_indices: 'true'
    bootstrap:
      mlockall: 'true'
    cluster:
      routing:
        allocation:
          awareness:
            attributes: zone
    index:
      merge:
        policy:
          use_compound_file: 'false'
      translog:
        flush_threshold_size: 1g
    indices:
      memory:
        index_buffer_size: 20%
      recovery:
        concurrent_streams: '3'
        max_bytes_per_sec: 50mb
    node:
      data: 'true'
      master: 'false'
      zone: zone-a
Both of the settings groups listed above were retrieved from the cluster and node settings APIs, then reduced and converted to YAML for human consumption.
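For anyone curious, the dump above is roughly the equivalent of doing this (a sketch; assumes PyYAML, the requests library, and a node answering on localhost:9200):

    import requests
    import yaml  # PyYAML

    # Pull settings from the cluster and nodes APIs, then dump as YAML.
    cluster = requests.get('http://localhost:9200/_cluster/settings').json()
    nodes = requests.get('http://localhost:9200/_nodes/settings').json()

    print(yaml.safe_dump(cluster, default_flow_style=False))
    # Any one data node is representative; they are all configured the same way.
    one_node = next(iter(nodes['nodes'].values()))
    print(yaml.safe_dump(one_node['settings'], default_flow_style=False))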