Troubleshooting high load avg, low cpu, low I/O on a sizeable production cluster


(Matt Klich (Elemental Void)) #1

Hello all. I'm hoping someone can help me figure out what's happening on my cluster. I'm seeing a very high load average but relatively little CPU and I/O usage. We're in the process of getting Marvel installed, but with slow restarts it's going to take a while.

Any help figuring this out would be appreciated!

Here are the details of our setup:

We run 3 dedicated master nodes (no data) and 34 rather large servers for data nodes (20 physical cores plus 20 hyperthreaded cores, 392G RAM, 7 SSDs for data in a ZFS raid1, spinning disk for the OS and logs). Each physical server runs 2 ES nodes; we've set the 'processors' setting to 20 for both threadpool.index and threadpool.search so we don't get oversized default pool sizes. Heap for the data nodes is rather large (72G); we know this is not recommended, and it's on our list of things to address. We've also configured 'zones' to give us rack awareness; as a side effect, a primary and its replica are never placed on the same physical server.
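
For what it's worth, the effective pool sizes and any queue/rejection pressure can be sanity-checked with the _cat/thread_pool API; a minimal check (assuming a node listening on localhost:9200) looks like:

    # active, queued, and rejected counts for the bulk, index, and search pools on every node
    curl -s 'localhost:9200/_cat/thread_pool?v'

    # or narrowed to just the search pool columns
    curl -s 'localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected'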

When I say high load I mean values up in the 160 range for the 1-minute load average. During these load peaks CPU usage per ES node (JVM) is between 100% and 140% or so; combined that's less than three full CPU cores. Disk I/O is very low: average wait is sub-millisecond and %util is typically no more than 5-8% per SSD. I/O wait (as reported by iostat) is between 0 and 0.2 percent.

During normal operation load is in the 5-12 range and CPU is still between 100% and 140% per ES node.
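
For reference, those numbers come from the usual Linux tools; roughly the following (nothing ES-specific, adjust intervals to taste):

    uptime                     # 1/5/15 minute load averages
    top -b -n 1 | grep java    # per-JVM CPU (%CPU can exceed 100 on multi-core boxes)
    iostat -x 5 3              # per-device await and %util
    vmstat 5 3                 # run queue, blocked processes, and iowait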

Looking at the hot_threads API, nothing sticks out as an issue.
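
For anyone following along, that's the nodes hot threads API; something like this (thread count and interval are just example values) dumps the busiest threads per node:

    # sample the 10 busiest threads on each node over a 500ms interval
    curl -s 'localhost:9200/_nodes/hot_threads?threads=10&interval=500ms'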

The cluster settings I think are relevant look like this:

    indices:
        cache:
            filter:
                size: 20%
        recovery:
            concurrent_streams: 8
            max_bytes_per_sec: 200mb
            file_chunk_size: 1024kb
    threadpool:
        bulk:
            queue_size: '200'
        index:
            processors: '20'
            queue_size: '100'
        search:
            processors: '20'
            size: '60'
            type: fixed

Note that we have a pending change to reduce the search threadpool size.
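
If memory serves, the fixed pool sizes in 1.x can be changed without a restart through the cluster update settings API; a sketch of that pending change (the value 40 is just a placeholder, not our final number) would be:

    # transient so it's easy to back out; make it persistent once we're happy with it
    curl -s -XPUT 'localhost:9200/_cluster/settings' -d '{
        "transient": {
            "threadpool.search.size": 40
        }
    }'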

Here are the relevant node settings:

    action:
        disable_delete_all_indices: 'true'
    bootstrap:
        mlockall: 'true'
    cluster:
        routing:
            allocation:
                awareness:
                    attributes: zone
    index:
        merge:
            policy:
                use_compound_file: 'false'
        translog:
            flush_threshold_size: 1g
    indices:
        memory:
            index_buffer_size: 20%
        recovery:
            concurrent_streams: '3'
            max_bytes_per_sec: 50mb
    node:
        data: 'true'
        master: 'false'
        zone: zone-a

Both settings groups above were retrieved from the cluster and node settings APIs, then trimmed and converted to YAML for readability.
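
For anyone who wants the raw JSON, these are the endpoints in question; flat_settings just makes the keys easier to scan:

    # cluster-wide persistent and transient settings
    curl -s 'localhost:9200/_cluster/settings?pretty&flat_settings=true'

    # per-node settings (the node-level YAML above was derived from this)
    curl -s 'localhost:9200/_nodes/settings?pretty&flat_settings=true'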


(Matt Klich (Elemental Void)) #2

Here are some iostat/vmstat logs: http://pastebin.com/HRcEcquC


(Matt Klich (Elemental Void)) #3

I should also mention that we're running ES 1.7.2 on most nodes, though we're rolling through an upgrade to 1.7.4.
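
If anyone wants to watch the rollout, the per-node versions are easy to check with the cat nodes API, e.g.:

    # one line per node with its ES version
    curl -s 'localhost:9200/_cat/nodes?v&h=name,version'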


(Matt Klich (Elemental Void)) #4

Turns out this was completely on us: we were processing deletes from our feed very inefficiently. Once we stopped performing a ridiculously large number of searches (on the order of 10-20 thousand per minute across 30+ indices) just to find the index we needed to delete from, everything "magically" dropped back to normal.

Go figure.
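
To make the fix concrete, here's a sketch of the general idea rather than our exact code (the index pattern, type, and document ID are hypothetical): one multi-index search reports which index holds a document via the _index field of the hit, and then a single delete goes straight to that index, instead of searching every index separately for every delete.

    # one search across all matching indices; the hit's _index says where the doc lives
    curl -s 'localhost:9200/events-*/_search?pretty' -d '{
        "size": 1,
        "_source": false,
        "query": { "ids": { "values": [ "doc-123" ] } }
    }'

    # then delete directly from the index reported above (example index/type/id)
    curl -s -XDELETE 'localhost:9200/events-2016.01.07/event/doc-123'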

