Troubleshooting high load avg, low cpu, low I/O on a sizeable production cluster

Hello all. I'm hoping someone is able to offer some help figuring out what's happening on my cluster. I'm seeing very high load average but relatively little CPU and I/O usage. We’re in the process of getting Marvel installed but with slow restarts it’s going to take a while.

Any help figuring this out would be appreciated!

Here are the details of our setup:

We run 3 dedicated master nodes (no data) and 34 rather large servers as data nodes (20 physical cores plus 20 hyperthreads, 392 GB RAM, 7 SSDs for data in a ZFS RAID1, spinning disk for the OS and logs). Each physical server runs 2 ES nodes; we’ve set the 'processors' setting to 20 for both threadpool.index and threadpool.search so we don’t end up with oversized default pool sizes. Heap for the data nodes is rather large (72 GB); we know this is bigger than recommended, and it’s on our list of things to fix. We’ve also configured 'zones' so that we have rack awareness; as a side effect, a primary and its replica are never on the same physical server.

When I say high load I mean values up in the 160 range for the 1-minute load average. During these load peaks, CPU usage per ES node (JVM) is between 100% and 140% or so; combined, that’s less than three full CPU cores. Disk I/O is very low: average wait is sub-millisecond and %util is typically below 5-8% per SSD. I/O wait (as reported by iostat) is between 0 and 0.2 percent.

During normal operation load is in the 5-12 range and CPU is still between 100% and 140% per ES node.
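
For context, the numbers above come from the usual system tools; roughly this is how I'm collecting them (the iostat interval below is just an example, nothing special):

    # 1-minute load average
    uptime

    # per-process CPU; the 100-140% figures are the java (ES) processes in top
    top -b -n 1 | grep -i java

    # per-device await and %util, sampled every 5 seconds
    iostat -x 5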

Looking at the hot_threads API, nothing sticks out as an issue.
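
For reference, this is just the stock hot threads endpoint; the host and thread count here are only examples:

    # hottest threads across all nodes (threads=10 is arbitrary)
    curl -s 'http://localhost:9200/_nodes/hot_threads?threads=10'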

Here are what I think are the relevant cluster settings:

    indices:
        cache:
            filter:
                size: 20%
        recovery:
            concurrent_streams: 8
            max_bytes_per_sec: 200mb
            file_chunk_size: 1024kb
    threadpool:
        bulk:
            queue_size: '200'
        index:
            processors: '20'
            queue_size: '100'
        search:
            processors: '20'
            size: '60'
            type: fixed

Note that we have a pending change to reduce the search threadpool size.
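
If I remember right, threadpool settings are dynamically updatable on 1.x, so the plan is to push that change through the cluster settings API rather than restarting; a sketch of what that would look like (the size of 30 is a placeholder, not the value we've settled on):

    # shrink the fixed search pool without restarting nodes
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
        "transient": {
            "threadpool.search.size": 30
        }
    }'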

Here are the relevant node settings:

    action:
        disable_delete_all_indices: 'true'
    bootstrap:
        mlockall: 'true'
    cluster:
        routing:
            allocation:
                awareness:
                    attributes: zone
    index:
        merge:
            policy:
                use_compound_file: 'false'
        translog:
            flush_threshold_size: 1g
    indices:
        memory:
            index_buffer_size: 20%
        recovery:
            concurrent_streams: '3'
            max_bytes_per_sec: 50mb
    node:
        data: 'true'
        master: 'false'
        zone: zone-a

Both of the settings groups above were retrieved from the cluster and node APIs, then reduced and converted to YAML for human consumption.
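
Concretely, something along these lines; the reduce/convert-to-YAML step is our own scripting and isn't shown:

    # transient/persistent cluster-level settings
    curl -s 'http://localhost:9200/_cluster/settings?pretty'

    # per-node settings as each node sees them
    curl -s 'http://localhost:9200/_nodes/settings?pretty'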

Here are some iostat/vmstat logs: http://pastebin.com/HRcEcquC

I should also mention that we're running ES 1.7.2 on most nodes; we're rolling through an upgrade to 1.7.4.
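
For what it's worth, the rolling restarts follow the usual disable-allocation-around-each-restart pattern; a sketch of the settings calls involved (the actual stop/upgrade/start of each node is whatever your init system does and isn't shown):

    # before taking a node down: stop shard reallocation
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
        "transient": { "cluster.routing.allocation.enable": "none" }
    }'

    # ... upgrade and restart the node, wait for it to rejoin ...

    # once it's back: re-enable allocation and wait for green
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
        "transient": { "cluster.routing.allocation.enable": "all" }
    }'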

Turns out this was completely on us. We were processing deletes from our feed very inefficiently. Once we stopped performing a ridiculously large number of searches (on the order of 10-20 thousand per minute across 30+ indices) just to find the index we needed to delete from, everything "magically" dropped back to normal.
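
For anyone hitting something similar: the fix on our side was to stop doing a search per delete just to locate the index. If you already know (or can derive) which index a document lives in, the deletes can go straight through the bulk API in batches instead; a rough sketch of the shape of that request (the index/type/id values here are made up):

    # batch deletes into a single _bulk call instead of one search + delete per document
    curl -s -XPOST 'http://localhost:9200/_bulk' -d '
    { "delete": { "_index": "feed-2016.01.01", "_type": "event", "_id": "abc1" } }
    { "delete": { "_index": "feed-2016.01.01", "_type": "event", "_id": "abc2" } }
    '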

Go figure.