Hello all. I'm hoping someone can help me figure out what's happening on my cluster. I'm seeing a very high load average but relatively low CPU and I/O utilization. We're in the process of getting Marvel installed, but with slow restarts that's going to take a while.
Any help figuring this out would be appreciated!
Here are the details of our setup:
We run 3 dedicated master nodes (no data) and 34 rather large physical servers as data nodes (20 physical cores plus 20 hyperthread cores, 392 GB RAM, 7 SSDs for data in a ZFS raid1, spinning disk for the OS and logs). Each physical server runs 2 ES nodes. We've set the 'processors' setting to 20 for both threadpool.index and threadpool.search so we don't get the large default pool sizes that 40 apparent cores would produce. Heap for the data nodes is rather large (72 GB); we know this is not recommended, and it's on our list of things to fix. We've also configured 'zones' to get rack awareness; as a side effect, we're guaranteed never to have a primary and its replica on the same physical server.
When I say high load, I mean values up into the 160 range for the 1-minute load average. During these load peaks, CPU usage per ES node (JVM process) is between 100% and 140% or so; combined, that's less than three full CPU cores. Disk I/O is very low: average wait is sub-millisecond, %util is typically below 5-8% per SSD, and iowait (as reported by iostat) is between 0 and 0.2 percent.
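For context on the load numbers: Linux counts both runnable ("R") and uninterruptible-sleep ("D") threads toward the load average, so a high load with idle CPUs and idle disks often means lots of threads parked in D state. A generic one-liner (not specific to our setup) to see the breakdown:

```shell
# Count every thread on the box by scheduler state; both "R" (runnable)
# and "D" (uninterruptible sleep) contribute to the load average.
ps -eLo state= | sort | uniq -c | sort -rn
```

A large "D" count during a load spike would point at something blocking in the kernel rather than CPU or ordinary disk I/O.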
During normal operation, load is in the 5-12 range and CPU is still between 100% and 140% per ES node.
Looking at the hot_threads API, nothing stands out as an issue.
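In case anyone wants to see the same view, this is the standard hot_threads endpoint I'm querying (host/port are placeholders for whatever your client node listens on):

```shell
# Sample the hottest threads on every node; threads/interval/type are
# standard parameters of the hot_threads API.
curl -s 'localhost:9200/_nodes/hot_threads?threads=10&interval=500ms&type=cpu'
```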
Here are what I think are the relevant cluster settings:
```yaml
indices:
  cache:
    filter:
      size: 20%
  recovery:
    concurrent_streams: 8
    max_bytes_per_sec: 200mb
    file_chunk_size: 1024kb
threadpool:
  bulk:
    queue_size: '200'
  index:
    processors: '20'
    queue_size: '100'
  search:
    processors: '20'
    size: '60'
    type: fixed
```
Note that we have a pending change to reduce the search threadpool size.
Here are the relevant node settings:
```yaml
action:
  disable_delete_all_indices: 'true'
bootstrap:
  mlockall: 'true'
cluster:
  routing:
    allocation:
      awareness:
        attributes: zone
index:
  merge:
    policy:
      use_compound_file: 'false'
  translog:
    flush_threshold_size: 1g
indices:
  memory:
    index_buffer_size: 20%
  recovery:
    concurrent_streams: '3'
    max_bytes_per_sec: 50mb
node:
  data: 'true'
  master: 'false'
  zone: zone-a
```
Both settings groups above were retrieved from the cluster and node settings APIs, then trimmed and converted to YAML for human consumption.
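For completeness, these are the endpoints I mean (standard ES APIs; adjust the host for your cluster):

```shell
# Cluster-level settings (persistent and transient overrides)
curl -s 'localhost:9200/_cluster/settings?pretty'

# Per-node settings, as reported by each node
curl -s 'localhost:9200/_nodes/settings?pretty'
```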