Hello all. I'm hoping someone can help me figure out what's happening on my cluster: I'm seeing a very high load average but relatively little CPU and I/O usage. We're in the process of getting Marvel installed, but with slow restarts it's going to take a while.
Any help figuring this out would be appreciated!
Here are the details of our setup:
We run 3 dedicated master nodes (no data) and 34 rather large servers as data nodes (20 physical cores plus 20 hyperthread cores, 392G RAM, 7 SSDs for data in a ZFS raid1, and a spinning disk for the OS and logs). Each physical server hosts 2 ES nodes; we've set 'processors' to 20 for both threadpool.index and threadpool.search so that we don't get the large default pool sizes. Heap for the data nodes is rather large (72G); we know this is not recommended, and it's on our list of things to figure out. We've also configured 'zones' to give us rack awareness; as a side effect of this, we're guaranteed never to have a primary and its replica on the same physical server.
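If it helps to sanity-check that, the nodes info API reports the thread pool configuration each node actually came up with. A minimal sketch, assuming a node answers HTTP on localhost:9200 and the Python requests library is available:

    import json
    import requests

    # Nodes info API: shows the thread pool config each node booted with.
    resp = requests.get('http://localhost:9200/_nodes/thread_pool')
    resp.raise_for_status()

    for node_id, node in resp.json()['nodes'].items():
        print(node.get('name', node_id))
        for pool in ('index', 'bulk', 'search'):
            print('  %s: %s' % (pool, json.dumps(node['thread_pool'].get(pool, {}))))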
When I say high load, I mean values up in the 160 range for the 1-minute load average. During these load peaks, CPU usage per ES node (JVM) is between 100% and 140% or so; combined, that's less than three full CPU cores. Disk I/O is very low: average wait is in the sub-millisecond range, and %util is typically no more than 5-8% per SSD. I/O wait (as reported by iostat) is between 0 and 0.2 percent.
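If it would help, during the next spike I can also grab a breakdown of thread states, since the Linux load average counts uninterruptible (D-state) threads as well as runnable ones. A quick sketch that just walks /proc; nothing in it is specific to our setup:

    import collections
    import glob

    # Count threads by scheduler state across the whole box.
    # State is the field right after the (comm) in /proc/<pid>/task/<tid>/stat:
    # R = runnable, D = uninterruptible sleep, S = sleeping, etc.
    states = collections.Counter()
    for stat_path in glob.glob('/proc/[0-9]*/task/[0-9]*/stat'):
        try:
            with open(stat_path) as f:
                after_comm = f.read().rsplit(')', 1)[1].split()
        except IOError:
            continue  # thread exited while we were scanning
        states[after_comm[0]] += 1

    print(dict(states))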
During normal operation, load is in the 5-12 range and CPU is still between 100% and 140% per ES node.
Looking at the hot_threads API, nothing sticks out as an issue.
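For reference, hot_threads can be sampled per type; the 'type' parameter takes cpu, wait, or block, and given the low CPU the wait and block samples seemed worth a look too. A minimal sketch, assuming a node answering on localhost:9200:

    import requests

    # Hot threads from every node; 'type' can be cpu, wait, or block.
    for sample_type in ('cpu', 'wait', 'block'):
        resp = requests.get('http://localhost:9200/_nodes/hot_threads',
                            params={'threads': 5, 'type': sample_type})
        print('=== %s ===' % sample_type)
        print(resp.text)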
Here are what I think are the relevant cluster settings:
    indices:
      cache:
        filter:
          size: 20%
      recovery:
        concurrent_streams: 8
        max_bytes_per_sec: 200mb
        file_chunk_size: 1024kb
    threadpool:
      bulk:
        queue_size: '200'
      index:
        processors: '20'
        queue_size: '100'
      search:
        processors: '20'
        size: '60'
        type: fixed
Note that we have a pending change to reduce the search threadpool size.
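For what it's worth, on a 1.x cluster (if I remember right) that change can be applied on the fly via the cluster settings API rather than waiting for a slow rolling restart. A sketch only; the value 40 is a placeholder, not the number we've settled on:

    import requests

    # Shrink the search pool on the fly (transient; use "persistent" to survive a full restart).
    # The value 40 is a placeholder.
    resp = requests.put('http://localhost:9200/_cluster/settings',
                        json={'transient': {'threadpool.search.size': 40}})
    print(resp.json())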
Here are the relevant node settings:
    action:
      disable_delete_all_indices: 'true'
    bootstrap:
      mlockall: 'true'
    cluster:
      routing:
        allocation:
          awareness:
            attributes: zone
    index:
      merge:
        policy:
          use_compound_file: 'false'
      translog:
        flush_threshold_size: 1g
    indices:
      memory:
        index_buffer_size: 20%
      recovery:
        concurrent_streams: '3'
        max_bytes_per_sec: 50mb
    node:
      data: 'true'
      master: 'false'
      zone: zone-a
Both of the settings groups listed above were retrieved from the cluster and node settings APIs, then reduced and converted to YAML for human consumption.
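For anyone curious, the dump above is roughly the equivalent of doing this (a sketch; assumes PyYAML, the requests library, and a node answering on localhost:9200):

    import requests
    import yaml  # PyYAML

    # Pull settings from the cluster and nodes APIs, then dump as YAML.
    cluster = requests.get('http://localhost:9200/_cluster/settings').json()
    nodes = requests.get('http://localhost:9200/_nodes/settings').json()

    print(yaml.safe_dump(cluster, default_flow_style=False))
    # Any one data node is representative; they are all configured the same way.
    one_node = next(iter(nodes['nodes'].values()))
    print(yaml.safe_dump(one_node['settings'], default_flow_style=False))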