Hello, I'll detail my cluster specs first.
Versions: all servers run CentOS 8 and Elastic Stack 7.6.1.
Logstash nodes:
LOGSTASH-BEATS (used for sending Beats data to Elasticsearch)
LOGSTASH-SYSLOG (used for sending various syslog data sources to Elasticsearch)
Elasticsearch nodes:
Each node has 32 GB RAM (16 GB dedicated to the JVM heap), 16 CPU cores, and 5.5 TB of SSD storage. All nodes have all roles (a quick way to confirm this is sketched after the list):
ES-N1
ES-N2
ES-N3
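For completeness, the roles and relative load can be compared across the three nodes with the _cat/nodes API, something like this (a sketch only; I'm assuming the default HTTP port 9200 and no authentication on the endpoint):

# Compare roles, CPU and load averages across the nodes
curl -s 'http://ES-N1:9200/_cat/nodes?v&h=name,node.role,cpu,load_1m,load_5m,load_15m'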
The issue I'm seeing is that the system load on ES-N3 is far higher than on the other two nodes:
top - 00:28:55 up 23:43, 1 user, load average: 15.20, 15.01, 14.32
Tasks: 287 total, 2 running, 285 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.1 us, 4.0 sy, 0.0 ni, 55.9 id, 34.5 wa, 0.2 hi, 0.2 si, 0.0 st
MiB Mem : 32001.9 total, 5102.8 free, 18229.2 used, 8669.9 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 13127.2 avail Mem
...and for node 1:
top - 00:31:15 up 23:46, 1 user, load average: 4.67, 4.98, 4.85
Tasks: 289 total, 1 running, 288 sleeping, 0 stopped, 0 zombie
%Cpu(s): 14.8 us, 1.6 sy, 0.0 ni, 74.4 id, 8.6 wa, 0.3 hi, 0.3 si, 0.0 st
MiB Mem : 32001.9 total, 494.4 free, 18139.6 used, 13367.9 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 13217.2 avail Mem
...and node 2:
top - 00:33:01 up 23:48, 1 user, load average: 5.00, 4.99, 5.40
Tasks: 292 total, 1 running, 291 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.7 us, 1.1 sy, 0.0 ni, 81.7 id, 8.1 wa, 0.2 hi, 0.3 si, 0.0 st
MiB Mem : 32001.9 total, 207.7 free, 18355.7 used, 13438.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 12999.9 avail Mem
I can't see a problem with the CPU usage itself, but the wa (I/O wait) value on ES-N3 (34.5%) is significantly higher than on the other two nodes, which from what I've read could indicate a storage I/O problem. However, I've used various tools to check the storage, and the reads/writes on the Elasticsearch data partition on node 3 are pretty much the same as on the other two nodes.
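For reference, the sort of storage comparison I mean is along these lines, run on each node (a sketch; sysstat has to be installed, and the _cat call again assumes port 9200 without auth):

# Extended per-device stats (utilisation, await, throughput), three samples at 5-second intervals
iostat -x -d 5 3

# Per-process disk I/O, to confirm it's the Elasticsearch process doing the reads/writes
pidstat -d 5 3

# Shard count and disk usage per node, to rule out an obviously unbalanced data distribution
curl -s 'http://ES-N1:9200/_cat/allocation?v'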
The only way I've been able to get the system load back to normal levels is to stop Logstash on LOGSTASH-BEATS, which stops the ingestion of all Windows event log messages.
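In case it's relevant, the shard layout for those indices can be checked like this (a sketch; the winlogbeat-* pattern is an assumption on my part, adjust it to whatever the Beats pipeline actually writes to):

# List the Beats index shards, largest first, with the node each one sits on
curl -s 'http://ES-N1:9200/_cat/shards/winlogbeat-*?v&h=index,shard,prirep,store,node&s=store:desc'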
Stopping all queries does not make any difference to the problem.
I've looked at the hot threads output and node 3 isn't much different from the other two nodes, and the JVM heap usage looks OK on all nodes.
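For anyone who wants to see what I'm comparing, these are the sorts of calls I mean (sketch, again assuming port 9200 is reachable without auth):

# Hot threads snapshot for all nodes
curl -s 'http://ES-N1:9200/_nodes/hot_threads'

# Heap usage per node, and pressure on the write/search thread pools
curl -s 'http://ES-N1:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'
curl -s 'http://ES-N1:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'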
I just can't see why this particular node is under so much pressure. Can anyone help?