High system load on one node in a three-node cluster

Hello, I'll detail my cluster specs first.

Versions: All servers run CentOS 8 and Elastic Stack 7.6.1.

Logstash nodes:

LOGSTASH-BEATS (used for sending Beats data to Elasticsearch)
LOGSTASH-SYSLOG (used for sending various syslog data sources to Elasticsearch)

Elasticsearch nodes:

All nodes have 32 GB RAM (16 GB dedicated to JVM), 16 CPU cores and 5.5 TB SSD storage. All nodes have all roles:

ES-N1
ES-N2
ES-N3

The issue I'm seeing is that the system load on ES-N3 is far higher than on the other two nodes:

top - 00:28:55 up 23:43,  1 user,  load average: 15.20, 15.01, 14.32
Tasks: 287 total,   2 running, 285 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.1 us,  4.0 sy,  0.0 ni, 55.9 id, 34.5 wa,  0.2 hi,  0.2 si,  0.0 st
MiB Mem :  32001.9 total,   5102.8 free,  18229.2 used,   8669.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  13127.2 avail Mem

...and for node 1:

top - 00:31:15 up 23:46,  1 user,  load average: 4.67, 4.98, 4.85
Tasks: 289 total,   1 running, 288 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.8 us,  1.6 sy,  0.0 ni, 74.4 id,  8.6 wa,  0.3 hi,  0.3 si,  0.0 st
MiB Mem :  32001.9 total,    494.4 free,  18139.6 used,  13367.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  13217.2 avail Mem

...and node 2:

top - 00:33:01 up 23:48,  1 user,  load average: 5.00, 4.99, 5.40
Tasks: 292 total,   1 running, 291 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.7 us,  1.1 sy,  0.0 ni, 81.7 id,  8.1 wa,  0.2 hi,  0.3 si,  0.0 st
MiB Mem :  32001.9 total,    207.7 free,  18355.7 used,  13438.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  12999.9 avail Mem

I can't see a problem with CPU usage itself, but the wa (I/O wait) value on ES-N3 (34.5%) is significantly higher than on the other two nodes, which from what I've read could indicate a storage I/O problem. However, I've used various tools to check the storage, and the reads/writes for the Elasticsearch data partition on node 3 are pretty much the same as on the other two nodes.
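For what it's worth, the kind of per-device check I mean is along these lines (the commands below are only an illustration of the sort of checks, not an exhaustive list; they come from the sysstat and iotop packages, and device names will obviously differ):

# Extended per-device statistics every 5 seconds; compare await and %util
# for the device backing the Elasticsearch data partition on each node.
iostat -dx 5

# Show only the processes that are currently doing I/O.
iotop -o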

The only way I've been able to get the system load back to normal levels is to stop Logstash on LOGSTASH-BEATS, which stops the ingestion of all Windows event log messages.

Stopping all queries does not make any difference to the problem.

I've looked at the hot threads stats and node 3 isn't much different from the other two nodes, and the JVM heap usage looks OK on all nodes.
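For reference, these are the sorts of calls that's based on (the _cat/nodes column list is just one way of eyeballing heap and load per node):

GET /_nodes/hot_threads
GET /_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m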

I just can't see why this particular node is under so much pressure. Can anyone help?


Are Logstash and Beats configured to send data to all nodes?

Are data and shards distributed evenly across the cluster?

Hello Christian, thanks for getting back to me; data and shards seem reasonably balanced to me:

GET /_cat/allocation?v

shards disk.indices disk.used disk.avail disk.total disk.percent host         ip           node
   104        1.9tb     1.9tb      3.5tb      5.4tb           35 10.166.95.83 10.166.95.83    es-n3
   104        1.7tb     1.7tb      3.6tb      5.4tb           32 10.166.95.81 10.166.95.81    es-n1
   105        1.8tb     1.9tb      3.5tb      5.4tb           35 10.166.95.82 10.166.95.82    es-n2

I did wonder if Logstash was hammering node 3 for some reason, so I installed iftop on LOGSTASH-BEATS, which shows it sending data to the nodes pretty much equally.
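(The check was roughly along these lines; the interface name is just a placeholder for whichever NIC faces the Elasticsearch nodes:)

# Per-connection bandwidth, without DNS resolution, on the interface towards the ES nodes.
iftop -n -i ens192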

What does the hot threads API give for that node?

This is the Logstash output in case it makes any difference:

output {
  # All Beats events go to a monthly index named after the Beat and its version.
  elasticsearch {
    hosts => ["https://es-n1:9200", "https://es-n2:9200", "https://es-n3:9200"]
    cacert => '/etc/logstash/certs/elasticsearch-ca.pem'
    user => "${ES_USER}"
    password => "${ES_PWD}"
    manage_template => false
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM}"
  }
  # Selected Windows event IDs are additionally copied to a separate ltwinlogbeat index.
  if [winlog][event_id] in ["4720", "4723", "4724", "4726", "4727", "4728", "4729", "4731", "4730", "4734"] or [winlog][event_data][ObjectClass] == "groupPolicyContainer" or ([winlog][event_id] in ["515", "516"] and [winlog][task] == "ZONE_OP") {
    elasticsearch {
      hosts => ["https://es-n1:9200", "https://es-n2:9200", "https://es-n3:9200"]
      cacert => '/etc/logstash/certs/elasticsearch-ca.pem'
      user => "${ES_USER}"
      password => "${ES_PWD}"
      manage_template => false
      index => "ltwinlogbeat-%{[@metadata][version]}-%{+YYYY.MM}"
    }
  }
}

I couldn't fit the entire hot threads output in, so I've tried to include just the important bits:

GET /_nodes/es-n3/hot_threads

13.4% (67.2ms out of 500ms) cpu usage by thread 'elasticsearch[es-n3][write][T#6]'

..............................

10.8% (54ms out of 500ms) cpu usage by thread 'elasticsearch[es-n3][write][T#16]'

................................

8.9% (44.3ms out of 500ms) cpu usage by thread 'elasticsearch[es-n3][write][T#8]'

.................................
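Since the busy threads are all on the write pool, I guess per-node write thread pool activity and rejections would also be relevant, something like:

GET /_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected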

For what it's worth, I got to the bottom of the problem: it appears some of our VMware hosts have latency issues with their SSD storage, as moving node 3 to a different host had an immediate effect and the system load returned to normal. I had already tried moving the VM before posting here, but I must have moved it to a host that also had SSD latency issues, as it made no difference the first time.
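For anyone who hits something similar: a quick way to sanity-check storage latency from inside the guest, independently of Elasticsearch, is something like ioping against the data path (the path below assumes the default RPM data directory; adjust it to your path.data):

# Ten latency probes against the filesystem holding the Elasticsearch data.
ioping -c 10 /var/lib/elasticsearch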
