Hello, I'll detail my cluster specs first.
Versions: all servers run CentOS 8 and Elastic Stack 7.6.1.
Logstash nodes:
LOGSTASH-BEATS (used for sending Beats data to Elasticsearch)
LOGSTASH-SYSLOG (used for sending various syslog data sources to Elasticsearch)
Elasticsearch nodes:
Each node has 32 GB RAM (16 GB dedicated to the JVM heap), 16 CPU cores, and 5.5 TB of SSD storage. All nodes have all roles (a quick way to confirm this is sketched after the list):
ES-N1
ES-N2
ES-N3
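For completeness, the roles and relative load can be compared across the three nodes with the _cat/nodes API, something like this (a sketch only; I'm assuming the default HTTP port 9200 and no authentication on the endpoint):

# Compare roles, CPU and load averages across the nodes
curl -s 'http://ES-N1:9200/_cat/nodes?v&h=name,node.role,cpu,load_1m,load_5m,load_15m'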
The issue I'm seeing is that the system load on ES-N3 is far higher than on the other two nodes:
top - 00:28:55 up 23:43, 1 user, load average: 15.20, 15.01, 14.32
Tasks: 287 total, 2 running, 285 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.1 us, 4.0 sy, 0.0 ni, 55.9 id, 34.5 wa, 0.2 hi, 0.2 si, 0.0 st
MiB Mem : 32001.9 total, 5102.8 free, 18229.2 used, 8669.9 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 13127.2 avail Mem
...and for node 1:
top - 00:31:15 up 23:46, 1 user, load average: 4.67, 4.98, 4.85
Tasks: 289 total, 1 running, 288 sleeping, 0 stopped, 0 zombie
%Cpu(s): 14.8 us, 1.6 sy, 0.0 ni, 74.4 id, 8.6 wa, 0.3 hi, 0.3 si, 0.0 st
MiB Mem : 32001.9 total, 494.4 free, 18139.6 used, 13367.9 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 13217.2 avail Mem
...and node 2:
top - 00:33:01 up 23:48, 1 user, load average: 5.00, 4.99, 5.40
Tasks: 292 total, 1 running, 291 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.7 us, 1.1 sy, 0.0 ni, 81.7 id, 8.1 wa, 0.2 hi, 0.3 si, 0.0 st
MiB Mem : 32001.9 total, 207.7 free, 18355.7 used, 13438.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 12999.9 avail Mem
I can't see a problem with the CPU usage itself, but the wa (I/O wait) value on ES-N3 (34.5%) is significantly higher than on the other two nodes, which from what I've read could indicate a storage I/O problem. However, I've used various tools to check the storage, and the reads/writes on the Elasticsearch data partition on node 3 are pretty much the same as on the other two nodes.
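For reference, the sort of storage comparison I mean is along these lines, run on each node (a sketch; sysstat has to be installed, and the _cat call again assumes port 9200 without auth):

# Extended per-device stats (utilisation, await, throughput), three samples at 5-second intervals
iostat -x -d 5 3

# Per-process disk I/O, to confirm it's the Elasticsearch process doing the reads/writes
pidstat -d 5 3

# Shard count and disk usage per node, to rule out an obviously unbalanced data distribution
curl -s 'http://ES-N1:9200/_cat/allocation?v'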
The only way I've been able to get the system load back to normal levels is to stop Logstash on LOGSTASH-BEATS, which stops the ingestion of all Windows event log messages.
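In case it's relevant, the shard layout for those indices can be checked like this (a sketch; the winlogbeat-* pattern is an assumption on my part, adjust it to whatever the Beats pipeline actually writes to):

# List the Beats index shards, largest first, with the node each one sits on
curl -s 'http://ES-N1:9200/_cat/shards/winlogbeat-*?v&h=index,shard,prirep,store,node&s=store:desc'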
Stopping all queries does not make any difference to the problem.
I've looked at the hot threads output and node 3 isn't much different from the other two nodes, and the JVM heap usage looks OK on all nodes.
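For anyone who wants to see what I'm comparing, these are the sorts of calls I mean (sketch, again assuming port 9200 is reachable without auth):

# Hot threads snapshot for all nodes
curl -s 'http://ES-N1:9200/_nodes/hot_threads'

# Heap usage per node, and pressure on the write/search thread pools
curl -s 'http://ES-N1:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'
curl -s 'http://ES-N1:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'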
I just can't see why this particular node is under so much pressure. Can anyone help?