Hi,
We've had an issue show itself in recent days on one of our clusters. The high-level behaviour is as follows:
- Ingestion rate has collapsed
- 504 errors reported by Logstash
- One node showing exceptionally high load average but very low CPU utilisation
- Many unassigned shards
- Very, very slow shard recovery (the calls I can run to quantify this are listed just after this summary)
- Cluster is generally idling, based on CPU metrics
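For reference, these are the kinds of calls I can run and share output from if useful; the column selection is just my guess at what's relevant here:
GET _cat/health?v
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state
GET _cat/recovery?v&active_only=true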
Here is an excerpt of hot_threads output. Note that it appears to show nothing at all for the node in question:
::: {localhost}{aM-i6k0mTl-otjyi7QDPkw}{xS8UQvm3TIOe0nKXjhTyGw}{xxx.xxx.16.139}{xxx.xxx.16.139:9300}{cdhilmrstw}{aws_availability_zone=eu-west-1a, ml.machine_memory=66206564352, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}
Hot threads at 2021-04-19T11:19:24.846Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
::: {localhost}{c_tUDkOiRN-vtPw-f_yOaw}{UCJaYfDMTQ-Mu_EOs-JGBQ}{xxx.xxx.17.80}{xxx.xxx.17.80:9300}{cdhilmrstw}{aws_availability_zone=eu-west-1a, ml.machine_memory=66206564352, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=33285996544}
Hot threads at 2021-04-19T11:19:24.845Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
19.6% (97.9ms out of 500ms) cpu usage by thread 'elasticsearch[localhost][write][T#13]'
2/10 snapshots sharing following 45 elements
app//org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:168)
app//org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:140)
org.elasticsearch.ingest.common.ScriptProcessor.execute(ScriptProcessor.java:89)
app//org.elasticsearch.ingest.Processor.execute(Processor.java:41)
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:131)
app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:150)
app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$7593/0x0000000801d64918.accept(Unknown Source)
app//org.elasticsearch.ingest.Processor.execute(Processor.java:46)
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:131)
app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:150)
app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$7593/0x0000000801d64918.accept(Unknown Source)
app//org.elasticsearch.ingest.Processor.execute(Processor.java:46)
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:131)
app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:150)
app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$7593/0x0000000801d64918.accept(Unknown Source)
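Since the node in question reports no busy threads above, I can also re-capture hot threads with idle threads included and a larger thread count, in case whatever it is doing is being filtered out as idle; something along these lines, using the node ID from the output above:
GET _nodes/aM-i6k0mTl-otjyi7QDPkw/hot_threads?ignore_idle_threads=false&threads=9999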
I see the following logged by the master node:
[2021-04-19T13:12:01,164][INFO ][o.e.c.c.C.CoordinatorPublication] [localhost] after [10s] publication of cluster state version [1112461] is still waiting for {localhost}{aM-i6k0mTl-otjyi7QDPkw}{xS8UQvm3TIOe0nKXjhTyGw}{xxx.xxx.16.139}{xxx.xxx.16.139:9300}{cdhilmrstw}{aws_availability_zone=eu-west-1a, ml.machine_memory=66206564352, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true} [SENT_APPLY_COMMIT]
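While the publication is stalled like this I can also capture the master's pending cluster tasks, e.g.:
GET _cat/pending_tasks?v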
Here is a sample of top output from the node in question:
top - 14:08:12 up 48 days, 2:32, 1 user, load average: 20.69, 18.15, 16.24
Tasks: 227 total, 1 running, 125 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.4 sy, 0.0 ni, 24.7 id, 74.9 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 64654848 total, 1144444 free, 36197708 used, 27312696 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 27653688 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2184 elastic+ 20 0 96.8g 36.6g 2.8g S 0.0 59.3 37832:37 java
2554 kibana 20 0 1175496 331068 27124 S 0.0 0.5 341:15.01 node
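The 74.9 wa figure suggests the load average is coming from I/O wait rather than CPU, so the next thing I plan to grab on that node is per-device disk stats, along the lines of the following (assuming sysstat is available on the host):
iostat -x 1 5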
I tried excluding the node that was showing this behaviour from the cluster, but the issue moved with the shards to the node shown above. Given that hot_threads doesn't seem to highlight anything, I suspect something to do with the shards themselves. However, cluster health is only yellow, so the unassigned shards are replicas that should be recoverable from their primaries.
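For completeness, the exclusion was done via the usual allocation filtering setting, roughly the following (the placeholder stands in for the originally affected node's address, which I haven't included here):
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "<ip-of-affected-node>"
  }
}
To dig into why the replicas aren't being assigned promptly, the next call I plan to capture is an allocation explanation for one of the unassigned shards:
GET _cluster/allocation/explain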
Note that prior to this issue the cluster had been stable and a solid performer, and disk space is not an issue. I'd appreciate some input on how to get a better handle on this problem.
Thx
D