Performance Issue with Elasticsearch 7.11.1 Cluster

Hi,
We've had an issue show itself in recent days with one of our clusters. The high-level behaviour is as follows (the checks I used to observe this are sketched after the list):

  • Ingestion rate has collapsed
  • 504 errors reported by Logstash
  • One node showing exceptionally high load average but very low CPU utilisation
  • Many unassigned shards
  • Very, very slow shard recovery
  • Cluster is generally idling based on CPU metrics
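
For reference, these are roughly the checks I'm using to confirm the above (the localhost:9200 endpoint is a placeholder for our actual hosts):

    # Overall health, including the unassigned shard count
    curl -s 'http://localhost:9200/_cluster/health?pretty'

    # Which shards are unassigned, and why
    curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason' | grep -v STARTED

    # Recoveries currently in flight and how far along they are
    curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true'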

Here is an excerpt of the hot_threads output. Note that hot_threads appears to show nothing at all for the node in question:

::: {localhost}{aM-i6k0mTl-otjyi7QDPkw}{xS8UQvm3TIOe0nKXjhTyGw}{xxx.xxx.16.139}{xxx.xxx.16.139:9300}{cdhilmrstw}{aws_availability_zone=eu-west-1a, ml.machine_memory=66206564352, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}
   Hot threads at 2021-04-19T11:19:24.846Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {localhost}{c_tUDkOiRN-vtPw-f_yOaw}{UCJaYfDMTQ-Mu_EOs-JGBQ}{xxx.xxx.17.80}{xxx.xxx.17.80:9300}{cdhilmrstw}{aws_availability_zone=eu-west-1a, ml.machine_memory=66206564352, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=33285996544}
   Hot threads at 2021-04-19T11:19:24.845Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   19.6% (97.9ms out of 500ms) cpu usage by thread 'elasticsearch[localhost][write][T#13]'
     2/10 snapshots sharing following 45 elements
       app//org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:168)
       app//org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:140)
       org.elasticsearch.ingest.common.ScriptProcessor.execute(ScriptProcessor.java:89)
       app//org.elasticsearch.ingest.Processor.execute(Processor.java:41)
       app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:131)
       app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:150)
       app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$7593/0x0000000801d64918.accept(Unknown Source)
       app//org.elasticsearch.ingest.Processor.execute(Processor.java:46)
       app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:131)
       app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:150)
       app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$7593/0x0000000801d64918.accept(Unknown Source)
       app//org.elasticsearch.ingest.Processor.execute(Processor.java:46)
       app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:131)
       app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:150)
       app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$7593/0x0000000801d64918.accept(Unknown Source)
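
For completeness, this is roughly the call behind the excerpt above (threads and interval match the header; node name is anonymised, so treat it as a placeholder):

    # Hot threads across all nodes
    curl -s 'http://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms'

    # Or scoped to just the quiet node
    curl -s 'http://localhost:9200/_nodes/<node-name>/hot_threads?threads=3&interval=500ms'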

I see the following logged by the master node:

[2021-04-19T13:12:01,164][INFO ][o.e.c.c.C.CoordinatorPublication] [localhost] after [10s] publication of cluster state version [1112461] is still waiting for {localhost}{aM-i6k0mTl-otjyi7QDPkw}{xS8UQvm3TIOe0nKXjhTyGw}{xxx.xxx.16.139}{xxx.xxx.16.139:9300}{cdhilmrstw}{aws_availability_zone=eu-west-1a, ml.machine_memory=66206564352, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true} [SENT_APPLY_COMMIT]
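
In case it helps, this is how I've been checking whether cluster state publication is backing up behind that node (assuming these are the right things to look at):

    # Pending cluster state updates queued on the master
    curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'

    # Load and heap per node, to correlate with the node named in the log line
    curl -s 'http://localhost:9200/_cat/nodes?v&h=name,ip,load_1m,load_5m,heap.percent,cpu'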

Here is a sample of top output from the node in question:

top - 14:08:12 up 48 days,  2:32,  1 user,  load average: 20.69, 18.15, 16.24
Tasks: 227 total,   1 running, 125 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.4 sy,  0.0 ni, 24.7 id, 74.9 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 64654848 total,  1144444 free, 36197708 used, 27312696 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 27653688 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2184 elastic+  20   0   96.8g  36.6g   2.8g S   0.0 59.3  37832:37 java
 2554 kibana    20   0 1175496 331068  27124 S   0.0  0.5 341:15.01 node
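
Given the 74.9 wa above, this is roughly how I'm digging into the I/O side on that node (iostat is from the sysstat package; treat this as a sketch rather than anything conclusive):

    # Per-device utilisation and wait times, sampled every 5 seconds
    iostat -xm 5

    # Threads stuck in uninterruptible sleep (D state), which would explain high load without CPU usage
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'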

I tried excluding the node that was showing this behaviour from the cluster, but the issue moved with the shards to the node shown above. So, given that hot_threads doesn't seem to be highlighting anything, I suspect something to do with the shards themselves. However, the cluster state is yellow, so the unassigned shards should be recoverable from their primaries.
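
For reference, the exclusion was done with allocation filtering, roughly like this (IP anonymised as in the output above), and I've been running allocation explain against the unassigned shards:

    # Move shards off the suspect node
    curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
      -H 'Content-Type: application/json' \
      -d '{"transient": {"cluster.routing.allocation.exclude._ip": "xxx.xxx.16.139"}}'

    # Ask the cluster why a shard is currently unassigned
    curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'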

Note that prior to this issue the cluster had been stable and a solid performer, and disk space is not an issue. I'd appreciate some input on how to get a better handle on this problem.

Thx
D
