Hi,
We've had an issue show itself in recent days on one of our clusters. The high-level behaviour is as follows:
- Ingestion rate has collapsed
- 504 errors reported by Logstash
- One node showing exceptionally high load average but very low CPU utilisation
- Many unassigned shards
- Very, very slow shard recovery (the calls I can run to quantify this are listed just after this summary)
- Cluster is generally idling, based on CPU metrics
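For reference, these are the kinds of calls I can run and share output from if useful; the column selection is just my guess at what's relevant here:
GET _cat/health?v
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state
GET _cat/recovery?v&active_only=true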
Here is an excerpt of hot_threads output. Note that it appears to show nothing at all for the node in question:
::: {localhost}{aM-i6k0mTl-otjyi7QDPkw}{xS8UQvm3TIOe0nKXjhTyGw}{xxx.xxx.16.139}{xxx.xxx.16.139:9300}{cdhilmrstw}{aws_availability_zone=eu-west-1a, ml.machine_memory=66206564352, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}
Hot threads at 2021-04-19T11:19:24.846Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
::: {localhost}{c_tUDkOiRN-vtPw-f_yOaw}{UCJaYfDMTQ-Mu_EOs-JGBQ}{xxx.xxx.17.80}{xxx.xxx.17.80:9300}{cdhilmrstw}{aws_availability_zone=eu-west-1a, ml.machine_memory=66206564352, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=33285996544}
Hot threads at 2021-04-19T11:19:24.845Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
19.6% (97.9ms out of 500ms) cpu usage by thread 'elasticsearch[localhost][write][T#13]'
2/10 snapshots sharing following 45 elements
app//org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:168)
app//org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:140)
org.elasticsearch.ingest.common.ScriptProcessor.execute(ScriptProcessor.java:89)
app//org.elasticsearch.ingest.Processor.execute(Processor.java:41)
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:131)
app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:150)
app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$7593/0x0000000801d64918.accept(Unknown Source)
app//org.elasticsearch.ingest.Processor.execute(Processor.java:46)
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:131)
app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:150)
app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$7593/0x0000000801d64918.accept(Unknown Source)
app//org.elasticsearch.ingest.Processor.execute(Processor.java:46)
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:131)
app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:150)
app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$7593/0x0000000801d64918.accept(Unknown Source)
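Since the node in question reports no busy threads above, I can also re-capture hot threads with idle threads included and a larger thread count, in case whatever it is doing is being filtered out as idle; something along these lines, using the node ID from the output above:
GET _nodes/aM-i6k0mTl-otjyi7QDPkw/hot_threads?ignore_idle_threads=false&threads=9999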
I see the following logged by the master node:
[2021-04-19T13:12:01,164][INFO ][o.e.c.c.C.CoordinatorPublication] [localhost] after [10s] publication of cluster state version [1112461] is still waiting for {localhost}{aM-i6k0mTl-otjyi7QDPkw}{xS8UQvm3TIOe0nKXjhTyGw}{xxx.xxx.16.139}{xxx.xxx.16.139:9300}{cdhilmrstw}{aws_availability_zone=eu-west-1a, ml.machine_memory=66206564352, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true} [SENT_APPLY_COMMIT]
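While the publication is stalled like this I can also capture the master's pending cluster tasks, e.g.:
GET _cat/pending_tasks?v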
Here is a sample of top output from the node in question:
top - 14:08:12 up 48 days, 2:32, 1 user, load average: 20.69, 18.15, 16.24
Tasks: 227 total, 1 running, 125 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.4 sy, 0.0 ni, 24.7 id, 74.9 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 64654848 total, 1144444 free, 36197708 used, 27312696 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 27653688 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2184 elastic+ 20 0 96.8g 36.6g 2.8g S 0.0 59.3 37832:37 java
2554 kibana 20 0 1175496 331068 27124 S 0.0 0.5 341:15.01 node
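The 74.9 wa figure suggests the load average is coming from I/O wait rather than CPU, so the next thing I plan to grab on that node is per-device disk stats, along the lines of the following (assuming sysstat is available on the host):
iostat -x 1 5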
I tried excluding the node that was showing this behaviour from the cluster, but the issue moved with the shards to the node shown above. Given that hot_threads doesn't seem to highlight anything, I suspect something to do with the shards themselves. However, cluster health is only yellow, so the unassigned shards are replicas that should be recoverable from their primaries.
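For completeness, the exclusion was done via the usual allocation filtering setting, roughly the following (the placeholder stands in for the originally affected node's address, which I haven't included here):
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "<ip-of-affected-node>"
  }
}
To dig into why the replicas aren't being assigned promptly, the next call I plan to capture is an allocation explanation for one of the unassigned shards:
GET _cluster/allocation/explain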
Note that prior to this issue the cluster had been stable and a solid performer, and disk space is not an issue. I'd appreciate some input on how to get a better handle on this problem.
Thx
D