I've got an on-prem Elastic Stack set up that looks like
[apps that send logs via HTTP] -> Logstash shipper -> Kafka -> 3 Logstash ingesters -> 6 Elasticsearch nodes (3 master, 3 data). Everything is on 7.16.2, and the behavior has been the same since I first installed the stack on 7.7.0. I've noticed that whenever I'm ingesting a large amount of data (basically whenever any of my Kafka topics has a long-running backlog), stats requests to the data nodes continuously time out. Here's an example error message from my logs, which are filled with similar messages:
{"type": "server", "timestamp": "2022-05-18T17:17:13,209Z", "level": "WARN", "component": "o.e.c.InternalClusterInfoService", "cluster.name": "<cluster_name>", "node.name": "es-master2", "message": "failed to retrieve stats for node [evqj47plQ2O0j2mAX9UNAQ]: [es-data3][169.254.1.2:9300][cluster:monitor/nodes/stats[n]] request_id [2330765] timed out after [15005ms]", "cluster.uuid": "e3tEvDSqQxWdPPjUccHNNQ", "node.id": "KxB7zYzZSRyxkcknPWc5Pg" }
This affects REST API calls I make, such as stats and _cat/shards, though notably not _nodes/hot_threads. It also leaves the monitoring cluster unable to query stats from the data nodes (the graphs just cut off as soon as those nodes get too busy), and I can't use the Index Management screen in Kibana's Stack Management while this is happening (it just times out).
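For concreteness, these are roughly the calls I'm comparing while ingest is backed up (host/port are just placeholders):
curl -s "http://localhost:9200/_nodes/stats"        # times out
curl -s "http://localhost:9200/_cat/shards?v"       # times out
curl -s "http://localhost:9200/_nodes/hot_threads"  # still responds fine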
A few other observations/points:
All of these nodes are on one very large machine and share a 4TB SSD. The machine has 256GB of memory and 128 cores, so overall machine capacity is not an issue, but it's possible that disk contention is? I'm not sure how to tell whether I'm saturating the disk.
Supporting that theory: even when my Logstash ingesters are cranking through backed-up logs at full pace, they struggle to keep up with the daily input on some topics, even though CPU usage is pretty low. hot_threads has some lines that look like this:
100.0% [cpu=0.0%, other=100.0%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[es-data2][write][T#60]'
Searching older topics for similar issues, I see some people had problems that sounded similar because they had far too many shards for their cluster. Our stack currently has about 900 shards, but this issue has been happening pretty much since I set the stack up on this machine, when there were very few indices/shards.
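(For reference, the shard count above comes from a quick check along these lines; the host is a placeholder:)
curl -s "http://localhost:9200/_cluster/health?pretty"   # look at active_shards / active_primary_shards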
At a guess, it might be storage speed, even with your SSD.
Can you run iostat on it to see what the write speeds are like when this issue is happening?
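Something simple like this, left running while the ingesters are busy, is enough to start with (the interval is arbitrary):
iostat -d -m 5   # per-device MB/s read and written, sampled every 5 seconds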
With the Elastic Stack down I was seeing ~2.0 GB/s. Then I started the whole stack except for the Logstash ingesters (so the Logstash shipper, Kafka, and all 6 Elasticsearch instances, with data flowing in from the apps and getting as far as Kafka). With that, I was getting 1.6 GB/s. When I start up the Logstash ingesters, however, I'm only getting 8.4 MB/s. So it does seem like some kind of disk saturation, though it seems weird that iostat isn't quite capturing it. Admittedly I don't know much about SSD/general storage mechanics... is there some other resource that might be getting saturated, causing the severely lowered write speeds here?
Can you run the extended stats, e.g. iostat -x, and share the output?
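Something like this, captured while the ingesters are running, would be most useful (interval and count are arbitrary, and column names vary a bit between sysstat versions):
iostat -x 5 3   # watch w/s, wkB/s, w_await and %util for the shared SSD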
It is worth noting that Elasticsearch does tend to perform a lot of smaller reads and writes (with fsync), and not large sequential ones like the one you simulate using dd. Your test is therefore not really representative of an Elasticsearch workload.
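For a rough feel of the difference (the target path, block sizes and counts below are arbitrary), you can compare a plain buffered run against ones that actually force data to disk:
dd if=/dev/zero of=/path/to/testfile bs=1M count=1024                  # buffered sequential write, flattering numbers
dd if=/dev/zero of=/path/to/testfile bs=1M count=1024 conv=fdatasync   # same, but flushed to disk once at the end
dd if=/dev/zero of=/path/to/testfile bs=4k count=10000 oflag=dsync     # syncs every 4k block, closer to fsync-heavy behavior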
Gotcha, that's interesting. When I try running dd with conv=fdatasync (just going off what I'm finding in "How to check hard disk performance" on Ask Ubuntu), it seems like I max out around 40 MB/s, which makes more sense considering the numbers we're seeing in iostat. Seems like I'm gonna need to get some more disks!
I'm still somewhat curious, though: any idea why this would affect stats and other REST API queries, but not really cause any issues with regular data queries in Kibana?
Just for some resolution on this: we split each ES data node onto its own SSD, gave Kafka its own as well, and everything is working smoothly now. Thanks all!
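(In case it helps anyone else, the Elasticsearch side of that change was just pointing each data node's path.data at its own mount; the mount points below are illustrative:)
path.data: /mnt/ssd1   # es-data1's elasticsearch.yml
path.data: /mnt/ssd2   # es-data2's elasticsearch.yml
path.data: /mnt/ssd3   # es-data3's elasticsearch.yml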