Stats requests to data nodes timing out while ingestion is happening

I've got an on-prem hosted Elastic Stack set up that looks like
[apps that send logs via http]->Logstash Shipper->Kafka->3 Logstash Ingesters->6 Elasticsearch nodes (3 master, 3 data). Everything is on 7.16.2, and this has been happening the same way since I first installed the stack on version 7.7.0. I've noticed that whenever I'm ingesting a large amount of data (basically whenever any of my Kafka topics has a long-running queue), stats requests to the data nodes continuously time out. Here's an example error message from my logs, which are filled with similar messages:

{"type": "server", "timestamp": "2022-05-18T17:17:13,209Z", "level": "WARN", "component": "o.e.c.InternalClusterInfoService", "cluster.name": "<cluster_name>", "node.name": "es-master2", "message": "failed to retrieve stats for node [evqj47plQ2O0j2mAX9UNAQ]: [es-data3][169.254.1.2:9300][cluster:monitor/nodes/stats[n]] request_id [2330765] timed out after [15005ms]", "cluster.uuid": "e3tEvDSqQxWdPPjUccHNNQ", "node.id": "KxB7zYzZSRyxkcknPWc5Pg" }

This affects REST API calls that I make, such as stats and _cat/shards, though notably not _nodes/hot_threads. It also causes the monitoring cluster to be unable to query stats from the data nodes (the graphs just cut off as soon as those nodes get too busy). I also can't use the Index Management screen in Kibana's Stack Management while this is happening (it just times out).
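For reference, these are roughly the calls involved (just a sketch assuming the default HTTP port on localhost; the real hosts differ in my setup):

# Node stats (this is what times out under heavy ingestion)
curl -s 'http://localhost:9200/_nodes/stats?pretty'

# Shard overview (also times out)
curl -s 'http://localhost:9200/_cat/shards?v'

# Hot threads (oddly, this one still responds)
curl -s 'http://localhost:9200/_nodes/hot_threads'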

A few other observations/points:

  1. All of these nodes are on one very large machine and share a single 4TB SSD. The machine has 256GB of memory and 128 cores, so general machine capacity isn't the issue, but it's possible that disk contention is. I'm not sure how to tell whether I'm saturating the disk.
  2. Supporting that theory, even when my Logstash ingesters are cranking through backed-up logs at full pace, they're struggling to keep up with the daily input on some topics, even though CPU usage is pretty low. The hot_threads output has some lines that look like this:
    100.0% [cpu=0.0%, other=100.0%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[es-data2][write][T#60]'
  3. Searching older topics, I see some people had similar-sounding issues because they had far too many shards for their cluster. Ours currently has about 900 shards (quick way I'm counting them below), but this issue has been happening pretty much since I set the stack up on this machine, when there were very few indices/shards.
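The ~900 figure is just from a quick count along these lines (again assuming localhost for illustration):

# Count shards listed by the _cat API (one line per shard; h= suppresses the header)
curl -s 'http://localhost:9200/_cat/shards?h=index' | wc -l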

Any guidance would be very much appreciated!

At a guess, it might be storage speeds, even with your SSD.
Can you run iostat on it to see what the write speeds are like while this issue is happening?
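Something along these lines should do as a rough sketch; sdb is just a placeholder for whatever device backs your data path:

# Per-device throughput, refreshed every 2 seconds
iostat -d sdb 2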


Here's a sample of iostat readings, taken every 2 seconds while this is happening.

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             614.20      4569.60     34578.00      45696     345780

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             727.40      9489.60     31069.60      94896     310696

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             662.20      4889.60     35587.20      48896     355872

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             654.60       858.00     37545.20       8580     375452

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             636.70      1465.60     33340.40      14656     333404

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             598.90      4285.60     32392.80      42856     323928

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             687.30      4144.40     34128.40      41444     341284

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb            1252.10      3045.20     31624.80      30452     316248

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             959.20      6042.40     31581.20      60424     315812

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb            1011.70      3690.00     33173.20      36900     331732

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             750.30      1080.00     35332.80      10800     353328

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             928.10      2685.60     34116.80      26856     341168

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb             739.80       687.60     34478.40       6876     344784

~35MB/s seems a little low.

Interestingly, I did some testing with the dd command. The command I ran was:

dd if=/dev/zero of=/data/test/output bs=100k count=10k; rm -f /data/test/output

With the Elastic Stack down I was seeing ~2.0GB/s. Then I started my whole Elastic Stack except for the LS ingesters (so this includes the LS Shipper, Kafka, and all 6 Elasticsearch instances, with data flowing in from the apps and getting as far as Kafka). With that, I was getting 1.6GB/s. When I start up the LS ingesters, however, I'm only getting 8.4MB/s. So it does seem like some kind of disk saturation, though it seems weird that iostat isn't quite capturing it. Admittedly I don't know much about SSD/general storage mechanics... Is there some other resource that might be getting saturated, causing the severely lowered write speeds here?

Can you run the extended stats, e.g. iostat -x, and share the output?
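Roughly like this (same placeholder device as before); the extended report adds the queue-depth, await and %util columns, which is what I'm after here:

# Extended device statistics every 2 seconds
iostat -dx sdb 2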

It is worth noting that Elasticsearch does tend to perform a lot of smaller reads and writes (with fsync), and not large sequential ones like the one you simulate using dd. Your test is therefore not really representative of an Elasticsearch workload.
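If you want a synthetic test that is closer to that pattern, something like fio with small, synced random writes is a better approximation than a single large sequential dd. This is only an illustrative sketch; the path, sizes and job count are placeholders to adjust for your setup:

# Small random writes with an fsync after every write,
# loosely mimicking an indexing-heavy workload
fio --name=es-like --directory=/data/test --rw=randwrite --bs=4k \
    --size=1g --numjobs=4 --fsync=1 --runtime=60 --time_based --group_reporting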


Here are the results with the extended flag:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.10   374.00  127.40  596.30  3830.40 30474.80    94.81   294.72  401.20   30.07  480.49   1.38 100.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.10     4.00  455.70  442.60  3325.20 27080.80    67.70   315.87  334.91   61.17  616.76   1.11 100.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     4.20  734.50  499.40  4312.80 30105.20    55.79   331.71  281.50   63.89  601.55   0.81 100.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.10  1018.70  251.40  590.20 11798.80 32866.00   106.14   265.69  323.49   25.55  450.40   1.19 100.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.10   689.60  277.60  656.30 11113.20 33378.00    95.28   268.11  283.00   33.37  388.59   1.07  99.99

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00   452.00  747.50  489.90  7261.20 25270.80    52.58   300.16  243.98   50.12  539.79   0.81 100.02

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.10   606.90  651.40  510.20  6527.20 24440.80    53.32   304.40  259.86   55.76  520.45   0.86  99.99

Gotcha, that's interesting. When I try running dd with conv=fdatasync (roughly the command below, going off the "How to check hard disk performance" thread on Ask Ubuntu), it seems like I max out around 40 MB/s, which makes more sense given the numbers we're seeing in iostat. Seems like I'm gonna need to get some more disks!
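For reference, the variant I ran was along these lines (same test file as before):

# Same write test, but flush data to disk before dd reports the rate
dd if=/dev/zero of=/data/test/output bs=100k count=10k conv=fdatasync; rm -f /data/test/output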

I'm still somewhat curious, though: any idea why this affects stats and other REST API queries, but doesn't really cause any issues with regular data queries in Kibana?

What kind of disks are you using? The stats do not look great and they do look heavily loaded.
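If you have shell access on the box, something like lsblk will usually show the model string without needing to ask around (the device list will obviously differ on your machine):

# Model, size and rotational flag for each physical disk
lsblk -d -o NAME,MODEL,SIZE,ROTA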

It might also be useful if you could share the full output of the cluster stats API.
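That is, something like this (localhost assumed again):

# Cluster-wide stats with human-readable units
curl -s 'http://localhost:9200/_cluster/stats?human&pretty'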

Not sure exactly on the disk type; I can ping some people who would know, though.

Here's a pastebin of my cluster stats:

(Notably, this is with my LS ingesters not running; when they're running the API call just spins forever.)

Edit: apparently the drive is a Samsung SSD 860, in a RAID1 array with an identical drive.

Just for some resolution on this: we split each ES data node out onto its own SSD, gave Kafka its own as well, and everything is working smoothly now. Thanks all!
