Elasticsearch cluster slowdown

Hi,

I have a 6-node Elasticsearch cluster running on c5.4xlarge instances, each writing to a 5TB EBS volume [10,000 IOPS]. We only store 3 days' worth of data and have a nightly job that deletes indexes older than 3 days. This has been working well without any issues for about a year, indexing 1.5B documents a day into 3 different indexes.
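For reference, the nightly job boils down to deleting the date-suffixed indexes via the standard delete-index API; a minimal sketch (the index names, naming pattern, and endpoint below are placeholders, not our exact setup):

# Hypothetical nightly cron job: delete the indexes whose date suffix is 3 days old.
# Assumes date-suffixed names like logs-YYYY.MM.DD and a local HTTP endpoint.
OLD_DATE=$(date -d '-3 days' +%Y.%m.%d)
for idx in logs-"$OLD_DATE" metrics-"$OLD_DATE" events-"$OLD_DATE"; do
  curl -s -X DELETE "http://localhost:9200/${idx}"
done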

My issue:

Since we are using only 1.7-2TB of the 5TB disk space, I decided to create a new 2.5TB volume [7,500 IOPS], attach it to the servers, and let the cluster take care of rebalancing. I did this for a couple of servers, one server per day: once the cluster was balanced and all GREEN, I moved on to the next one. It has been a day since the cluster looked balanced with no errors, but I have started seeing data written to Elasticsearch lag by at least an hour. Could this be because of the reduction in IOPS? That is the only change that has been made.
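In case it matters, this is roughly how I watched the rebalance before moving on to the next server; a minimal sketch using the standard cluster/cat APIs (the endpoint is a placeholder):

curl -s "http://localhost:9200/_cluster/health?pretty"           # wait for "status" : "green"
curl -s "http://localhost:9200/_cat/allocation?v"                 # disk used/available and shard count per node
curl -s "http://localhost:9200/_cat/recovery?active_only=true&v"  # shards still relocating between nodes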

Let me know if you need any other information.

Thanks,
puth

That is possible. What do the disk utilisation and iowait look like on the nodes if you run ‘iostat -x’?
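For example, something like the following on each node, or across all of them at once with an Ansible ad-hoc command (which is what the output below appears to come from; the inventory group name is a placeholder):

iostat -x                                 # extended per-device statistics on a single node
ansible esnodes -m shell -a 'iostat -x'   # same command on every node at once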

During non-peak hours this is what I see. node4 and node5 are the two servers whose disks have been changed from 5TB to 2.5TB.

node1 | CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node1) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
11.19 0.00 1.05 4.92 0.00 82.84

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 35.79 1243.80 155.17 56038.88 37467.66 133.68 0.82 0.60 0.31 2.85 0.37 51.39
nvme0n1 0.00 0.07 15.18 0.76 538.38 8.82 68.64 0.01 0.59 0.59 0.70 0.41 0.65

node2 | CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node2) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
12.93 0.00 1.15 5.88 0.00 80.04

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 39.94 1361.15 177.00 59852.43 42810.20 133.49 1.43 0.00 0.09 7.43 0.37 57.62
nvme0n1 0.00 0.08 15.44 0.77 441.25 8.64 55.52 0.01 0.51 0.49 0.85 0.36 0.58

node3 | CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node3) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
11.18 0.00 1.06 5.15 0.00 82.61

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 36.11 1264.55 159.26 57192.80 38478.67 134.39 1.32 0.94 0.53 4.19 0.37 52.78
nvme0n1 0.00 0.08 14.04 0.68 465.34 8.17 64.34 0.01 0.55 0.55 0.74 0.38 0.56

node4 | CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node4) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
12.60 0.00 1.28 3.55 0.00 82.57

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 38.61 1415.75 202.24 56047.93 49355.41 130.29 6.70 4.15 2.43 16.19 0.29 47.08
nvme0n1 0.00 0.09 12.50 0.70 318.26 8.45 49.49 0.01 0.66 0.64 0.89 0.48 0.63

node5 | CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node5) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
18.03 0.00 1.54 4.74 0.00 75.69

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 55.95 1747.38 260.74 61058.27 63573.05 124.13 9.83 4.90 2.94 18.04 0.28 56.51
nvme0n1 0.00 0.09 16.27 0.71 547.08 8.83 65.48 0.01 0.67 0.65 1.05 0.45 0.76

node6 | CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node6) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
12.24 0.00 1.13 6.33 0.00 80.30

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 39.22 1311.83 168.91 58533.31 40822.02 134.20 1.32 0.90 0.35 5.22 0.38 56.14
nvme0n1 0.00 0.08 16.29 0.69 614.28 8.28 73.33 0.01 0.75 0.73 1.03 0.49 0.84

Was that taken while you were having problems?

No, I took it just now during non-peak hours when the cluster is looking good. I will take it again when we are at our peak traffic, usually around 1 in the afternoon, and let you know.

My cluster seems to be good now after giving it a day to rebalance. I don't see any lag anymore.

Admin, you can close the ticket.
