Elastic cluster slow down

putharekulu · September 18, 2019, 8:59pm

Hi,

I have a 6-node Elasticsearch cluster on c5.4xlarge instances, each writing to a 5TB EBS volume [10000 IOPS]. We only store 3 days' worth of data and have a nightly job that deletes indices older than 3 days. This has been working well without any issues for a year, indexing 1.5B documents a day into 3 different indices.

My issue:

Since we are using only 1.7-2TB of the 5TB disk space, I decided to create a new 2.5TB volume [7500 IOPS], attach it to each server, and let the cluster take care of the rebalancing. I did this for a couple of servers, one server per day, and moved on to the next one only once the cluster had rebalanced and was GREEN. It has been a day since the cluster last looked balanced with no errors, but I have started seeing a lag of at least an hour in data being written to Elasticsearch. Could this be because of the reduction in provisioned IOPS? That is the only change that has been made.

Let me know if you need any other information.

Thanks,
puth

Christian_Dahlqvist · September 19, 2019, 4:04am

That is possible. What do the disk utilisation and iowait look like on the nodes if you run 'iostat -x'?

putharekulu · September 19, 2019, 1:39pm

During non-peak hours this is what I see. node4 and node5 are the two servers whose disks have been changed from 5TB to 2.5TB.

node1| CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node1) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
11.19 0.00 1.05 4.92 0.00 82.84

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 35.79 1243.80 155.17 56038.88 37467.66 133.68 0.82 0.60 0.31 2.85 0.37 51.39
nvme0n1 0.00 0.07 15.18 0.76 538.38 8.82 68.64 0.01 0.59 0.59 0.70 0.41 0.65

node2| CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node2) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
12.93 0.00 1.15 5.88 0.00 80.04

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 39.94 1361.15 177.00 59852.43 42810.20 133.49 1.43 0.00 0.09 7.43 0.37 57.62
nvme0n1 0.00 0.08 15.44 0.77 441.25 8.64 55.52 0.01 0.51 0.49 0.85 0.36 0.58

node3 | CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node3) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
11.18 0.00 1.06 5.15 0.00 82.61

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 36.11 1264.55 159.26 57192.80 38478.67 134.39 1.32 0.94 0.53 4.19 0.37 52.78
nvme0n1 0.00 0.08 14.04 0.68 465.34 8.17 64.34 0.01 0.55 0.55 0.74 0.38 0.56

node4| CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node4) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
12.60 0.00 1.28 3.55 0.00 82.57

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 38.61 1415.75 202.24 56047.93 49355.41 130.29 6.70 4.15 2.43 16.19 0.29 47.08
nvme0n1 0.00 0.09 12.50 0.70 318.26 8.45 49.49 0.01 0.66 0.64 0.89 0.48 0.63

node5 | CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node5) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
18.03 0.00 1.54 4.74 0.00 75.69

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 55.95 1747.38 260.74 61058.27 63573.05 124.13 9.83 4.90 2.94 18.04 0.28 56.51
nvme0n1 0.00 0.09 16.27 0.71 547.08 8.83 65.48 0.01 0.67 0.65 1.05 0.45 0.76

node6 | CHANGED | rc=0 >>
Linux 3.10.0-957.27.2.el7.x86_64 (node6) 09/19/2019 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
12.24 0.00 1.13 6.33 0.00 80.30

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 39.22 1311.83 168.91 58533.31 40822.02 134.20 1.32 0.90 0.35 5.22 0.38 56.14
nvme0n1 0.00 0.08 16.29 0.69 614.28 8.28 73.33 0.01 0.75 0.73 1.03 0.49 0.84
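
To make the comparison Christian asked about easier to read, here is a minimal sketch that puts the numbers above next to the provisioned IOPS mentioned in the first post. It assumes that nvme1n1 is the Elasticsearch data volume on every node (it carries almost all of the I/O) and that only node4 and node5 are on the 7500 IOPS volume. The function name `summarize` and the dictionary layout are illustrative only. Note that 'iostat -x' run without an interval reports averages since boot, so these figures can hide short peaks.

```python
# Provisioned IOPS per node, as described in the thread (assumption: only
# node4 and node5 were moved to the 2.5TB / 7500 IOPS volume).
PROVISIONED_IOPS = {
    "node1": 10000, "node2": 10000, "node3": 10000,
    "node4": 7500, "node5": 7500, "node6": 10000,
}

# (r/s, w/s, w_await in ms, avgqu-sz) for nvme1n1, copied from the output above.
# Assumption: nvme1n1 is the data volume on each node.
SAMPLES = {
    "node1": (1243.80, 155.17, 2.85, 0.82),
    "node2": (1361.15, 177.00, 7.43, 1.43),
    "node3": (1264.55, 159.26, 4.19, 1.32),
    "node4": (1415.75, 202.24, 16.19, 6.70),
    "node5": (1747.38, 260.74, 18.04, 9.83),
    "node6": (1311.83, 168.91, 5.22, 1.32),
}


def summarize(samples, provisioned):
    """Return per-node total IOPS and its share of the provisioned IOPS."""
    rows = {}
    for node, (reads, writes, w_await, queue) in samples.items():
        iops = reads + writes  # total I/O operations per second
        rows[node] = {
            "iops": iops,
            "pct_of_provisioned": 100.0 * iops / provisioned[node],
            "w_await_ms": w_await,
            "queue": queue,
        }
    return rows


if __name__ == "__main__":
    for node, row in summarize(SAMPLES, PROVISIONED_IOPS).items():
        print(
            f"{node}: {row['iops']:8.2f} IOPS "
            f"({row['pct_of_provisioned']:5.1f}% of provisioned), "
            f"w_await={row['w_await_ms']:5.2f} ms, queue={row['queue']:.2f}"
        )
```

Read this way, the averaged IOPS on every node sit well below the provisioned figure, while node4 and node5 show noticeably higher w_await and avgqu-sz than the other four nodes. That is only suggestive: because these are since-boot averages taken off-peak, a sample collected with an interval during peak traffic would be needed to tell whether the smaller volumes are actually hitting their limit.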

Christian_Dahlqvist · September 19, 2019, 1:42pm

Is that taken while you are having problems?

putharekulu · September 19, 2019, 1:45pm

No, I took it just now during non-peak hours, while the cluster is looking good. I will take it again when we are at peak traffic, usually around 1 in the afternoon, and let you know.

putharekulu · September 20, 2019, 3:21pm

My cluster seems to be fine now after I gave it a day to rebalance. I no longer see any lag.

Admin, you can close the ticket.

system · October 18, 2019, 3:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
