I need help with the following. We have a cluster of 7 machines handling about 300,000,000 documents every day. At some points during the day (not constantly), the iowait on 1 or 2 nodes in the cluster jumps to 60% and we start to see delays in processing records. It happens at random times and can hit any node. Each node has 8 TB of EBS SSD disks under LVM.
Below are the iostat output from an affected node and the ES configuration.
I would appreciate any help, since I am currently in the dark and can't work out what is happening.
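Since the spikes are intermittent, it may help to timestamp them so they can be lined up with the Elasticsearch logs. The following is a minimal sketch, assuming a Linux node where the first line of /proc/stat carries the aggregate CPU counters in the usual order (user, nice, system, idle, iowait, irq, softirq, steal). The function names, the 50% threshold and the 5 second interval are illustrative choices, not values taken from the cluster.

```python
import time

def read_cpu_counters(path="/proc/stat"):
    """Return the aggregate CPU counters from the first line of /proc/stat."""
    with open(path) as f:
        fields = f.readline().split()
    # fields[0] is the literal "cpu"; the rest are cumulative tick counts.
    return [int(x) for x in fields[1:]]

def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Assumed column order: user nice system idle iowait irq softirq steal.
    # Guest columns (if present) are already counted in user/nice, so skip them.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0

def watch(threshold=50.0, interval=5.0):
    """Print a timestamped line whenever iowait exceeds the threshold."""
    prev = read_cpu_counters()
    while True:
        time.sleep(interval)
        cur = read_cpu_counters()
        pct = iowait_percent(prev, cur)
        if pct >= threshold:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), f"iowait {pct:.1f}%")
        prev = cur
```

Calling `watch()` on each data node would run until interrupted and log only the intervals in which iowait crosses the threshold.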
IOSTAT
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.05    0.00    2.55   63.09    0.13   26.18

Device:   rrqm/s  wrqm/s       r/s     w/s     rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
xvdap1      0.00   15.00     25.00    6.00     424.00  184.00    19.61     0.00   0.13   0.13   0.40
xvdc        0.00    0.00      0.00    0.00       0.00    0.00     0.00     0.00   0.00   0.00   0.00
xvdb        0.00    0.00      0.00    0.00       0.00    0.00     0.00     0.00   0.00   0.00   0.00
dm-0        0.00    0.00      0.00    0.00       0.00    0.00     0.00     0.00   0.00   0.00   0.00
xvdf      141.00    0.00   1100.00    0.00   29024.00    0.00    26.39     6.48   5.71   0.77  84.80
xvdg      160.00    0.00   1157.00    0.00   30304.00    0.00    26.19     6.90   5.87   0.81  93.20
xvdh      154.00    0.00   1154.00    0.00   29840.00    0.00    25.86     6.77   5.77   0.76  87.20
xvdi      159.00    0.00   1145.00    0.00   29880.00    0.00    26.10     6.32   5.48   0.74  85.20
xvdj      154.00    0.00   1135.00    0.00   29448.00    0.00    25.95     7.16   6.19   0.73  82.40
xvdk      151.00    0.00   1103.00    0.00   28928.00    0.00    26.23     5.91   5.28   0.77  85.20
xvdl      150.00    0.00   1058.00    0.00   28912.00    0.00    27.33     4.36   4.00   0.81  85.20
xvdm      150.00    0.00   1112.00    0.00   29424.00    0.00    26.46     5.92   5.19   0.83  92.80
dm-1        0.00    0.00  10171.00    0.00  237216.00    0.00    23.32    67.92   6.54   0.10 100.00
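To make the iostat sample easier to read, here is a small sketch that totals the busy devices. It assumes that xvdf through xvdm are the eight EBS volumes behind the LVM device dm-1 (the output does not say so explicitly, but the per-volume totals add up to roughly the dm-1 figures) and that iostat sectors are 512 bytes. The names `parse` and `summary` are illustrative.

```python
# Device rows copied from the iostat sample above.
IOSTAT = """\
xvdf 141.00 0.00 1100.00 0.00 29024.00 0.00 26.39 6.48 5.71 0.77 84.80
xvdg 160.00 0.00 1157.00 0.00 30304.00 0.00 26.19 6.90 5.87 0.81 93.20
xvdh 154.00 0.00 1154.00 0.00 29840.00 0.00 25.86 6.77 5.77 0.76 87.20
xvdi 159.00 0.00 1145.00 0.00 29880.00 0.00 26.10 6.32 5.48 0.74 85.20
xvdj 154.00 0.00 1135.00 0.00 29448.00 0.00 25.95 7.16 6.19 0.73 82.40
xvdk 151.00 0.00 1103.00 0.00 28928.00 0.00 26.23 5.91 5.28 0.77 85.20
xvdl 150.00 0.00 1058.00 0.00 28912.00 0.00 27.33 4.36 4.00 0.81 85.20
xvdm 150.00 0.00 1112.00 0.00 29424.00 0.00 26.46 5.92 5.19 0.83 92.80
dm-1 0.00 0.00 10171.00 0.00 237216.00 0.00 23.32 67.92 6.54 0.10 100.00
"""

COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rsec/s", "wsec/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]
SECTOR_BYTES = 512  # assumption: iostat reports sectors of 512 bytes

def parse(text):
    """Map each device name to a dict of its iostat columns."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(COLUMNS) + 1:
            continue  # skip anything that is not a full device row
        rows[parts[0]] = dict(zip(COLUMNS, map(float, parts[1:])))
    return rows

rows = parse(IOSTAT)
# Assumption: the xvd* rows listed here are the volumes striped under dm-1.
ebs = [r for name, r in rows.items() if name.startswith("xvd")]
lv = rows["dm-1"]

summary = {
    "ebs_reads_per_s": sum(r["r/s"] for r in ebs),
    "ebs_writes_per_s": sum(r["w/s"] for r in ebs),
    # Reads issued plus reads merged before being issued, for comparison with dm-1.
    "ebs_reads_incl_merged_per_s": sum(r["r/s"] + r["rrqm/s"] for r in ebs),
    "lv_reads_per_s": lv["r/s"],
    "lv_read_mib_per_s": lv["rsec/s"] * SECTOR_BYTES / 2**20,
    "lv_avg_request_kib": lv["avgrq-sz"] * SECTOR_BYTES / 1024,
}

for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

Under those assumptions, the sample shows a read-only burst: about 8,964 reads/s spread over the eight volumes, zero writes, roughly 115.8 MiB/s through dm-1, and an average request of about 11.7 KiB.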
Node configuration
node.master: false
node.data: true
bootstrap.mlockall: true
discovery.zen.ping.multicast.enabled: false
network.host: eth0:ipv4
path.conf: /etc/elasticsearch
path.data: /ebs/elasticsearch
path.logs: /data/logs/elasticsearch
path.plugins: /usr/share/elasticsearch/plugins
indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000
index.store.type: mmapfs
index.refresh_interval: 10s
threadpool.search.type: fixed
threadpool.search.size: 100
threadpool.search.queue_size: 200
threadpool.index.type: fixed
threadpool.index.size: 30
threadpool.index.queue_size: 1000
indices.fielddata.cache.size: 25%
indices.cluster.send_refresh_mapping: false
threadpool.bulk.queue_size: 3000
index.number_of_replicas: 1
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms
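For a quick view of the memory and queue settings above, the sketch below parses a subset of them and adds up the percentage caps. It assumes that both `indices.memory.index_buffer_size` and `indices.fielddata.cache.size` are expressed as a share of the JVM heap and act as upper bounds rather than fixed allocations; `parse_settings` is an illustrative helper, not part of Elasticsearch.

```python
# Subset of the node settings listed above, copied verbatim.
SETTINGS = """\
indices.memory.index_buffer_size: 50%
indices.fielddata.cache.size: 25%
threadpool.search.size: 100
threadpool.search.queue_size: 200
threadpool.index.size: 30
threadpool.index.queue_size: 1000
threadpool.bulk.queue_size: 3000
"""

def parse_settings(text):
    """Parse flat 'key: value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            settings[key.strip()] = value.strip()
    return settings

settings = parse_settings(SETTINGS)

# Assumption: every percentage here is a cap relative to the JVM heap.
heap_caps = {k: int(v.rstrip("%")) for k, v in settings.items() if v.endswith("%")}
heap_cap_total = sum(heap_caps.values())

# Requests that can wait in each thread pool queue before being rejected.
queue_sizes = {k: int(v) for k, v in settings.items() if k.endswith(".queue_size")}

print("heap caps (%):", heap_caps, "total:", heap_cap_total)
print("queue sizes:", queue_sizes)
```

With these values the two caps together allow up to 75% of the heap for the indexing buffer and fielddata. Since `index.store.type` is `mmapfs`, index reads rely on the OS page cache outside the heap, so the heap size relative to total RAM may matter when reading the iostat numbers.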