Unstable cluster performance

Hi!

I have an Elasticsearch cluster:
1 dedicated master node with 1 GB of heap
6 data nodes (32 CPUs, 31 GB heap, and about 100 GB more for the file cache; disk and network I/O utilization is 5% on all data nodes)
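
For clarity, "dedicated master" here means the usual role flags in elasticsearch.yml, roughly (a sketch using 6.x-style settings):

node.master: true
node.data: false
node.ingest: false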

I have two main indices that I'm writing to using Filebeat and Logstash.

xvd_logs - 120 million documents per day and 300 GB of data:

{
  "index_patterns": ["xvd_logs-*"],
  "settings": {
    "number_of_shards": 12,
    "index": {
      "refresh_interval": "10s"
    }
  },
  ...
}

wsk_log - 60 million documents per day and 60 GB of data:

{"order": 1,
"index_patterns": [
"wsk_log-*"
],
"settings": {
"index": {
"highlight": {
"max_analyzed_offset": "30000000"
},
"mapping": {
"total_fields": {
"limit": "10000"
}
},
"refresh_interval": "30s",
"number_of_shards": "12",
"merge": {
"scheduler": {
"max_thread_count": "1"}}}},

Both of these have static mappings that I've set in the index templates.
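
For reference, the templates are loaded via the template API, along these lines (the file names here are just illustrative):

curl -s --noproxy "*" -X PUT "http://elastic:9200/_template/xvd_logs" -H 'Content-Type: application/json' -d @xvd_logs-template.json
curl -s --noproxy "*" -X PUT "http://elastic:9200/_template/wsk_log" -H 'Content-Type: application/json' -d @wsk_log-template.json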

Bulk indexing requests pile up on all hosts at the same time. They back up for 10-30 minutes, then all get indexed at once, and the cluster works normally for a few minutes before the cycle repeats.

curl -s --noproxy "*" -X GET "http://elastic:9200/_cat/tasks?detailed&v" | wc -l 

1363
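
The same output can be narrowed down to just the bulk write tasks:

curl -s --noproxy "*" -X GET "http://elastic:9200/_cat/tasks?detailed&v" | grep 'indices:data/write/bulk' | wc -l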

action                         task_id                         parent_task_id                  type      start_time    timestamp running_time ip           node        description
indices:data/write/bulk        Jffd7IX0RCa-UR228cW2Xw:18951228 -                               transport 1552583715789 17:15:15  6.7m         10.4.4.75    node3   requests[4096], indices[wsk_log-2019.03.14]
indices:data/write/bulk[s]     Jffd7IX0RCa-UR228cW2Xw:18951231 Jffd7IX0RCa-UR228cW2Xw:18951228 transport 1552583715793 17:15:15  6.7m         10.4.4.75    node3   requests[354], index[wsk_log-2019.03.14]
indices:data/write/bulk[s]     xt-LfJPdQPaO7uC-lX7TFw:17591898 Jffd7IX0RCa-UR228cW2Xw:18951231 netty     1552583715816 17:15:15  6.7m         10.100.100.105   node6 requests[354], index[wsk_log-2019.03.14]
indices:data/write/bulk[s][p]  xt-LfJPdQPaO7uC-lX7TFw:17591899 xt-LfJPdQPaO7uC-lX7TFw:17591898 direct    1552583715816 17:15:15  6.7m         10.100.100.105   node6 requests[354], index[wsk_log-2019.03.14]
indices:data/write/bulk        ODJXLTi6SiqylmA-7mi1dA:22685282 -                               transport 1552583716138 17:15:16  6.7m         10.4.4.67    node1   requests[4096], indices[wsk_log-2019.03.14]
indices:data/write/bulk[s]     ODJXLTi6SiqylmA-7mi1dA:22685285 ODJXLTi6SiqylmA-7mi1dA:22685282 transport 1552583716147 17:15:16  6.7m         10.4.4.67    node1   requests[322], index[wsk_log-2019.03.14]
indices:data/write/bulk[s]     xt-LfJPdQPaO7uC-lX7TFw:17591934 ODJXLTi6SiqylmA-7mi1dA:22685285 netty     1552583716162 17:15:16  6.7m         10.100.100.105   node6 requests[322], index[wsk_log-2019.03.14]
indices:data/write/bulk[s][p]  xt-LfJPdQPaO7uC-lX7TFw:17591935 xt-LfJPdQPaO7uC-lX7TFw:17591934 direct    1552583716162 17:15:16  6.7m         10.100.100.105   node6 requests[322], index[wsk_log-2019.03.14]
indices:data/write/bulk        Jffd7IX0RCa-UR228cW2Xw:18951289 -                               transport 1552583716156 17:15:16  6.7m         10.4.4.75    node3   requests[4096], indices[wsk_log-2019.03.14]
indices:data/write/bulk[s]     Jffd7IX0RCa-UR228cW2Xw:18951292 Jffd7IX0RCa-UR228cW2Xw:18951289 transport 1552583716162 17:15:16  6.7m         10.4.4.75    node3   requests[351], index[wsk_log-2019.03.14]
indices:data/write/bulk[s]     xt-LfJPdQPaO7uC-lX7TFw:17591938 Jffd7IX0RCa-UR228cW2Xw:18951292 netty     1552583716178 17:15:16  6.7m         10.100.100.105   node6 requests[351], index[wsk_log-2019.03.14]
indices:data/write/bulk[s][p]  xt-LfJPdQPaO7uC-lX7TFw:17591939 xt-LfJPdQPaO7uC-lX7TFw:17591938 direct    1552583716178 17:15:16  6.7m         10.100.100.105   node6 requests[351], index[wsk_log-2019.03.14]
...
...
...
indices:data/write/bulk[s]     Jffd7IX0RCa-UR228cW2Xw:18980372 Jffd7IX0RCa-UR228cW2Xw:18980363 transport 1552584122394 17:22:02  14.3ms       10.4.4.75    node3   requests[328], index[wsk_log-2019.03.14]
indices:data/write/bulk[s]     Jffd7IX0RCa-UR228cW2Xw:18980374 Jffd7IX0RCa-UR228cW2Xw:18980363 transport 1552584122394 17:22:02  13.9ms       10.4.4.75    node3   requests[353], index[wsk_log-2019.03.14]
indices:data/write/bulk[s]     Jffd7IX0RCa-UR228cW2Xw:18980375 Jffd7IX0RCa-UR228cW2Xw:18980363 transport 1552584122395 17:22:02  13.4ms       10.4.4.75    node3   requests[352], index[wsk_log-2019.03.14]
indices:data/write/bulk[s][p]  Jffd7IX0RCa-UR228cW2Xw:18980377 Jffd7IX0RCa-UR228cW2Xw:18980375 direct    1552584122395 17:22:02  13.4ms       10.4.4.75    node3   requests[352], index[wsk_log-2019.03.14]
indices:data/write/bulk[s]     Jffd7IX0RCa-UR228cW2Xw:18980376 Jffd7IX0RCa-UR228cW2Xw:18980363 transport 1552584122395 17:22:02  13.4ms       10.4.4.75    node3   requests[339], index[wsk_log-2019.03.14]

The Filebeat logs contain tons of "Timeout exceeded while awaiting headers" errors against all nodes:

2019-03-14T19:29:27.164+0300  ERROR  pipeline/output.go:121  Failed to publish events: Post http://node3:9200/_bulk: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019-03-14T20:00:23.654+0300  ERROR  elasticsearch/client.go:319  Failed to perform any bulk index operations: Post http://node2:9200/_bulk: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019-03-14T20:00:24.949+0300  ERROR  pipeline/output.go:121  Failed to publish events: Post http://node1:9200/_bulk: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
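
For context, the output section of my filebeat.yml looks roughly like this (a sketch; the hosts list is illustrative, and 90s is Filebeat's default request timeout):

output.elasticsearch:
  hosts: ["node1:9200", "node2:9200", "node3:9200"]
  timeout: 90          # seconds; the errors above mean this limit was hit
  bulk_max_size: 50    # events per bulk request sent by Filebeat itself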

[Screenshot: Filebeat monitoring page in Kibana]

I'm pretty sure that I caused this: we want to write 2-3 times as much data next month, so I was playing around with different settings (shard count, refresh_interval, async translog, etc.). I reverted all the obvious settings back, but that didn't help :smile:
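
To verify what is actually live on the current indices, the settings can be read back:

curl -s --noproxy "*" -X GET "http://elastic:9200/wsk_log-2019.03.14/_settings?pretty"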

Adding GC logs from two nodes, captured while the cluster was having performance problems:

node1

2019-03-15T12:07:53.165+0300: 229891.185: Total time for which application threads were stopped: 0.0303483 seconds, Stopping threads took: 0.0002258 seconds
2019-03-15T12:08:02.244+0300: 229900.263: [GC (Allocation Failure) 2019-03-15T12:08:02.244+0300: 229900.264: [ParNew
Desired survivor size 100302848 bytes, new threshold 6 (max 6)
- age 1: 82932328 bytes, 82932328 total
- age 2: 4693544 bytes, 87625872 total
- age 3: 7913040 bytes, 95538912 total
: 1628314K->112969K(1763584K), 0.0280728 secs] 7974487K->6459143K(24969920K), 0.0284540 secs] [Times: user=0.56 sys=0.00, real=0.03 secs]
2019-03-15T12:08:02.272+0300: 229900.292: Total time for which application threads were stopped: 0.0317026 seconds, Stopping threads took: 0.0003983 seconds
2019-03-15T12:08:02.277+0300: 229900.297: Total time for which application threads were stopped: 0.0039949 seconds, Stopping threads took: 0.0004859 seconds
2019-03-15T12:08:02.279+0300: 229900.299: Total time for which application threads were stopped: 0.0024863 seconds, Stopping threads took: 0.0002720 seconds
2019-03-15T12:08:15.812+0300: 229913.832: [GC (Allocation Failure) 2019-03-15T12:08:15.812+0300: 229913.832: [ParNew
Desired survivor size 100302848 bytes, new threshold 6 (max 6)
- age 1: 26757384 bytes, 26757384 total
- age 2: 46704808 bytes, 73462192 total
- age 3: 3417424 bytes, 76879616 total
- age 4: 7833592 bytes, 84713208 total
: 1680579K->118888K(1763584K), 0.0239332 secs] 8026752K->6465062K(24969920K), 0.0244503 secs] [Times: user=0.47 sys=0.02, real=0.02 secs]
2019-03-15T12:08:15.836+0300: 229913.856: Total time for which application threads were stopped: 0.0285061 seconds, Stopping threads took: 0.0002723 seconds
2019-03-15T12:08:20.962+0300: 229918.982: [GC (Allocation Failure) 2019-03-15T12:08:20.963+0300: 229918.982: [ParNew
Desired survivor size 100302848 bytes, new threshold 3 (max 6)
- age 1: 56353752 bytes, 56353752 total
- age 2: 15291736 bytes, 71645488 total
- age 3: 45819696 bytes, 117465184 total
- age 4: 3416736 bytes, 120881920 total
- age 5: 7833288 bytes, 128715208 total
: 1686223K->165912K(1763584K), 0.0337720 secs] 8032396K->6512086K(24969920K), 0.0342510 secs] [Times: user=0.68 sys=0.01, real=0.03 secs]
2019-03-15T12:08:20.997+0300: 229919.017: Total time for which application threads were stopped: 0.0384867 seconds, Stopping threads took: 0.0003864 seconds
2019-03-15T12:08:23.674+0300: 229921.694: [GC (Allocation Failure) 2019-03-15T12:08:23.675+0300: 229921.694: [ParNew
Desired survivor size 100302848 bytes, new threshold 6 (max 6)
- age 1: 55795896 bytes, 55795896 total
- age 2: 33573752 bytes, 89369648 total
- age 3: 8971816 bytes, 98341464 total
: 1733592K->143921K(1763584K), 0.0633568 secs] 8079766K->6545101K(24969920K), 0.0636862 secs] [Times: user=0.98 sys=0.01, real=0.06 secs]
2019-03-15T12:08:23.738+0300: 229921.758: Total time for which application threads were stopped: 0.0668550 seconds, Stopping threads took: 0.0003665 seconds
2019-03-15T12:08:23.741+0300: 229921.761: Total time for which application threads were stopped: 0.0027142 seconds, Stopping threads took: 0.0001819 seconds
2019-03-15T12:08:26.708+0300: 229924.728: [GC (Allocation Failure) 2019-03-15T12:08:26.708+0300: 229924.728: [ParNew
Desired survivor size 100302848 bytes, new threshold 2 (max 6)
- age 1: 96512632 bytes, 96512632 total
- age 2: 29608200 bytes, 126120832 total
- age 3: 19138776 bytes, 145259608 total
- age 4: 8860632 bytes, 154120240 total
: 1711601K->192904K(1763584K), 0.0352229 secs] 8112781K->6594084K(24969920K), 0.0356263 secs] [Times: user=0.74 sys=0.01, real=0.04 secs]
2019-03-15T12:08:26.744+0300: 229924.763: Total time for which application threads were stopped: 0.0397827 seconds, Stopping threads took: 0.0004029 seconds
2019-03-15T12:08:28.506+0300: 229926.525: Total time for which application threads were stopped: 0.0044951 seconds, Stopping threads took: 0.0004016 seconds
2019-03-15T12:08:29.064+0300: 229927.083: Total time for which application threads were stopped: 0.0039555 seconds, Stopping threads took: 0.0002878 seconds
2019-03-15T12:08:29.379+0300: 229927.399: [GC (Allocation Failure) 2019-03-15T12:08:29.380+0300: 229927.399: [ParNew
Desired survivor size 100302848 bytes, new threshold 2 (max 6)
- age 1: 81767344 bytes, 81767344 total
- age 2: 38230472 bytes, 119997816 total
: 1760382K->171595K(1763584K), 0.0940012 secs] 8161562K->6619717K(24969920K), 0.0944432 secs] [Times: user=1.87 sys=0.00, real=0.10 secs]
2019-03-15T12:08:29.474+0300: 229927.494: Total time for which application threads were stopped: 0.0987340 seconds, Stopping threads took: 0.0005029 seconds
2019-03-15T12:08:29.481+0300: 229927.501: Total time for which application threads were stopped: 0.0072597 seconds, Stopping threads took: 0.0002741 seconds
2019-03-15T12:08:29.910+0300: 229927.929: Total time for which application threads were stopped: 0.0042111 seconds, Stopping threads took: 0.0005745 seconds
2019-03-15T12:08:32.372+0300: 229930.391: [GC (Allocation Failure) 2019-03-15T12:08:32.372+0300: 229930.392: [ParNew
Desired survivor size 100302848 bytes, new threshold 2 (max 6)
- age 1: 46041240 bytes, 46041240 total
- age 2: 60850320 bytes, 106891560 total
: 1739082K->148034K(1763584K), 0.0552605 secs] 8187205K->6632623K(24969920K), 0.0556744 secs] [Times: user=1.08 sys=0.02, real=0.06 secs]
2019-03-15T12:08:32.427+0300: 229930.447: Total time for which application threads were stopped: 0.0600076 seconds, Stopping threads took: 0.0005303 seconds
2019-03-15T12:08:32.432+0300: 229930.451: Total time for which application threads were stopped: 0.0039790 seconds, Stopping threads took: 0.0003908 seconds

node2

2019-03-15T12:08:25.890+0300: 229956.414: Total time for which application threads were stopped: 0.0330919 seconds, Stopping threads took: 0.0002186 seconds
2019-03-15T12:08:31.648+0300: 229962.172: [GC (Allocation Failure) 2019-03-15T12:08:31.648+0300: 229962.172: [ParNew
Desired survivor size 143917056 bytes, new threshold 3 (max 6)
- age 1: 87153128 bytes, 87153128 total
- age 2: 50197528 bytes, 137350656 total
- age 3: 33901680 bytes, 171252336 total
- age 4: 9472312 bytes, 180724648 total
- age 5: 6235264 bytes, 186959912 total
: 2406405K->227611K(2530368K), 0.0233377 secs] 9579354K->7400560K(16496128K), 0.0236024 secs] [Times: user=0.66 sys=0.02, real=0.02 secs]
2019-03-15T12:08:31.672+0300: 229962.196: Total time for which application threads were stopped: 0.0264758 seconds, Stopping threads took: 0.0001619 seconds
2019-03-15T12:08:42.706+0300: 229973.230: [GC (Allocation Failure) 2019-03-15T12:08:42.706+0300: 229973.231: [ParNew
Desired survivor size 143917056 bytes, new threshold 6 (max 6)
- age 1: 49637392 bytes, 49637392 total
- age 2: 25997944 bytes, 75635336 total
- age 3: 41338864 bytes, 116974200 total
: 2476660K->185824K(2530368K), 0.0540454 secs] 9649609K->7405942K(16496128K), 0.0544034 secs] [Times: user=0.65 sys=0.65, real=0.05 secs]
2019-03-15T12:08:42.761+0300: 229973.285: Total time for which application threads were stopped: 0.0623779 seconds, Stopping threads took: 0.0003725 seconds
2019-03-15T12:08:42.763+0300: 229973.287: Total time for which application threads were stopped: 0.0021573 seconds, Stopping threads took: 0.0001647 seconds
2019-03-15T12:08:42.765+0300: 229973.289: Total time for which application threads were stopped: 0.0016101 seconds, Stopping threads took: 0.0000717 seconds
2019-03-15T12:08:51.202+0300: 229981.726: [GC (Allocation Failure) 2019-03-15T12:08:51.202+0300: 229981.726: [ParNew
Desired survivor size 143917056 bytes, new threshold 6 (max 6)
- age 1: 38139808 bytes, 38139808 total
- age 2: 14452864 bytes, 52592672 total
- age 3: 17936584 bytes, 70529256 total
- age 4: 33019976 bytes, 103549232 total
: 2435104K->132814K(2530368K), 0.0176074 secs] 9655222K->7352933K(16496128K), 0.0178895 secs] [Times: user=0.46 sys=0.04, real=0.02 secs]
2019-03-15T12:08:51.220+0300: 229981.744: Total time for which application threads were stopped: 0.0202756 seconds, Stopping threads took: 0.0001998 seconds
2019-03-15T12:08:58.912+0300: 229989.436: [GC (Allocation Failure) 2019-03-15T12:08:58.912+0300: 229989.436: [ParNew
Desired survivor size 143917056 bytes, new threshold 6 (max 6)
- age 1: 69549320 bytes, 69549320 total
- age 2: 2470632 bytes, 72019952 total
- age 3: 8552200 bytes, 80572152 total
- age 4: 4901240 bytes, 85473392 total
- age 5: 6274184 bytes, 91747576 total
: 2382094K->112943K(2530368K), 0.0153911 secs] 9602213K->7333061K(16496128K), 0.0156701 secs] [Times: user=0.38 sys=0.04, real=0.02 secs]
2019-03-15T12:08:58.928+0300: 229989.452: Total time for which application threads were stopped: 0.0180974 seconds, Stopping threads took: 0.0002262 seconds
2019-03-15T12:08:58.930+0300: 229989.454: Total time for which application threads were stopped: 0.0021106 seconds, Stopping threads took: 0.0001279 seconds
2019-03-15T12:09:02.461+0300: 229992.986: [GC (Allocation Failure) 2019-03-15T12:09:02.462+0300: 229992.986: [ParNew
Desired survivor size 143917056 bytes, new threshold 4 (max 6)
- age 1: 98317120 bytes, 98317120 total
- age 2: 41850960 bytes, 140168080 total
- age 3: 2275728 bytes, 142443808 total
- age 4: 8542560 bytes, 150986368 total
- age 5: 4860376 bytes, 155846744 total
- age 6: 6224384 bytes, 162071128 total
: 2362217K->195981K(2530368K), 0.0197906 secs] 9582335K->7416099K(16496128K), 0.0200575 secs] [Times: user=0.53 sys=0.04, real=0.02 secs]
2019-03-15T12:09:02.482+0300: 229993.006: Total time for which application threads were stopped: 0.0226661 seconds, Stopping threads took: 0.0002500 seconds
2019-03-15T12:09:05.499+0300: 229996.023: [GC (Allocation Failure) 2019-03-15T12:09:05.499+0300: 229996.023: [ParNew
Desired survivor size 143917056 bytes, new threshold 3 (max 6)
- age 1: 84513344 bytes, 84513344 total
- age 2: 58521688 bytes, 143035032 total
- age 3: 21902528 bytes, 164937560 total
- age 4: 2090864 bytes, 167028424 total
: 2445261K->222052K(2530368K), 0.0271073 secs] 9665379K->7459848K(16496128K), 0.0273890 secs] [Times: user=0.63 sys=0.04, real=0.03 secs]
2019-03-15T12:09:05.526+0300: 229996.051: Total time for which application threads were stopped: 0.0297666 seconds, Stopping threads took: 0.0001703 seconds
2019-03-15T12:09:10.088+0300: 230000.612: [GC (Allocation Failure) 2019-03-15T12:09:10.088+0300: 230000.612: [ParNew
Desired survivor size 143917056 bytes, new threshold 2 (max 6)
- age 1: 94554216 bytes, 94554216 total
- age 2: 70974504 bytes, 165528720 total
- age 3: 42203256 bytes, 207731976 total
: 2471024K->260994K(2530368K), 0.0566352 secs] 9708821K->7515180K(16496128K), 0.0569177 secs] [Times: user=0.72 sys=0.91, real=0.06 secs]
2019-03-15T12:09:10.145+0300: 230000.669: Total time for which application threads were stopped: 0.0646817 seconds, Stopping threads took: 0.0004572 seconds
2019-03-15T12:09:10.147+0300: 230000.671: Total time for which application threads were stopped: 0.0020909 seconds, Stopping threads took: 0.0001169 seconds
2019-03-15T12:09:10.149+0300: 230000.673: Total time for which application threads were stopped: 0.0015765 seconds, Stopping threads took: 0.0000555 seconds
2019-03-15T12:09:14.358+0300: 230004.883: Total time for which application threads were stopped: 0.0024226 seconds, Stopping threads took: 0.0002730 seconds
2019-03-15T12:09:15.322+0300: 230005.847: [GC (Allocation Failure) 2019-03-15T12:09:15.323+0300: 230005.847: [ParNew
Desired survivor size 143917056 bytes, new threshold 6 (max 6)
- age 1: 39115360 bytes, 39115360 total
- age 2: 45699688 bytes, 84815048 total
: 2510274K->179224K(2530368K), 0.0362748 secs] 9764460K->7501352K(16496128K), 0.0365657 secs] [Times: user=0.71 sys=0.04, real=0.04 secs]
2019-03-15T12:09:15.359+0300: 230005.883: Total time for which application threads were stopped: 0.0388821 seconds, Stopping threads took: 0.0001546 seconds
2019-03-15T12:09:15.361+0300: 230005.885: Total time for which application threads were stopped: 0.0020889 seconds, Stopping threads took: 0.0001092 seconds
2019-03-15T12:09:15.363+0300: 230005.887: Total time for which application threads were stopped: 0.0016764 seconds, Stopping threads took: 0.0001010 seconds
2019-03-15T12:09:28.265+0300: 230018.789: [GC (Allocation Failure) 2019-03-

Why have you set this to such a large value? This seems considerably larger than what is recommended. What does heap usage look like over time on the Elasticsearch nodes?
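
A quick point-in-time snapshot is available from the cat API, e.g.:

curl -s --noproxy "*" -X GET "http://elastic:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"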

Heap usage on 4 nodes from yesterday, when I was having problems:

[Screenshot: node3, yesterday]

[Screenshot: heap usage on node1 when I didn't have problems]

[Screenshot: indexing on node1 when I didn't have problems]

[Screenshot: indexing on node1 from yesterday]

I'm not including all the nodes, but they have the same problems.

Write thread pool stats while the cluster is having issues, gathered with something like the command below:
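
curl -s --noproxy "*" -X GET "http://elastic:9200/_cat/thread_pool/write?v&h=node_name,node_id,ephemeral_node_id,pid,host,ip,port,name,type,active,size,queue,queue_size,rejected,largest,completed,min,max,keep_alive"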

node_name   node_id                ephemeral_node_id      pid    host         ip           port name  type  active size queue queue_size rejected largest completed min max keep_alive
node1       ODJXLTi6SiqylmA-7mi1dA qCll0d6US2KKRUUDHBPfSw 110715 10.4.4.67    10.4.4.67    9300 write fixed     11   32     0        200        0      32   8425800  32  32           
node2       xt-LfJPdQPaO7uC-lX7TFw 4XEMPI0BQVqfzbii8MVksw 1      10.4.2.105   10.4.2.105   9301 write fixed      5   80     0        200        0      80   7945488  80  80           
node3       GX7h-MJNS-6fHoKrPE5CYw 5Qy2wYDxR5ayGdnBBUaqVg 1      10.4.2.105   10.4.2.105   9300 write fixed      9   80     0        200        0      80   7575050  80  80           
node4       2di2-HhiT96b5Dc_b3iFCg 3s0T1Y9BR-OHVLZ0yl6H2w 101272 10.4.4.68    10.4.4.68    9300 write fixed      4   32     0        200      618      32   7549929  32  32           
node5       4l82ClI6RcGrcX5LE8NjfQ r-cmAP5DQMS8kfimb6YaYA 1      10.4.252.141 10.4.252.141 9300 write fixed      0    0     0        200        0       0         0  16  16           
node6       2WL_ibqfQwqcaS9bieWhaA MaJJyEHRRwy0x-A4CPFutQ 5126   10.4.2.43    10.4.2.43    9300 write fixed      0   32     0        200        0      32   1186904  32  32           
node7       i4NMLaq2QZSzyE16d2HIjg MJsA5C9BRS6wby1tW73t0w 15453  10.4.252.140 10.4.252.140 9300 write fixed      0   24     0        200        0      24     70997  24  24           
node8       Jffd7IX0RCa-UR228cW2Xw ELSBsa-bRO6YjoJlRjKsvA 391302 10.4.4.75    10.4.4.75    9300 write fixed      3   48     0        200        0      48   7628514  48  48

I fixed the performance issue by decreasing the Logstash bulk request size for one of my main indices from 4096 to 128.
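
In Logstash the bulk request size is governed by the pipeline batch size, so the change amounts to something like this in logstash.yml (a sketch; 4096 was my previous value):

pipeline.batch.size: 128   # events per worker per bulk request; was 4096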
