Hello,
I upgraded the Elastic Stack from 6.8.2 to 7.6.2 and I am experiencing degraded performance on 7.6.2. Both clusters currently run on EKS, with Kubernetes nodes on AWS m5.4xlarge machines (16 vCPUs, 64 GB; the disks are NOT SSDs).
| Setting | 6.8.2 setup | 7.6.2 setup |
| --- | --- | --- |
| nodes: client (6.8.2) / coordinator (7.6.2): replicas | 4 | 6 |
| nodes: client / coordinator: jvm heap | 2gb | 1gb |
| nodes: client / coordinator: cpu | 1 | 1 |
| nodes: client / coordinator: mem | 4gb | 2gb |
| nodes: data: replicas | 8 | 8 |
| nodes: data: jvm heap | 8gb | 8gb |
| nodes: data: cpu | 1 | 6 |
| nodes: data: mem | 16gb | 16gb |
| nodes: master: replicas | 3 | 3 |
| nodes: master: jvm heap | 1gb | 4gb |
| nodes: master: cpu | 1 | 2 |
| nodes: master: mem | 2gb | 8gb |
| indices: number_of_shards | 2 | 6 |
| indices: number_of_replicas | 1 | 1 |
| indices: refresh_interval | 30s | 60s |
| indices: active_primary_shards | 12241 | 1591 |
| indices: active_shards | 24482 | 3182 |
6.8.2 has daily indices so it has way more shards.
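To put the shard counts in perspective, here is a rough per-node calculation. It is only a sketch: it assumes shards are spread evenly across the 8 data nodes, which I have not verified, and the function names are mine.

```python
def shards_per_node(active_shards, data_nodes):
    """Average shards per data node, assuming an even spread (an assumption)."""
    return active_shards / data_nodes


def shards_per_gb_heap(active_shards, data_nodes, heap_gb):
    """Average shards per GB of JVM heap on each data node."""
    return shards_per_node(active_shards, data_nodes) / heap_gb


# Numbers taken from the setup above: 8 data nodes with 8gb heap in both clusters.
old_per_node = shards_per_node(24482, 8)        # 6.8.2
new_per_node = shards_per_node(3182, 8)         # 7.6.2
old_per_gb = shards_per_gb_heap(24482, 8, 8)    # 6.8.2
new_per_gb = shards_per_gb_heap(3182, 8, 8)     # 7.6.2

print(f"6.8.2: {old_per_node:.0f} shards/node, {old_per_gb:.0f} shards per GB heap")
print(f"7.6.2: {new_per_node:.0f} shards/node, {new_per_gb:.0f} shards per GB heap")
```

So the older cluster carries roughly eight times as many shards per data node, which makes its better performance even more surprising to me.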
7.6.2 does not have date-named indices. Instead it uses a rollover strategy with zero-padded, numbered index names (e.g. fluentd-<k8s-namespace>-000001): it calls the Rollover API every hour, rolls over hot indices that are 7gb or larger, and shrinks warm indices. (Note: I am not using ILM or Curator since I am using dynamic index naming.)
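For clarity, the hourly rollover call looks roughly like the sketch below. It only builds the request (method, path, JSON body) for the Rollover API with a `max_size` condition; the alias name `fluentd-default` and the helper name are made up for illustration and are not my real names.

```python
import json


def build_rollover_request(alias, max_size="7gb"):
    """Build a Rollover API request for a write alias (alias name is hypothetical).

    Returns (method, path, body). The index is only rolled over if the
    max_size condition is met at the time of the call.
    """
    path = f"/{alias}/_rollover"
    body = {"conditions": {"max_size": max_size}}
    return "POST", path, json.dumps(body)


method, path, body = build_rollover_request("fluentd-default")
print(method, path)
print(body)
```

The request would then be sent to the cluster once an hour for each alias.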
Everything else seems to be the same:
- write.queue_size is 200
- indices.memory.index_buffer_size is 50%
Although 6.8.2 holds way more data (since 7.6.2 is fairly new), it is performing better. Both clusters receive the same amount of data and both are returning a ton of 429s, but 6.8.2 has no lag whereas 7.6.2 does, according to topk(50, kafka_consumergroup_group_max_lag_seconds{group=~"logstash-es"}).
rate(elasticsearch_indices_indexing_index_total{cluster="$cluster",name=~"$name"}[$interval]) shows that 99% of the time, the data nodes are indexing!
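One way to see which data nodes are producing the 429s is to look at the write thread pool with `GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected`. The sketch below parses that text output and lists nodes with rejections. The sample text is invented for illustration (it is not output from my clusters), and the function names are mine.

```python
def parse_cat_thread_pool(text):
    """Parse whitespace-separated _cat output (with a header row) into dicts."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def nodes_with_rejections(rows):
    """Return the names of nodes whose write pool has rejected any requests."""
    return [r["node_name"] for r in rows if int(r["rejected"]) > 0]


# Hypothetical sample, NOT real output from either cluster.
SAMPLE = """
node_name name  active queue rejected
data-0    write      1   200     1500
data-1    write      0     0        0
"""

rows = parse_cat_thread_pool(SAMPLE)
print(nodes_with_rejections(rows))
```

If the rejections are concentrated on a few nodes, that would point at uneven shard placement rather than overall capacity.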
I have tried:
- more CPU (though CPU usage avg is low)
- more memory (though memory usage avg is low)
- more JVM heap (though JVM heap usage avg is low)
I have not tried:
- more Logstash consumers (since the write queues were full, it didn't make sense)
- write.queue_size to 1000 (as this will only take care of 429s)
- more data nodes (I was going to try this soon)
What could I be missing?