Hello,
I upgraded Elastic Stack from 6.8.2 to 7.6.2 (both clusters are currently running on EKS, with Kubernetes nodes on AWS m5.4xlarge machines: 16 CPUs, 64GB RAM, non-SSD disks) and I am experiencing degraded performance on 7.6.2.
| | 6.8.2 setup | 7.6.2 setup |
|---|---|---|
| **nodes: client / coordinator** | | |
| replicas | 4 | 6 |
| jvm heap | 2gb | 1gb |
| cpu | 1 | 1 |
| mem | 4gb | 2gb |
| **nodes: data** | | |
| replicas | 8 | 8 |
| jvm heap | 8gb | 8gb |
| cpu | 1 | 6 |
| mem | 16gb | 16gb |
| **nodes: master** | | |
| replicas | 3 | 3 |
| jvm heap | 1gb | 4gb |
| cpu | 1 | 2 |
| mem | 2gb | 8gb |
| **indices** | | |
| number_of_shards | 2 | 6 |
| number_of_replicas | 1 | 1 |
| refresh_interval | 30s | 60s |
| active_primary_shards | 12241 | 1591 |
| active_shards | 24482 | 3182 |
6.8.2 has daily indices, so it has far more shards.
7.6.2 does not have time-based indices; instead it has a rollover strategy using time-series indices with a zero-padded `-n` suffix (e.g. `fluentd-<k8s-namespace>-000001`). It calls the Rollover API every hour, rolling over hot indices if they are 7gb or larger and shrinking warm indices. (Note: I am not using ILM or Curator, since I am using dynamic index naming.)
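For context, the hourly rollover call is roughly the following (a sketch; the alias follows my dynamic naming, so `<k8s-namespace>` is a placeholder, and only the size condition described above is shown):

```
POST /fluentd-<k8s-namespace>/_rollover
{
  "conditions": {
    "max_size": "7gb"
  }
}
```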
Everything else seems to be the same:
- `write.queue_size`s are 200
- `indices.memory.index_buffer_size`s are 50%
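In other words, both clusters run with the equivalent of the following (a sketch of just the two settings above; `thread_pool.write.queue_size` is the full name of the write queue setting):

```yaml
# elasticsearch.yml (sketch of the settings referenced above)
thread_pool.write.queue_size: 200
indices.memory.index_buffer_size: 50%
```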
Although 6.8.2 has way more data (since 7.6.2 is fairly new), it is performing better. They are even receiving the same amount of data, and both are getting a ton of 429s, but 6.8.2 has no lag whereas 7.6.2 does (according to `topk(50, kafka_consumergroup_group_max_lag_seconds{group=~"logstash-es"})`).
`rate(elasticsearch_indices_indexing_index_total{cluster="$cluster",name=~"$name"}[$interval])` shows that 99% of the time, the data nodes are indexing!
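To compare the 429s across the two clusters more directly, the write thread pool's per-node `rejected` counters can be pulled from `GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected`. A minimal sketch that parses that text output and reports rejecting nodes (the sample data is made up, not from my clusters):

```python
# Sketch: parse the text output of
#   GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected
# and report nodes whose write pool has rejections (the source of the 429s).
# SAMPLE is made-up output for illustration only.

SAMPLE = """\
node_name active queue rejected
data-0    6      198   10342
data-1    6      200   20877
coord-0   0      0     0
"""

def rejections_by_node(cat_output: str) -> dict[str, int]:
    """Map node_name -> rejected count from _cat/thread_pool text output."""
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    idx_name = header.index("node_name")
    idx_rej = header.index("rejected")
    return {
        parts[idx_name]: int(parts[idx_rej])
        for parts in (line.split() for line in lines[1:])
    }

if __name__ == "__main__":
    for node, rejected in rejections_by_node(SAMPLE).items():
        if rejected:
            print(f"{node}: {rejected} rejected writes")
```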
I have tried:
- more CPU (though CPU usage avg is low)
- more memory (though memory usage avg is low)
- more JVM heap (though JVM heap usage avg is low)
I have not tried:
- more Logstash consumers (since the `write` queues were full, it didn't make sense)
- `write.queue_size` of 1000 (as this will only take care of the 429s)
- more data nodes (I was going to try this soon)
What could I be missing?