Hello,
I'm looking for brave and clever people who can help out someone in a pinch.
I've been struggling to figure out why one of my (very busy) indices is lagging behind. I've read various posts, suggestions, etc. I have enough resources and have tried to optimize everything as best I could. Unfortunately, I can't test every lever one by one because of the sheer number of events, and replicating the environment would cost $$$$.
When I stop sending data to the message queue, it takes a long time until the last event is indexed, and that time depends on how long the pipeline had been running. As time passes, the index falls further and further behind the current time.
The pipeline goes like this:
- A couple of Filebeat instances are forwarding JSON events from log files to a Kafka cluster (a minimal sketch of this stage is just below this list).
- Logstash instances are pulling data from the message queue, doing some pre-processing, and sending batches to the Elasticsearch cluster.
- Elasticsearch nodes are indexing the events.
All VMs are in the same Google data center, so network latency should not be a factor.
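For context, the Filebeat side is nothing exotic. Here is a minimal sketch of that stage; the paths, broker addresses, and topic name are placeholders, not the exact config:

```
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.json               # placeholder path to the JSON log files

output.kafka:
  hosts: ["kafka-1:9092", "kafka-2:9092"]  # placeholder broker addresses
  topic: "http-logs"                       # placeholder topic name
```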
Each event is relatively short (HTTP-log-like), around 1100-1300 characters. However, depending on various factors there are 20K-55K events every second (most of the time in the 40K-50K range), which works out to very roughly 45-65 MB/s of raw JSON in the typical range.
I have multiple pipelines (30+), but those are a lot quieter. I have also seen the ES nodes perform better (around +20%) during the busiest periods, so they seem to have headroom.
That's why I think that either Logstash can't pull data from Kafka at an optimal rate, or it can't push batches to Elasticsearch fast enough. However, I can't figure out what I could change to make the nodes work harder. Obviously, I don't want to throw more hardware at the problem, since IMO there is already plenty of wiggle room.
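If it helps, this is where the backlog should be visible: consumer group lag on the Kafka side, and the per-plugin counters and timings from the Logstash monitoring API. The broker address and group name below are placeholders:

```
# Per-partition lag of the Logstash consumer group
kafka-consumer-groups.sh --bootstrap-server kafka-1:9092 \
  --describe --group logstash

# Logstash pipeline throughput and per-plugin timings
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'
```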
I'll describe the most relevant details of the affected clusters below; let me know if I've left anything out.
Elasticsearch:
version: 7.9.2
Multiple ES data VMs in the cluster, plus an extra coordinating node. Currently, no search requests are being sent to the indices in question.
Data node resources: 32 vCPUs, 120 GB of memory, multiple SSDs
Metrics:
- heap around 70% (as it's set)
- CPU 40-50%
- write throughput ~20%, write IOPS < 10%
- read throughput ~ 4%, read IOPS < 10%
Index settings:
- settings.index.codec: "best_compression"
- settings.index.refresh_interval: "60s"
- settings.index.number_of_shards: "30"
- settings.index.number_of_replicas: "0"
Index mapping:
- 80 fields
- no dynamic types/regex/extra analyzer
Logstash:
version: 7.9.2
Multiple Logstash VMs are connecting to Kafka and the ES cluster.
Node resources: 16 vCPUs, 14.4 GB of memory
Metrics:
- heap around 70% (as it's set)
- CPU 50-70%
- almost nonexistent I/O operations
Kibana shows 0.3 ms/e (milliseconds per event) for the current version of the Logstash configuration. Back-of-the-envelope: at that rate, 6 workers would top out at roughly 20K events/s per Logstash node if filter time were the only limit.
Pipeline settings:
pipeline.workers: 6
pipeline.batch.size: 1000
pipeline.ordered: false
Logstash queue settings:
queue.type: persisted
queue.drain: true
queue.max_bytes: 72mb
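For completeness, here is a stripped-down sketch of the pipeline config itself. Broker, topic, group, host, and index names are placeholders, consumer_threads is an assumption based on the 6 partitions per Logstash node listed under Kafka below, and the real filter section is omitted:

```
input {
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092"  # placeholder brokers
    topics            => ["http-logs"]                # placeholder topic
    group_id          => "logstash"                   # placeholder group
    codec             => json
    consumer_threads  => 6                            # assumption: one thread per partition owned by this node
  }
}

filter {
  # the actual pre-processing lives here
}

output {
  elasticsearch {
    hosts => ["http://es-coord:9200"]      # placeholder ES endpoint
    index => "http-logs-%{+YYYY.MM.dd}"    # placeholder index pattern
  }
}
```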
Kafka:
Metrics:
- CPU 40-55%
- almost nonexistent I/O operations
Topic settings:
- partitions: number of Logstash nodes × 6 (i.e. 6 partitions per Logstash instance)
- replication factor: 2
ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = latest
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
heartbeat.interval.ms = 3000
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = [ ]
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 50
reconnect.backoff.ms = 50
request.timeout.ms = 40000
retry.backoff.ms = 100
send.buffer.bytes = 131072
session.timeout.ms = 10000
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
I'm eagerly awaiting any ideas that could help raise the indexing rate.