Hello,
Hope y'all are having a good day!
I really need some assistance with a performance issue in our logging pipeline.
In a nutshell, we have this common setup:
[journalbeat] > [logstash] > [elasticsearch]
While doing performance tests, there is a huge gap between the total number of HTTP requests logged by journalbeat and the number that reaches elasticsearch: only ~5% of the requests show up in ES. You would think such a huge loss would be obvious to spot...
The simple performance test follows:
[jmeter] > [microservice] > [journalbeat] > [logstash]
When sending more than 700,000 HTTP requests in 6:30 minutes (throughput ~1800 req/s), kibana reports only around 30,000 requests on average.
Once the perf test starts, the journalbeat logs show a higher rate of libbeat published_events, which roughly matches the test throughput:
Sep 04 21:25:00 host docker[1624]: 2017/09/04 21:25:00.914012 metrics.go:39: INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=22 libbeat.logstash.publish.read_bytes=132 libbeat.logstash.publish.write_bytes=46565 libbeat.logstash.published_and_acked_events=166 libbeat.publisher.messages_in_worker_queues=166 libbeat.publisher.published_events=166
Sep 04 21:25:30 host docker[1624]: 2017/09/04 21:25:30.914018 metrics.go:39: INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=23 libbeat.logstash.publish.read_bytes=144 libbeat.logstash.publish.write_bytes=235214 libbeat.logstash.published_and_acked_events=2009 libbeat.publisher.messages_in_worker_queues=2009 libbeat.publisher.published_events=2009
Sep 04 21:26:00 host docker[1624]: 2017/09/04 21:26:00.914024 metrics.go:39: INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=8 libbeat.logstash.publish.read_bytes=60 libbeat.logstash.publish.write_bytes=177786 libbeat.logstash.published_and_acked_events=1644 libbeat.publisher.messages_in_worker_queues=1644 libbeat.publisher.published_events=1644
...
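For what it's worth, a quick way to total what journalbeat believes was acked over the test window is to sum the published_and_acked_events counter from these log lines and compare it to the JMeter total (a rough sketch; the journald unit and the time window are placeholders for our setup):

journalctl -u docker --since "2017-09-04 21:24:00" --until "2017-09-04 21:32:00" \
  | grep -o 'libbeat.logstash.published_and_acked_events=[0-9]*' \
  | awk -F= '{ total += $2 } END { print total, "events acked" }'

If that total ends up close to the ES document count rather than the JMeter count, the loss happens before or at logstash rather than between logstash and elasticsearch.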
The issue seems to reside with the logstash beats input plugin: with all filters disabled and the output set to stdout with the dots codec (one dot per event, so bytes/s ≈ events/s), the ingestion rate is as follows (an average of ~140 events/s):
/usr/share/logstash/bin/logstash "--path.settings" "/etc/logstash" | pv -Wbart > /dev/null
489KiB 1:39:14 [29.7 B/s] [141.1 B/s]
Stripped down logstash pipeline:
input {
  beats {
    port => "5043"
  }
}

filter {
  # if [type] == "journal" {
  #   json {
  #     source => "message"
  #     skip_on_invalid_json => true
  #   }
  # }
}

output {
  stdout {
    codec => dots
  }
}
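As a baseline, the same dots/pv trick can be run against a generator input to see what this Logstash instance can do without the beats input at all (a sketch; the event count and the temporary data path are arbitrary, and it assumes a throwaway instance can run next to the main one):

/usr/share/logstash/bin/logstash --path.data /tmp/ls-generator-test \
  -e 'input { generator { count => 1000000 } } output { stdout { codec => dots } }' \
  | pv -Wbart > /dev/null

If the generator sustains tens of thousands of events per second while the beats input stays around 140 events/s, that would point at the input side rather than at the pipeline workers.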
Logstash monitoring (x-pack) shows an average of 200 events/s for both received and emitted events during the perf tests.
I'm a bit confused, since libbeat shows a high rate of published events while logstash never ingests more than that quite low received rate.
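To cross-check the x-pack graphs I can also poll the node stats API directly during a run (assuming the default API binding on port 9600):

curl -s 'localhost:9600/_node/stats/pipeline?pretty'

The events.in / events.out counters there should grow at roughly the same ~200 events/s if the monitoring numbers are accurate.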
The first course of action was to tweak the journalbeat configuration:
output.logstash:
  bulk_max_size: 3000
  enabled: true
  hosts: ["${LOGSTASH_HOST}"]
  worker: 2
Adding bulk_max_size and increasing the number of workers did not show any significant variance in the results. The same was true when experimenting with the loadbalance option or when bypassing the AWS ELB fronting the logstash nodes (by specifying the node IPs directly); the load-balanced variant is sketched below.
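Roughly, that variant looked like this (the IPs are placeholders; the port matches the beats input above):

output.logstash:
  enabled: true
  hosts: ["10.0.0.11:5043", "10.0.0.12:5043"]
  loadbalance: true
  worker: 2
  bulk_max_size: 3000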
The second step was to tweak various tunables in the logstash configuration, to no avail:
pipeline:
  # These need to be changed to match the instance CPU count
  workers: 4
  output.workers: 2
  batch:
    size: 500
    delay: 5
My understanding is that the number of workers and output.workers affect filter and output, but not input. Increasing the batch size and decreasing the delay (which was 20) did not show better results.
Switching from the persistent queue to the default in-memory queue did not improve the ingestion rate either, which was expected since I/O is not an issue, as seen in the logstash server metrics during the perf test.
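For completeness, the queue switch is just the standard logstash.yml setting (sketch; the persisted-queue size shown is an arbitrary example):

queue.type: memory
# previously:
# queue.type: persisted
# queue.max_bytes: 4gb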
Resource-wise, CPU consumption is low on the logstash servers (~4%) and around ~40% on the servers hosting the microservice and running journalbeat.
Journalbeat is based on libbeat, so I am assuming that most of the advice related to filebeat might apply (options here: https://github.com/mheese/journalbeat/blob/master/etc/journalbeat.yml).
First, I'd like to confirm whether my analysis is correct: does it make sense that the bottleneck is the beats input here? The fact that libbeat logs a high number of published events matching the perf test throughput made me almost rule out journalbeat.
I'd be glad to know if the issue lies somewhere else.
Now, if the beats input is the culprit, what can be done about it?
I can't find any tunables for it in the configuration options; there is no way to increase threads or other performance parameters.
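One isolation test I have in mind is to point journalbeat at libbeat's console output instead of logstash and measure how fast it can publish on its own (a sketch; it assumes the console output is available in this journalbeat build and that it can be run in the foreground):

output.console:
  enabled: true

Counting the emitted lines with something like journalbeat ... | pv -lart > /dev/null should then give journalbeat's own ceiling; if that rate is far above 140 events/s, the slowdown is clearly on the logstash/beats-input side.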
Components versions:
journalbeat (5.5.0)
logstash (5.5.0)
logstash-input-beats (3.1.21)
Any ideas or suggestions regarding this would be greatly appreciated!
Thanks in advance,
G.