Logstash beats input: super low ingestion rate

Hello,

Hope y'all are having a good day!
I really need some assistance with some performance issues in our logging pipeline.

In a nutshell, we have this common setup:
[journalbeat] > [logstash] > [elasticsearch]

While doing performance tests, there is a huge gap between the total number of HTTP requests logged by journalbeat and the number that actually makes it into Elasticsearch: only ~5% of the requests show up in ES. You would think such a huge loss would be obvious to spot...

The simple performance test follows:
[jmeter] > [microservice] > [journalbeat] > [logstash]

When sending more than 700,000 HTTP requests in 6:30 minutes (throughput ~1800 req/s), Kibana only reports around 30,000 requests.

Once the perf test starts, the journalbeat logs show a higher rate of libbeat published_events, which kinda matches the test throughput:

Sep 04 21:25:00 host docker[1624]: 2017/09/04 21:25:00.914012 metrics.go:39: INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=22 libbeat.logstash.publish.read_bytes=132 libbeat.logstash.publish.write_bytes=46565 libbeat.logstash.published_and_acked_events=166 libbeat.publisher.messages_in_worker_queues=166 libbeat.publisher.published_events=166
Sep 04 21:25:30 host docker[1624]: 2017/09/04 21:25:30.914018 metrics.go:39: INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=23 libbeat.logstash.publish.read_bytes=144 libbeat.logstash.publish.write_bytes=235214 libbeat.logstash.published_and_acked_events=2009 libbeat.publisher.messages_in_worker_queues=2009 libbeat.publisher.published_events=2009
Sep 04 21:26:00 host docker[1624]: 2017/09/04 21:26:00.914024 metrics.go:39: INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=8 libbeat.logstash.publish.read_bytes=60 libbeat.logstash.publish.write_bytes=177786 libbeat.logstash.published_and_acked_events=1644 libbeat.publisher.messages_in_worker_queues=1644 libbeat.publisher.published_events=1644
...

The issue seems to reside in the Logstash beats input plugin. With all filters disabled and the output set to stdout with the dots codec (each event prints a single dot, so bytes/s ≈ events/s), the ingestion rate is as follows (average of ~140 events/s):

/usr/share/logstash/bin/logstash "--path.settings" "/etc/logstash" | pv -Wbart > /dev/null
489KiB 1:39:14 [29.7 B/s] [141.1 B/s]

Stripped-down Logstash pipeline:

input {
    beats {
        port => "5043"
    }
}
filter {
#    if [type] == "journal" {
#        json {
#            source => "message"
#            skip_on_invalid_json => true
#        }
#    }
}

output {
    stdout {
        codec => dots
    }
}

Logstash monitoring (X-Pack) shows an average of 200 events/s for both received and emitted events during the perf tests.

I'm a bit confused, since libbeat shows a high rate of published events while Logstash never ingests more than the received rate, which is quite low.

The first course of action was to tweak the journalbeat configuration:

output.logstash:
  bulk_max_size: 3000
  enabled: true
  hosts: ["${LOGSTASH_HOST}"]
  worker: 2

Adding bulk_max_size and increasing the number of workers did not produce any significant change in the results. The same goes for experimenting with the loadbalance option or bypassing the AWS ELB fronting the Logstash nodes (by specifying the node IPs directly).
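
For reference, the loadbalance/direct-hosts variant I experimented with looked roughly like this (the node IPs and port below are just placeholders):

output.logstash:
  enabled: true
  # bypass the ELB by listing the logstash nodes directly (placeholder IPs/port)
  hosts: ["10.0.0.11:5043", "10.0.0.12:5043"]
  loadbalance: true
  bulk_max_size: 3000
  worker: 2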

The second step was to tweak various tunables in the Logstash configuration, to no avail:

pipeline:
  # These need to be changed to match the instance CPU count
  workers: 4
  output.workers: 2
  batch:
    size: 500
    delay: 5

My understanding is that workers and output.workers affect the filter and output stages, but not the input. Increasing the batch size and decreasing the delay (which was 20) did not yield better results.

Switching from the persistent queue back to the default in-memory queue did not improve the ingestion rate either, which was expected since I/O is not an issue according to the Logstash servers' metrics during the perf test.
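
For completeness, that queue switch is just the standard logstash.yml setting; a sketch of what I toggled:

# logstash.yml: back to the default in-memory queue
queue.type: memory
# previously we were running with the persistent queue:
# queue.type: persisted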

Resource-wise, CPU consumption is low on the Logstash servers (~4%) and around 40% on the servers hosting the microservice and running journalbeat.

Journalbeat is based on libbeat, so I am assuming that most of the advice related to filebeat applies (options here: https://github.com/mheese/journalbeat/blob/master/etc/journalbeat.yml).

First, I'd like to confirm whether my analysis is correct: does it make sense that the beats input is the bottleneck here? The fact that libbeat logs a high number of published events, matching the perf test throughput, made me almost rule out journalbeat.

I'd be glad to know if the issue lies somewhere else.
Now, if the beats input is the culprit, what can be done about it?
I can't find any relevant tunable in its configuration options: there is no way to increase threads or other performance parameters.

Component versions:

journalbeat (5.5.0)
logstash (5.5.0)
logstash-input-beats (3.1.21)

Any ideas or suggestions regarding this would be greatly appreciated!

Thanks in advance,
G.

First, I have no experience with journalbeat. Skimming the code, journalbeat has some fairly complicated async ACK handling, but it enables infinite retry, which makes me wonder whether journalbeat ends up buffering most messages. Have you checked the CPU and memory usage of journalbeat?

When configuring multiple workers in the logstash output, you must also enable loadbalance. Journalbeat publishes single events to libbeat. The internal libbeat queue size can be adjusted via queue_size, but this setting should not really affect throughput.

In the logstash output you can also try enabling pipelining: 5.

e.g.

output.logstash:
  bulk_max_size: ...
  hosts: ...
  worker: 2
  loadbalance: true
  pipelining: 5
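
For reference, queue_size is a top-level libbeat setting (assuming journalbeat exposes the standard libbeat general settings the way filebeat 5.x does):

# top-level libbeat setting, default is 1000 (assuming journalbeat exposes it)
queue_size: 2000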

Have you checked the journalbeat logs? Also try running journalbeat in debug mode. The beats->logstash connection employs a slow start: the send window starts at 10 events and grows exponentially up to bulk_max_size. If Logstash returns errors early on, you might actually end up with very small batches being transmitted.
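
Assuming journalbeat exposes the standard libbeat logging options, something like this in the config should surface the relevant debug output:

logging:
  level: debug
  # limit debug output to the publishing path (assuming the standard libbeat selectors)
  selectors: ["logstash", "publish"]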

The event rate you are reporting seems super low. It's true that logstash-input-beats slows down processing, but I would suspect the network is the main problem here, meaning stability (connection loss) and overall throughput (AWS applying some TCP rate limiting). That makes me wonder: what happens if you run Logstash with the dots codec on the same host journalbeat is running on?
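
That is, temporarily pointing the journalbeat output at a Logstash instance on the same box, something like:

output.logstash:
  # local logstash running the stripped-down dots/stdout pipeline from above
  hosts: ["localhost:5043"]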
