It is difficult to debug performance remotely in multi-step pipelines. The bottleneck can be in different places, or can be the sum of multiple smaller bottlenecks.
Let's take a step back and select an architecture first. Once we have settled on the architecture, let's split the pipeline apart, define some tests, tune each stage in isolation, and finally put everything back together.
Redis vs. Kafka:
When using Redis, you can collect events using Logstash only. When using Kafka, you can collect events using Filebeat or Logstash.
Sizing requirements for Redis (memory) and Kafka (disk) will be somewhat different. When using Redis, events are stored in a list in main memory by default. Filebeat can also load-balance across multiple Redis instances.
When using Kafka, events are stored on disk, and the retention policy determines when events are evicted. Kafka keeps events on disk even if all of them have already been consumed; cleaning up disk space is entirely subject to the configured retention policy. If your disks unexpectedly run full, this is not due to slow consumers, but due to the sizing and/or retention policy not matching your storage requirements. Conversely, if consumers are always behind and never get a chance to catch up, events will be lost to the retention policy.
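As a sketch of how retention is controlled with the stock Kafka tooling (the topic name `logs` and both limits are placeholder values, not recommendations):

```shell
# Set a 7-day time-based retention and a 50 GiB size cap on topic 'logs'.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name logs --alter \
  --add-config retention.ms=604800000,retention.bytes=53687091200

# Verify what is currently configured for the topic.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name logs --describe
```

Whichever of `retention.ms` or `retention.bytes` is hit first triggers eviction, so size both against your expected event rate.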
When using Kafka, the number of partitions (set at topic creation time) determines how much horizontal scaling you can have. When increasing the number of partitions, one also wants multiple Kafka brokers and replication enabled (it is disabled by default). Without replication the whole system can become blocked if a broker goes down.
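As a sketch of topic creation with those knobs (topic name, partition count, and replication factor are placeholders; a replication factor of 2 requires at least two brokers):

```shell
# Create a test topic with 8 partitions, replicated across 2 brokers.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic logs-test --partitions 8 --replication-factor 2
```

Remember that partitions can be increased later, but doing so changes key-to-partition mapping and, for the tests below, requires prefilling the topic again.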
For the rest of the discussion I'm assuming that we are settling on Kafka. What is nice about Kafka is that it somewhat decouples the original FB from the consumer FB/Logstash. When testing/tuning for throughput, one can use a pre-filled topic and just change the name of the consumer group (by default Kafka removes a consumer group's state after 7 days), or reset the 'offset' to oldest in Filebeat/Logstash.
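A minimal sketch of that trick on the Filebeat side (broker address, topic, and group id are placeholders; `initial_offset` only applies when the group has no committed offset yet):

```yaml
filebeat.inputs:
  - type: kafka
    hosts: ["kafka1:9092"]
    topics: ["logs-test"]
    # Pick a fresh group_id per benchmark run to re-read the pre-filled
    # topic from the beginning.
    group_id: "bench-run-01"
    initial_offset: oldest
```

In Logstash the equivalent is a new `group_id` plus `auto_offset_reset => "earliest"` on the kafka input.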
Next I would run some tests to get an idea of the performance of the individual components:
- filebeat -> stdout: get a baseline of what the input system is capable of
- filebeat -> kafka: does network + Kafka still allow high enough rates?
- kafka -> filebeat -> stdout: how fast can we consume and re-encode events to JSON?
- kafka -> logstash -> stdout with JSON codec: same, but let's see if LS allows for higher rates (it uses a different Kafka client)
- (optional) FB/Logstash -> ES: see if the machines we run the collecting FB/LS instances on impact performance as well
- kafka -> FB/LS -> ES
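For the Logstash variant (test 4), a minimal pipeline could look like the sketch below (broker address, topic, and group id are placeholder values):

```
input {
  kafka {
    bootstrap_servers => "kafka1:9092"
    topics => ["logs-test"]
    group_id => "bench-ls"
    auto_offset_reset => "earliest"
  }
}
output {
  stdout { codec => json_lines }
}
```

Run it the same way as the stdout tests, e.g. `bin/logstash -f test.conf | pv -Warl > /dev/null`, so the measured rates stay comparable.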
It seems like you can already push fast enough to Kafka, so there is no need to run the first 2 tests. But it's still nice to have them in case we want to modify the Kafka server configuration (e.g. increase partitions, add brokers, configure replication, require a minimum number of ACKs within the cluster, ...).
For tests 1 and 2 it is best to use an already-written sample log file. Do not test with a live log file, as writes to the active file can bias the outcome. Also remove the data directory (especially the registry) between runs, so Filebeat starts collecting from the beginning.
For test 1 run `./filebeat -c test-config.yml | pv -Warl > /dev/null`. Filebeat will create JSON events, one per line, and the pv tool will show you the current and average rate of lines being written by Filebeat.
For test 2, configure the kafka output and collect metrics from the HTTP API or via the monitoring support in Filebeat.
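As a sketch of how to expose those metrics (port 5066 is Filebeat's conventional default for this endpoint):

```yaml
# Expose Filebeat's internal metrics over a local HTTP endpoint.
http.enabled: true
http.host: localhost
http.port: 5066
```

You can then poll e.g. `curl -s http://localhost:5066/stats` and watch the `libbeat.output.events` counters between samples to derive the publish rate.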
For the upcoming tests, use Filebeat (from test 2) to prefill a Kafka topic. We don't need to run the other tests with logs being actively published. This also lets us try to tune for even higher event rates (logs written by your CISCO setup) than we are currently seeing. When changing the number of partitions on your test topic, you need to prefill it again.
Test 3 (kafka -> FB -> stdout): Filebeat uses end-to-end ACKs. Only after the output has ACKed an event will Filebeat advance the read offset in Kafka (a dedicated API call). This test exercises kafka -> FB kafka input -> JSON parsing -> batching -> JSON encoding, plus the latency induced by the end-to-end ACK. Use the same command as for the first test, but with the kafka input configured. Tuning potential: the number of partitions (e.g. up to one per physical disk per broker), the number of consumers within a consumer group (e.g. multiple Beats or multiple kafka inputs), and the fetch.default setting in the kafka input.
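A minimal sketch of the fetch tuning on the kafka input (all values, including the fetch size, are placeholders to experiment with, not recommendations):

```yaml
filebeat.inputs:
  - type: kafka
    hosts: ["kafka1:9092"]
    topics: ["logs-test"]
    group_id: "bench-fb"
    # Default number of bytes the broker may return per fetch request;
    # raising it can help when events are small and numerous.
    fetch.default: 1048576
```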
Test 4 is what you already did. I'm just not sure which codec you used when pushing to /dev/null (for complex events, JSON encoding can add quite some CPU overhead).
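To get a feel for how much raw JSON encoding alone can cost, here is a small self-contained Python sketch (the event shape is made up and only loosely resembles a Beats event; absolute numbers will differ from the Go/JRuby encoders, but the relative effect of event complexity shows up the same way):

```python
import json
import time

# Hypothetical event, loosely shaped like a Filebeat log event.
event = {
    "@timestamp": "2021-01-01T00:00:00.000Z",
    "message": "interface GigabitEthernet0/1 changed state to up " * 4,
    "host": {"name": "edge-01"},
    "fields": {"env": "bench", "site": "dc-1"},
}

n = 50_000
start = time.perf_counter()
for _ in range(n):
    json.dumps(event)
elapsed = time.perf_counter() - start
print(f"encoded {n / elapsed:,.0f} events/s")
```

Try doubling the nesting depth or message size and watch the rate drop; that is the overhead a plain-text codec avoids.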
Choose Filebeat or Logstash based on the outcome of tests 3 and 4. If both are fast enough, we will select FB or Logstash depending on further testing. When comparing performance, keep in mind that FB has full end-to-end ACKs, which LS does not have.
Test 5 is somewhat interesting. You mentioned that directly pushing from the edge machine to ES gets you high throughput, but pushing from the 'collector' Filebeat/Logstash (reading from Kafka) kills throughput. This makes me wonder whether the machine type, the network setup, or just the FB/Logstash output configuration on the 'collector' machine plays a role here.
E.g. prepare a log file to collect from and push directly to ES from these machines. Check whether you can tune the output. For Filebeat we can tune output.elasticsearch.workers. To keep the outputs 'saturated', select queue.mem.events >= output.elasticsearch.bulk_max_size * output.elasticsearch.workers. Quotas or rate limits in your network setup can also impact throughput. Check whether output.elasticsearch.compression_level has an impact as well (compression is disabled by default).
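A hedged sketch of those knobs together (all values are placeholders to size against your hardware; note the worker option appears as `worker` in Filebeat's reference config):

```yaml
# Keep the queue large enough to feed all workers with full batches:
# 16384 >= 4096 (bulk_max_size) * 4 (workers).
queue.mem.events: 16384

output.elasticsearch:
  hosts: ["https://es1:9200"]
  worker: 4                # concurrent bulk requests per host
  bulk_max_size: 4096      # events per bulk request
  compression_level: 0     # try 1-9 if the network is the bottleneck
```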
Test 6: set up and check the complete 'collector' pipeline using the sample Kafka topic you have prepared.
If your throughput is good enough after test 6, you can put everything back together.
In case you change the number of Kafka partitions, add more brokers, or enable replication (or change the ACK policy), also re-run test 2, as these changes can impact the initial Filebeat -> Kafka publishing as well.
When testing, please share configurations and metrics with us; the more information we have, the better a picture we can build of your setup.