I have multiple machines in my environment (let's say 100).
On each of those machines I configured filebeat and metricbeat to send data to Kafka.
Each machine has 2 dedicated topics in kafka (one for metricbeat and one for filebeat), so in total I have 100 * 2 = 200 topics in kafka.
I configured logstash to consume the data from kafka. I have two different pipelines (one for topics that start with metricbeat-* and another for topics that start with filebeat-*).
The filebeat data I'm passing to s3 and the metricbeat data I'm passing to ES.
When I configure the s3 output I also need to configure temporary_directory for the files. However, when I work with multiple logstash nodes (scale up) I see that only one of the logstash nodes has content in the temporary_directory and the second has an empty dir. That node doesn't print anything in the log after the first offset adjustment ("Setting offset for partition ..."), while the other node keeps working.
In s3 I see that only one file is uploaded per minute (I set time_interval to 1), while I expected to get 2 files per minute. The logstash nodes are identical in terms of configuration.
So the open questions are:
1. I have two logstash nodes but only one of them is working and uploading files into s3. How can I benefit from scaling logstash?
2. If I manage to use both logstash nodes, there is a chance that the following scenario will happen:
Logstash 1 contains all the even lines of a file
Logstash 2 contains all odd lines of a file
and they upload two different files to s3 that need to be merged and reordered.
How can I avoid the reordering?
Yes, your second Logstash instance is idle because there is only 1 partition per topic in your Kafka, so only one Logstash instance consumes from it. When you have multiple consumers in a consumer group, Kafka assigns each partition to exactly one consumer in the group.
You are also specifying consumer_threads as 5, so even then one thread is probably getting all the events from Kafka, because there is only one partition, while the other 4 threads (on a single Logstash instance) just sit idle.
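To make the idle-node behaviour concrete, here is a rough Python sketch of a per-topic, range-style assignment. It is only a model, assuming your consumers use a range-style strategy (which, as far as I know, is the Kafka consumer default); the function name, the member names `logstash-1`/`logstash-2`, and the topic names are made up for illustration and are not taken from your setup.

```python
from collections import defaultdict


def range_assign(consumers, partitions_per_topic):
    """Simplified model of a per-topic range assignment (not Kafka's real code).

    For every topic, the partitions are split into contiguous ranges over the
    consumers sorted by name; the first consumers receive any remainder.
    """
    assignment = defaultdict(list)
    members = sorted(consumers)
    for topic, n_partitions in sorted(partitions_per_topic.items()):
        per_member, extra = divmod(n_partitions, len(members))
        start = 0
        for i, member in enumerate(members):
            count = per_member + (1 if i < extra else 0)
            for partition in range(start, start + count):
                assignment[member].append((topic, partition))
            start += count
    return {member: assignment[member] for member in members}


nodes = ["logstash-1", "logstash-2"]  # hypothetical member names

# 100 hypothetical filebeat topics with ONE partition each (your current setup).
one_partition = {f"filebeat-host{i:03d}": 1 for i in range(100)}
current = range_assign(nodes, one_partition)

# The same topics with TWO partitions each.
two_partitions = {topic: 2 for topic in one_partition}
scaled = range_assign(nodes, two_partitions)

for label, result in (("1 partition/topic", current), ("2 partitions/topic", scaled)):
    print(label, {member: len(parts) for member, parts in result.items()})
```

In this model the single partition of every topic always lands on the first member, so the second node gets nothing no matter how many topics there are. With two partitions per topic, each node gets one partition of every topic.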
I am not sure what your overall throughput expectations are but you could do either of the following depending on your use case:
Have multiple threads on a single Logstash instance and increase the number of partitions per topic to match the number of threads you create.
Have one thread per Logstash instance and two Logstash instances (or however many you want), but again match the number of partitions to the total number of Logstash instances.
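Both options come down to the same arithmetic: for a single topic, no more consumer threads can be busy than there are partitions. A small Python sketch of that rule follows; `thread_usage` is a hypothetical helper for illustration, not part of Logstash or Kafka, and it assumes all threads are in the same consumer group and subscribed to one topic.

```python
def thread_usage(partitions, instances, threads_per_instance):
    """Return (busy, idle) consumer threads for ONE topic in one consumer group.

    Each partition is read by at most one consumer thread in the group, so the
    number of busy threads is capped by the number of partitions.
    """
    total_threads = instances * threads_per_instance
    busy = min(partitions, total_threads)
    return busy, total_threads - busy


print(thread_usage(1, 1, 5))  # current setup: (1, 4), four threads idle
print(thread_usage(5, 1, 5))  # first option: (5, 0)
print(thread_usage(2, 2, 1))  # second option: (2, 0)
```

Adding partitions beyond the total thread count is harmless in this model (some threads simply read more than one partition), whereas adding threads beyond the partition count only creates idle threads.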
See the description section on this link which explains more. There is also a link to a kafka page which has more info if you want to dig deeper.
Also, unless you are doing any log parsing/transformation in Logstash before shipping to S3 (which does not appear to be the case from your Logstash pipeline), there is a Kafka-to-S3 connector available, assuming you are all wired up with the Confluent platform.
First of all, thank you for the clear explanation.
I have a few more questions related to your answer:
1. If I want to gain as much as possible from my current environment, isn't it best to configure multiple threads in both of my logstash instances? Just as an example, let's assume I have 200 topics and each topic has exactly one partition, for a total of 200 partitions. As a best practice, shouldn't I divide those 200 partitions between my 2 logstash nodes? That is 100 for each logstash node, which means 100 consumer_threads per logstash instance. Maybe I'll need fewer than 100 consumer threads, or my machine won't be able to handle 100 threads and 100 is overkill, but is this the general idea?
2. I want to keep the order of the data that is consumed from kafka (by multiple logstash nodes). Therefore, I have 1 partition for every topic. Once my logstash thread is consuming from a specific topic, is there any chance that a different logstash thread will consume from the same topic? What happens if my logstash machine goes down? Can a new consumer consume that topic?