We've been using a single Filebeat instance as a listener for a good number of Juniper SRX firewalls (around 50) and it's been working really well.
We recently ran a test script that fires 10 firewall logs at an obscure port, and in Kibana we would only see 8, sometimes 6, sometimes 4 of them. We were missing logs.
I did some research and realized we hadn't set a read_buffer on the UDP input, so I tried read_buffer: 100MiB in filebeat.yml. The log volume spiked for a moment, then dropped back to the normal level.
So I added a big read_buffer of 1000MiB and, boom, we started seeing the actual volume of firewall logs -- but processing was too slow and couldn't catch up to real time.
I then doubled the buffer to 2000MiB (2 GB). We still saw the real volume of logs, but Filebeat couldn't keep up, so we removed the read_buffer until we can figure this out.
@stephenb Thanks for the reply and the warm welcome!
Tough to say the exact target EPS, but when the buffer was at 2 GB and the full load of logs was coming across, it was roughly 3.1k per second.
When it's "throttled" (with NO read_buffer), normal activity is roughly 800 per second (this seems to be the max that Filebeat, with our current config, can handle listening to UDP).
Average message size is a standard Juniper log, so about 1.3 KB.
I'll have to check on the number of CPUs, but like I said before, I do know we never reached any critical CPU usage (roughly 40%).
Here's the filebeat.yml
filebeat.inputs:
- type: udp
  max_message_size: 10MiB
  read_buffer: 1000MiB        # THIS IS WHERE I SET read_buffer: 1000MiB
  host: "0.0.0.0:514"
  fields:
    logzio_codec: plain
    token: XXXXXXXXXXXXXXXXXXXXXXXXXX
    type: udpsyslog
  fields_under_root: true
- type: tcp
  max_message_size: 10MiB
  host: "0.0.0.0:9000"
  fields:
    logzio_codec: plain
    token: XXXXXXXXXXXXXXXXXXXXXXXXX
    type: tcpsyslog
  fields_under_root: true

#For version 6.x and lower uncomment the line below and remove the line after it
#filebeat.registry_file: 'C:\ProgramData\Filebeat'
filebeat.registry.path: 'D:\filebeat'

#The following processors are to ensure compatibility with version 7
processors:
- rename:
    fields:
    - from: "agent"
      to: "beat_agent"
    ignore_missing: true
- rename:
    fields:
    - from: "log.file.path"
      to: "source"
    ignore_missing: true

############################# Output ##########################################
output:
  logstash:
    hosts: ["listener.logz.io:5015"]
    ssl:
      certificate_authorities: ['D:\filebeat\COMODORSADomainValidationSecureServerCA.crt']
NOTE - the TCP input isn't really being used and the other log sources are negligible.
It's going up to Logz.io, and I have enough visibility to know that ingestion isn't the issue. The listeners can handle 5 TB/day of traffic on this specific cluster -- we had the limit bumped way up so we could rule it out as a factor, and we have direct communication with the Ops team monitoring the log engines/listeners.
Super interested to hear your thoughts!
Feel free to DM me your email and I'll show my appreciation with an Amazon gift card (I'm that thankful, haha).
@stephenb Yeah -- we're not using logstash anywhere in the pipeline.
It's just Filebeat listening on UDP 514 and then sending over SSL straight up to Kafka, which load balances for our ES cluster. I'm positive that hasn't backed up so far.
Is there anything we can do in the Filebeat config to increase its capacity/capability?
I mean, increasing the read_buffer to 2 GB allowed us to see the real amount of logs coming in -- it seems Filebeat just couldn't keep up, processing slower than real time.
The ingestion pipeline/Kafka is managed by our Operations team and shouldn't have an issue keeping up with the load.
When we restarted Filebeat with no read_buffer (so it was sending the normal, throttled amount of logs), the listeners responded and ingested everything immediately. The issue seems to be UDP packet loss at the Filebeat end.
Does specifying workers tell Kafka to increase workers, or is that local to the Filebeat machine? Because I'm pretty positive that part of the pipeline is keeping up fine. I've reached out to the Ops team to triple-check that nothing was overstressed during those spikes.
Since we saw a significant improvement from increasing the read_buffer (even though it couldn't keep up), it seems like the slowdown is in Filebeat trying to handle the load. Though I'm definitely not the expert on the Filebeat part.
worker parallelizes Filebeat's writing... that may increase throughput.
Buffering does not increase throughput, it is for absorbing spikes / back pressure.
Trying to help here, but I'm a bit confused... You show a config that uses the Logstash output, then you tell me you are writing to Kafka, and then don't show me that config, so I'm not sure how I can help except guess.
worker: 4
should increase Filebeat write throughput to the output (Kafka, in your case), and that may help.
I am not saying that is the fix, but it's hard to tell when I am only getting partial info.
To be clear, I think what we are trying to do is increase the throughput of Filebeat: the combination of input, processing, and output.
It looks like there is little processing going on, so it's mostly about read and write; increasing the workers on the output (in this case to Kafka) can/may increase overall throughput.
I don't know whether you are using compression etc. or other parameters on the Kafka output.
I apologize -- I do not have access to the Kafka/log engine config. I'm waiting on a reply from my Ops team, but let's assume the issue is at the server/Filebeat level.
I will try the worker increase though -- that sounds like it may be the right solution.
Can you let me know if this is the correct formatting?
filebeat.inputs:
- type: udp
  max_message_size: 10MiB
  read_buffer: 1000MiB        # THIS IS WHERE I SET read_buffer: 1000MiB
  host: "0.0.0.0:514"
  fields:
    logzio_codec: plain
    token: XXXXXXXXXXXXXXXXXXXXXXXXXX
    type: udpsyslog
  fields_under_root: true
- type: tcp
  max_message_size: 10MiB
  host: "0.0.0.0:9000"
  fields:
    logzio_codec: plain
    token: XXXXXXXXXXXXXXXXXXXXXXXXX
    type: tcpsyslog
  fields_under_root: true

#For version 6.x and lower uncomment the line below and remove the line after it
#filebeat.registry_file: 'C:\ProgramData\Filebeat'
filebeat.registry.path: 'D:\filebeat'

#The following processors are to ensure compatibility with version 7
processors:
- rename:
    fields:
    - from: "agent"
      to: "beat_agent"
    ignore_missing: true
- rename:
    fields:
    - from: "log.file.path"
      to: "source"
    ignore_missing: true

############################# Output ##########################################
output:
  logstash:
    hosts: ["listener.logz.io:5015"]
    ssl:
      certificate_authorities: ['D:\filebeat\COMODORSADomainValidationSecureServerCA.crt']
    worker: 4
Just making sure the placement and formatting of the "worker: 4" are correct.
Also, if I'm increasing workers to 4, is each worker going to get its own 2 GB read_buffer, or should I keep/eliminate that part?
I am not looking for your Kafka config. I am not a Kafka config expert, so I cannot help debug that. You said you have plenty of capacity there; OK, good.
I am trying to focus on Filebeat, which was the topic of this discussion.
According to the filebeat.yml you just posted, you are sending the data to Logstash using the Logstash output -- that is my assumption. If you were sending from Filebeat directly to Kafka, I would expect to see the Filebeat Kafka output where you have the Logstash output.
This is what I am completely confused about.... But we will proceed.
This is what your config says... What comes after Logstash I don't know; I suspect it is Kafka since you keep mentioning it: UDP -> Filebeat -> Logstash -> probably Kafka -> probably Elasticsearch.
Of course, it says logz.io, so I have no clue about the internals.
So back to my very first post: it's possible Logstash is backing up, but we will proceed assuming it is not... I suspect Filebeat is just not writing fast enough.
No, don't add more buffer right now; change one parameter at a time. It has already taken a lot of typing just to get to trying one parameter.
Just start by simply adding one parameter... Most people would format it as shown below. We might want to try some other settings, but this is the first one I would try.
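Roughly like this, reusing the output section you already posted (a sketch of the placement, not your exact file -- worker sits at the same level as hosts and ssl under the logstash output):

output:
  logstash:
    hosts: ["listener.logz.io:5015"]
    worker: 4     # number of parallel connections Filebeat opens to this output
    ssl:
      certificate_authorities: ['D:\filebeat\COMODORSADomainValidationSecureServerCA.crt']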
I was scratching my head last night thinking... this feels I/O bound. It was still bugging me when I woke up, so I ran a quick test.
I suspect you may have back pressure somewhere in your system that is slowing Filebeat's writing down.
I think you should be able to get 3K/sec without any major tuning... beyond that you might need to add the workers etc. You do have a little processing going on with logzio_codec: plain, but as you said, you do not look CPU bound.
I just ran this... the most basic Filebeat setup (yes, TCP, since it's easier to netcat, but UDP should be basically the same):
stream logfile to TCP > filebeat > logstash > elasticsearch
My log lines are about 400 B (so a little smaller than yours) and I am getting a sustained rate greater than 5K events/sec. That is without any tuning; you should be able to get your 3K/s.
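For reference, that test config is essentially just a TCP input writing to the Logstash output, something along these lines (a minimal sketch with placeholder host/port values, not the exact file I used):

filebeat.inputs:
- type: tcp
  max_message_size: 10MiB
  host: "0.0.0.0:9000"          # netcat the log file into this port

output:
  logstash:
    hosts: ["localhost:5044"]   # local Logstash in front of Elasticsearch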
I'm still waiting on confirmation that it's not an issue further down the pipeline.
What convinces me that it's in the Filebeat part of the equation is that when I increased the read_buffer to 2000MiB, we started seeing the actual amount of logs. That tells me something in the processing of the logs at the Filebeat level is holding them up, and that the memory buffer increased the capacity temporarily, but it couldn't keep up over time.
I'll report back once I have my contact add the 4 workers and we can test it.
Just my perspective from a distance; I am often wrong...
I could possibly interpret that chart differently...
That chart shows the logs that made it to Elasticsearch, with timestamps (important). So when the buffer increased, more logs made it into Filebeat, and eventually they made it to Elasticsearch with the proper timestamps. More made it through for a certain period, which gives the appearance that throughput went up. Once that queue was full it started dropping messages again, so the queue just let some extra messages get processed, but the actual throughput rate did not really change. It can look misleading, IMHO. Remember, Discover is timestamp driven, and the timestamp is most likely the originating timestamp set by the codec.
It's like a bucket of water being filled faster than it can drain: it can only drain at a certain rate, but it's filling faster and overflowing. The bigger buffer made the bucket bigger but did not change the fill/drain rates, so as soon as it was full it started dropping again. For the short time the queue was filling up and not dropping, those logs and their timestamps made it to Elasticsearch, so that timeframe got a full set of logs, making it look like throughput went up.
Perhaps.. Perhaps Not.
An easy way to test: put a little Elasticsearch right on that Filebeat box and pump the events into it.
Just download, unzip, and run; it takes about 5 minutes... don't change anything else... I bet it will fly.
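In practice that just means pointing Filebeat's output at the local instance, roughly like this (a sketch; localhost:9200 assumes a default, out-of-the-box Elasticsearch on the same box):

output:
  elasticsearch:
    hosts: ["http://localhost:9200"]   # local test Elasticsearch on the Filebeat box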
About the chart I showed -- that's what was shocking to me. Once I opened up the buffer, those were properly timestamped logs from Juniper getting indexed, so that's why I'm under the impression they're the actual load of logs we're supposed to be seeing.
Yeah, it can be confusing, and I'm not saying I'm 100% right, but I've seen those kinds of spikes before and scratched my head.
In the end I believe this is a throughput issue... it may not be -- I don't know what that codec does -- but if it were super CPU intensive, I think you would see your CPU spiking.
I was able to connect with the team that runs the Filebeat and the Junipers today and ran some tests. We tried increasing workers to 4 and then to 7 and saw no change. We also saw no change in CPU/memory utilization.
Then I tried to see whether the additional workers would help in combination with an increased read_buffer of 1000MiB and 2000MiB. There was no change there either.
I also had our Operations team confirm that the issue was NOT ingestion; we never even saw spikes in our pipeline.
The only change we saw was that, with the read_buffer increased, we again saw the actual amount of logs, and then Filebeat wasn't able to keep up.
We also ran Wireshark on the server to compare how many logs were coming in vs. getting to Kibana for a specific firewall: the server saw 1.7 million logs from FIREWALL1 over an hour, and we only indexed 560k for that same period. So we've officially confirmed the drop.
The issue definitely appears to be up front, with how much Filebeat can LISTEN to and PROCESS. It seems like increasing the workers increases the network output capacity:
Right now there is no way to scale that input, generally that is not an issue.
Let's isolate. Can you please try this...
Notice I took out the codec, the buffer, the queues, everything... this simply reads from UDP and writes to a file. Please try it and report back.
It's a little imprecise, but you should be able to run this and see how fast it is reading/writing...
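The shape is roughly this (a minimal sketch reconstructed from the description here, not the exact file I ran; the UDP port is your 514, the output path/filename match the wc output below, and rotate_every_kb is a guess to roughly match the ~1 GB files shown):

filebeat.inputs:
- type: udp
  max_message_size: 10MiB
  host: "0.0.0.0:514"

output:
  file:
    path: "/tmp"
    filename: filebeat.out
    rotate_every_kb: 1024000   # assumption: rotate around 1 GB, matching the file sizes below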
wc -l /tmp/filebeat.out*
This read/wrote about ~6.8M records in a couple of minutes, which should give you an idea of the throughput... this was with TCP. I don't have an easy way to generate UDP, and perhaps that is the difference/issue.
-rw------- 1 sbrown staff 1048575521 Mar 17 15:28 filebeat.out.4
-rw------- 1 sbrown staff 1048575526 Mar 17 15:29 filebeat.out.3
-rw------- 1 sbrown staff 1048575457 Mar 17 15:29 filebeat.out.2
-rw------- 1 sbrown staff 1048575812 Mar 17 15:30 filebeat.out.1
-rw------- 1 sbrown staff 484158849 Mar 17 15:30 filebeat.out
wc -l /tmp/filebeat.out*
712382 filebeat.out
1545012 filebeat.out.1
1545950 filebeat.out.2
1543683 filebeat.out.3
1547182 filebeat.out.4
6894209 total
If that looks like it is writing fast, add the fields and codec etc. back and we shall see what is next.