Filebeat as a UDP Syslog Listener Dropping a Lot of Logs

So we've been using a single Filebeat instance as a listener for a good number of Juniper SRX firewalls (50 or so) and it's been working really well.

We recently did a test and ran a script that fires 10 firewall logs at an obscure port -- and we noticed in Kibana that we would only see 8... sometimes 6... sometimes 4... we were missing logs.

So I did some research and figured out that we hadn't set a read_buffer on the UDP input, so I tried read_buffer: 100MiB in the filebeat.yml and the logs spiked for a moment -- then it went back to normal volume.

So I added a BIG read_buffer of 1000MiB and boom, we started seeing the ACTUAL amount of firewall logs -- but it was processing too slowly and couldn't catch up to real-time.

So I doubled the buffer to 2000MiB (2 GB) and we still kept seeing the real volume of logs, but it couldn't keep up, so we had to remove the read_buffer until we can figure this out.
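
For reference, the relevant piece of the UDP input looked roughly like this while we were experimenting (same settings as in the full filebeat.yml further down, with the read_buffer values we tried):

filebeat.inputs:
- type: udp
  host: "0.0.0.0:514"
  max_message_size: 10MiB
  # OS receive buffer for the UDP socket -- we tried 100MiB, 1000MiB, and 2000MiB here
  read_buffer: 1000MiB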

See image:

Filebeat is running on Windows Server 2016 with a lot of RAM and 2 TB of disk space. Memory never went above 63% and CPU never above 45%.

Any ideas are VERY appreciated, as this is time-sensitive.

Edit: It should be noted that we confirmed via Wireshark that all of the logs we were looking for were reaching the Filebeat server.

Hi @sudont, welcome to the community.

Can you post your full filebeat.yml?

There are a number of parameters that can be tuned; just adding a queue will probably not solve the issue, since that does not affect throughput.

What is the target events / sec?

What is the average message size? Are you really processing 10 MB messages?

What is the number of CPUs?

Also, are you going straight to Elasticsearch, or are there other components?

Even though certain resources are not maxed out, you may need to scale horizontally, but let's take a look first.

@stephenb Thanks for the reply and the warm welcome!

Tough to say the exact target EPS -- but I can say that when the buffer was at 2 GB and the full load of logs was coming across, it was roughly 3.1k per second.

When it's "throttled" (with NO read_buffer), normal activity is roughly 800 per second (this seems to be the max that Filebeat, with our bad config, can handle listening on UDP).

Average message size is a standard Juniper log, so about 1.3 KB.

I'll have to check on the number of CPUs, though like I said before, I do know that we never reached any critical CPU usage (40%ish).

Here's the filebeat.yml

filebeat.inputs:
- type: udp
  max_message_size: 10MiB
  read_buffer: 1000MiB  # this is where I set the read_buffer
  host: "0.0.0.0:514"
  fields:
    logzio_codec: plain
    token: XXXXXXXXXXXXXXXXXXXXXXXXXX
    type: udpsyslog
  fields_under_root: true
- type: tcp
  max_message_size: 10MiB
  host: "0.0.0.0:9000"
  fields:
    logzio_codec: plain
    token: XXXXXXXXXXXXXXXXXXXXXXXXX
    type: tcpsyslog
  fields_under_root: true
#For version 6.x and lower uncomment the line below and remove the line after it 
#filebeat.registry_file: 'C:\ProgramData\Filebeat' 
filebeat.registry.path: 'D:\filebeat'
#The following processors are to ensure compatibility with version 7
processors:
- rename:
    fields:
     - from: "agent"
       to: "beat_agent"
    ignore_missing: true
- rename:
    fields:
     - from: "log.file.path"
       to: "source"
    ignore_missing: true
############################# Output ##########################################
output:
  logstash:
    hosts: ["listener.logz.io:5015"]
    ssl:
      certificate_authorities: ['D:\filebeat\COMODORSADomainValidationSecureServerCA.crt']

NOTE - the TCP input isn't really being used and the other log sources are negligible.

It's going up to Logz.io and I have direct visibility to know that ingestion isn't an issue. The listeners can handle 5 TB/day of traffic on this specific cluster -- we had it bumped way up so we could rule that out as a factor, and we have direct communication with the Ops team monitoring the log engines/listeners.

Super interested to hear your thoughts!

Feel free to DM me your email and I'll show my appreciation with an Amazon gift card (I'm that thankful, haha).

And you're sure your Logstash is not backing up?

You can always test by sending directly to an Elasticsearch endpoint to eliminate variables.

The first thing I would try is upping the worker setting on the Logstash output. Don't over-read the docs description.

worker: 4

I understand CPU is not maxed out, but that is not always conclusive; sometimes it's because you are not putting it all to work.

@stephenb Thanks -- we're not using Logstash anywhere in the pipeline.

It's just Filebeat listening on UDP 514 and then sending over SSL straight up to Kafka, which load balances for our ES cluster. Positive that has not backed up so far.

Is there anything we can do in the Filebeat config to increase its capacity/capability?

I mean, increasing the read_buffer to 2 GB allowed us to process the real amount of logs that are coming in -- it seems it just can't keep up; Filebeat was processing slower than real-time.

I just actually read the link -- are those workers something we can add in the Filebeat output?

Ok confused.

The config you just posted:

output:
  logstash:
    hosts: ["listener.logz.io:5015"]
    ssl:
      certificate_authorities: ....

If you are using Kafka, can you post your actual config?

Yes, worker is supported for the Kafka output; try setting

worker: 4
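
For reference, if Filebeat were writing directly to Kafka, a minimal sketch of that output with workers might look like the following (the broker hosts and topic here are placeholders, not your actual values):

output.kafka:
  # placeholder broker addresses and topic -- adjust to your environment
  hosts: ["kafka-broker1:9092", "kafka-broker2:9092"]
  topic: "srx-firewall-logs"
  # number of concurrent workers publishing events to Kafka
  worker: 4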

The ingestion pipeline/Kafka is managed by our Operations team and shouldn't have an issue keeping up with the load.

When we restarted Filebeat with no read_buffer (so it was sending the normal, throttled amount of logs), the listeners responded and ingested everything immediately. The issue seems to be UDP packet loss at the Filebeat processing stage.

Is specifying the workers telling Kafka to increase workers? Or is that local on the Filebeat machine? Because I'm pretty positive that part of the pipeline is keeping up fine. I reached out to the Ops team to triple-check that nothing was overstressed during those spikes.

Since we saw a significant improvement increasing the read_buffer (but it couldn't keep up), it seems like the slowdown was in Filebeat trying to handle the load. Though -- I'm definitely not the expert on the Filebeat part.

Worker parallelizes Filebeat's writing... That may increase throughput.

Buffering does not increase throughput; it is for absorbing spikes / back pressure.

Trying to help here, but a bit confused... You show a config that says Logstash, then you tell me you are writing to Kafka, and then don't show me that config -- not sure how I can help but guess.

worker: 4

Should increase Filebeat write throughput to Kafka; that may help.

I am not saying that is the fix, but hard to tell when I am only getting partial info.

To be clear, I think we are trying to increase the throughput of Filebeat, which is a combination of input, processing, and output.

It looks like there is little processing, so it's mostly about read and write; increasing the workers on the output (in this case to Kafka) may increase overall throughput.

I don't know if you are compressing, etc., or using other parameters in the Kafka output.

I apologize -- I do not have access to the Kafka/Log Engine config. I'm waiting on a reply from my Ops team, but let's assume the issue is at the server/Filebeat level.

I will try the worker increase though -- that sounds like it may be the right solution.

Can you let me know if this is the correct formatting?

filebeat.inputs:
- type: udp
  max_message_size: 10MiB
  read_buffer: 1000MiB  # this is where I set the read_buffer
  host: "0.0.0.0:514"
  fields:
    logzio_codec: plain
    token: XXXXXXXXXXXXXXXXXXXXXXXXXX
    type: udpsyslog
  fields_under_root: true
- type: tcp
  max_message_size: 10MiB
  host: "0.0.0.0:9000"
  fields:
    logzio_codec: plain
    token: XXXXXXXXXXXXXXXXXXXXXXXXX
    type: tcpsyslog
  fields_under_root: true
#For version 6.x and lower uncomment the line below and remove the line after it 
#filebeat.registry_file: 'C:\ProgramData\Filebeat' 
filebeat.registry.path: 'D:\filebeat'
#The following processors are to ensure compatibility with version 7
processors:
- rename:
    fields:
     - from: "agent"
       to: "beat_agent"
    ignore_missing: true
- rename:
    fields:
     - from: "log.file.path"
       to: "source"
    ignore_missing: true
############################# Output ##########################################
output:
  logstash:
    hosts: ["listener.logz.io:5015"]
    ssl:
      certificate_authorities: ['D:\filebeat\COMODORSADomainValidationSecureServerCA.crt']
  worker: 4

Just making sure the placement and formatting of the "worker: 4" is correct.

Also, if I'm increasing the workers to 4 -- is each worker going to get its own read_buffer of 2 GB? Or should I keep/eliminate that part?

I am not looking for your Kafka config. I am not a Kafka engine config expert, so I cannot help debug that. You said you have plenty of capacity there -- OK, good.

I am trying to focus on Filebeat, that was the topic of this discussion.

According to the filebeat.yml config you just posted, you are sending the data to Logstash using the Logstash output -- that is my assumption. If you were sending from Filebeat directly to Kafka, I would expect to see the Filebeat Kafka output where you have the Logstash output.

This is what I am completely confused about.... But we will proceed.

This is what your config says... What is after Logstash I don't know; I suspect it is Kafka, as you keep mentioning it:
UDP -> Filebeat -> Logstash -> Probably Kafka -> Probably Elasticsearch

Of course, it says logz.io, so I have no clue about the internals.

So back to my very first post: it's possible Logstash is backing up, but we will proceed assuming it is not... I suspect Filebeat is just not writing fast enough.

No, don't add more buffer right now; change 1 parameter at a time -- it's taken a lot of typing to try 1 parameter :slight_smile:

Just start by simply adding 1 parameter... Most people would format it as follows. We might want to try some other settings; this is the first I would try.

output.logstash:
  hosts: ["listener.logz.io:5015"]
  ssl.certificate_authorities: ['D:\filebeat\COMODORSADomainValidationSecureServerCA.crt']
  worker: 4

Also, I am not familiar with logz.io ingestion; perhaps you should check with them.

Hiya @sudont

I was scratching my head last night thinking... this feels I/O bound. It was still bugging me when I woke up, so I ran a quick test.

I suspect you have back pressure somewhere in your system that is slowing Filebeat's writing down.

I think you should be able to get 3K / sec without any major tuning... then you might need to add the workers etc. You do have a little processing going on with logzio_codec: plain, but as you said you do not look CPU bound.

I just ran this... the most basic Filebeat config (yes, TCP, since it's easier to netcat), but UDP should be basically the same.

filebeat-tcp-simple.yml

filebeat.inputs:
- type: tcp
  max_message_size: 10MiB
  host: "localhost:9000"
output.logstash:
  hosts: ["localhost:5044"]

I ran this all on my Mac, all defaults.

stream logfile to TCP > filebeat > logstash > elasticsearch

My log lines are about 400 B (so a little smaller than yours) and I am getting a sustained rate greater than 5K events / sec. That is without any tuning; you should be able to get your 3K/s.

So in summary, I think you need to look at what is downstream (after Filebeat) that is slowing it down.


Wow, thanks for the additional work on this.

I'm still waiting on confirmation that it's not an issue downstream.

What convinces me that it's in the Filebeat part of the equation is that when I increased read_buffer to 2000MiB we started seeing the actual amount of logs. That says something in the processing of the logs at the Filebeat level is holding them up, and that a memory buffer increased the capacity -- temporarily, but it couldn't keep up over time.

I'll report back once I have my contact add the 4 workers and we can test it.

Just my perspective from a distance; I am often wrong... :slight_smile:

I could possibly interpret that chart differently...

That chart shows the logs that made it to Elasticsearch with timestamps (important)... so when the buffer increased, more logs made it into Filebeat... then eventually they made it to Elasticsearch with the proper timestamps... so more made it through for a certain period, giving the appearance that throughput went up... Once that queue was full it started dropping messages again... so the queue just let some extra messages get processed, but the actual throughput rate did not really change... it can look misleading IMHO. Remember Discover is timestamp driven, and the timestamp is most likely the originating timestamp set by the codec...

It's like you have a bucket of water that is being filled faster than it can be drained. It can only drain at a certain rate, but it's filling faster and overflowing... the queue made the bucket bigger but did not change the rate of fill / drain... so as soon as it was full it started dropping again. But for the little while it was filling up the queue it was not dropping, and those logs / timestamps made it to Elasticsearch, so that timeframe got a full set, making it look like throughput went up.

Perhaps.. Perhaps Not.

Easy way to test... put a little Elasticsearch right on that Filebeat box and pump the data to it.

Just download, unzip, and run -- takes about 5 minutes... don't change anything... bet it will fly...
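
For that test, the output section would just become something like this (assuming a default, unsecured local Elasticsearch on port 9200):

output.elasticsearch:
  # local test instance on the same box -- default port, no security, just for the throughput test
  hosts: ["http://localhost:9200"]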


I gotcha -- makes sense.

In the chart that I showed -- that's what was shocking to me: once I opened up the buffer, those were properly timestamped logs from Juniper getting indexed, so that's why I'm under the impression they're the actual load of logs that we're supposed to be seeing.

Yeah, it can be confusing, and I'm not saying I'm 100% right, but I've seen those kinds of spikes before and scratched my head.

In the end this is a throughput issue, I believe... It may not be -- I don't know what that codec does -- but if it were super CPU intensive I think you would see your CPU spiking.

You can still try the workers setting.

Let me know what you find out; heading out.

Trying this later this afternoon, FYI.

Hey Stephen,

Was able to connect with the team today that's running the Filebeat and Junipers, and ran some tests. We tried increasing workers to 4 and then to 7 and saw no change. We also saw no change in CPU/MEM utilization.

There was also no change to

Then I thought to try and see if having additional workers would help with an increased read_buffer (1000MiB and 2000MiB). There was no change here either.

I also had our Operations team make sure that the issue was NOT ingestion; we never even saw spikes in our pipeline.

The only change we saw was that we were again seeing the actual amount of logs when the read_buffer was increased, and then it wasn't able to keep up.

We also ran Wireshark on the server to compare how many logs were coming in vs. getting to Kibana for a specific firewall -- the server saw 1.7 million logs for FIREWALL1 over an hour and we only logged 560k for that same period. So we officially confirmed the issue.

The issue definitely appears to be up front, with how much Filebeat can LISTEN to and PROCESS. It seems like increasing the workers increases the network output capacity.

Is there any way to add workers to the ingestion?

Hi Scott,

Right now there is no way to scale that input; generally that is not an issue.

Let's isolate. Can you please try this...

Notice I took out the codec, buffers, queues -- everything... This simply reads from UDP and writes to a file. Please try this and report back.

filebeat.inputs:
- type: udp
  host: "0.0.0.0:514"
output.file:
  path: "/tmp"
  filename: filebeat.out
  rotate_every_kb: 1024000
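
On your Windows box, the same test would look roughly like this -- the D:\ path is just a guess based on where your registry lives, so point it at any drive with room:

filebeat.inputs:
- type: udp
  host: "0.0.0.0:514"
output.file:
  # hypothetical local test directory -- adjust to a drive with free space
  path: 'D:\filebeat\test'
  filename: filebeat.out
  rotate_every_kb: 1024000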

And see how fast it writes to disk / file.

A little imprecise, but you should be able to run this and see how fast it is reading / writing...

wc -l /tmp/filebeat.out*

This read / wrote about ~6.8M records in a couple of minutes, which should give you an idea what the throughput is... This was with TCP; I don't have an easy way to generate UDP, and perhaps that is the difference / issue.

-rw-------  1 sbrown  staff  1048575521 Mar 17 15:28 filebeat.out.4
-rw-------  1 sbrown  staff  1048575526 Mar 17 15:29 filebeat.out.3
-rw-------  1 sbrown  staff  1048575457 Mar 17 15:29 filebeat.out.2
-rw-------  1 sbrown  staff  1048575812 Mar 17 15:30 filebeat.out.1
-rw-------  1 sbrown  staff   484158849 Mar 17 15:30 filebeat.out

wc -l /tmp/filebeat.out*
  712382 filebeat.out
 1545012 filebeat.out.1
 1545950 filebeat.out.2
 1543683 filebeat.out.3
 1547182 filebeat.out.4
 6894209 total

If that looks like it is writing fast, add the fields and codec etc. back and we shall see what is next.

Let me know what you find.

Okay -- will try tomorrow and let you know.

I can't find any examples of this syntax for queue.mem / queue.disk etc. -- can you tell me if my formatting/syntax is correct?

filebeat.inputs:
- type: udp
  max_message_size: 10MiB
  host: "0.0.0.0:514"
  fields:
    logzio_codec: plain
    token: XXXXXXXXXXXXXXXXXXXXXXX
    type: udpsyslog
  fields_under_root: true
  queue.mem:
    events: 4096
    flush.min_events: 2048
    flush.timeout: 1s
  queue.disk:
    path: "${path.data}/diskqueue"
    max_size: 100GB
    segment_size: 1GB
    read_ahead: 512
    write_ahead: 2048
    retry_interval: 1s
    max_retry_interval: 30s