Filebeat may be sending bad packets


(Richard Westmoreland) #1

Hello,

I'm working on a new Beat which has a TCP listener and decodes Lumberjack v2. I ran into errors which I initially thought were due to my code, but as I've added more debugging and worked backwards, I think the issue may actually originate from the official Filebeat that is sending the data.

I've confirmed that my listener receives the entire binary stream: it starts with 2W, then 2C, the compressed payload decompresses completely intact, and all the sequence numbers match up. This loops fine for a while, but then instead of 2W I get garbage, the connection is reset, and Filebeat and my beat have to start again where they left off.

The bad bytes are not consistent: I get (hex representation) fbc6, c804, 1897, 4e6f, etc., so they appear to be completely random. The previous payload is confirmed to be 100% received and decoded, so I know it isn't leftover from that. I read the Go documentation on reading from net.Conn, and the read is blocking, so the 2-byte buffer should accurately reflect what is coming over TCP.

As per your support request guidance, I ran Filebeat in debug mode, and I see several errors that closely match the time my beat gets the garbage bytes and resets the connection:

Error setting up harvesters: ...... too many open files
Error publishing events (retrying): ....... write: connection reset by peer
Failed to publish events caused by: EOF
Connect failed with: dial ....... socket: too many open files

After this goes on for a while, Filebeat is in a bad state and will no longer move data. If I restart my beat, it does not recover. However, if I restart Filebeat, the connection works again until it hits the errors noted above.

Beat version: 5.1.2
Operating System: CentOS 7
Configuration:
    input_type: log
    output.logstash:  ssl.enabled: false, compression_level: 5, worker: 1, pipelining: 2, timeout: 10s, max_retries: 65500, bulk_max_size: 512

The beat I'm building is based on libbeat 5.2.3.

Obviously having too many open files itself is a problem that I need to fix regardless, but the concern I'm raising is why is Filebeat sending bad packets?

Maybe this helps: in my Filebeat debug output, it notes "All prospectors are initialized and running with 2043 states to persist". The failure occurs at file 1017. My ulimit was 1024. I went ahead and set this to 8192 and tried the test again. The file open errors are gone, but the result is the same. After file 1020, I get:

Failed to publish events caused by: client is not connected
Failed to publish events caused by: write tcp ....... write: connection reset by peer
Error publishing events caused by: client is not connected
Failed to publish events caused by: EOF
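As an aside on the descriptor limit mentioned above, a Go process on Linux can inspect its own open-file limit with `syscall.Getrlimit`. The sketch below is only illustrative: `openFileLimits` and `fitsWithin` are hypothetical helpers, the reserve of 16 descriptors is an arbitrary assumption, and treating each of the 2043 states as one open file is also an assumption that may not hold for Filebeat.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimits returns the soft and hard RLIMIT_NOFILE values of this process.
func openFileLimits() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	return uint64(rl.Cur), uint64(rl.Max), nil
}

// fitsWithin reports whether `needed` descriptors plus a reserve for sockets,
// logs, and similar handles stay within the given soft limit.
func fitsWithin(soft, needed, reserve uint64) bool {
	return needed+reserve <= soft
}

func main() {
	soft, hard, err := openFileLimits()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("soft limit: %d, hard limit: %d\n", soft, hard)

	// Numbers from the post: 2043 states against limits of 1024 and 8192.
	fmt.Println("fits under 1024:", fitsWithin(1024, 2043, 16))
	fmt.Println("fits under 8192:", fitsWithin(8192, 2043, 16))
}
```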

I've just now recompiled my beat with the extra debug logging commented out and reran the test. Surprisingly, Filebeat is able to load up all the prospectors without issue, and my listener is receiving data non-stop without any bad bytes. So somehow writing aggressively to console contributed to the same issue as attempting to open more file handles than the limit. I don't know why either of these would cause Filebeat to send bad data on the socket.

Ideas?


(Richard Westmoreland) #2

I figured it out. It wasn't Beats after all. My zlib reader was prematurely ending its read because my process was running faster than Filebeat could send to it. I had to add a step that blocks until the exact expected number of bytes has been read, and this worked perfectly. This can be closed.


(Steffen Siering) #3

Note that incomplete reads like this are not caused by Filebeat; they are a consequence of how network stacks work. TCP delivers a byte stream, so a single read may return fewer bytes than were requested.
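A small illustration of the point, using an in-memory `net.Pipe` instead of a real TCP socket (an assumption made so the example is deterministic). The helpers `sendInPieces`, `singleRead`, and `fullRead` are hypothetical. A blocking `Read` returns as soon as some data is available, while `io.ReadFull` keeps reading until the buffer is full.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// sendInPieces writes each chunk with its own Write call, imitating a
// header that arrives split across several network segments.
func sendInPieces(w io.WriteCloser, chunks ...[]byte) {
	defer w.Close()
	for _, c := range chunks {
		if _, err := w.Write(c); err != nil {
			return // the reader went away
		}
	}
}

// singleRead performs one Read into a 2-byte buffer and reports how many
// bytes that single call returned.
func singleRead(chunks ...[]byte) (int, error) {
	client, server := net.Pipe()
	defer server.Close()
	go sendInPieces(client, chunks...)
	buf := make([]byte, 2)
	return server.Read(buf)
}

// fullRead uses io.ReadFull, which loops until both bytes have arrived.
func fullRead(chunks ...[]byte) (string, error) {
	client, server := net.Pipe()
	defer server.Close()
	go sendInPieces(client, chunks...)
	buf := make([]byte, 2)
	_, err := io.ReadFull(server, buf)
	return string(buf), err
}

func main() {
	n, err := singleRead([]byte("2"), []byte("W"))
	fmt.Println("single Read returned", n, "byte(s), err:", err)

	hdr, err := fullRead([]byte("2"), []byte("W"))
	fmt.Printf("io.ReadFull returned %q, err: %v\n", hdr, err)
}
```

With the header split into two writes, the single `Read` returns only the first byte even though the buffer has room for two, which is the kind of short read that can desynchronize a frame parser.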

