Filebeat missing files


(Bud B) #1

Hi,

filebeat version 1.3.1 (amd64)
logstash 2.3.4
elasticsearch Version: 2.3.5, Build: 90f439f/2016-07-27T10:36:52Z, JVM: 1.8.0_101

all running on Ubuntu 14.04

I have 40 or so servers feeding application logs with filebeat to logstash and on to elasticsearch. Every day, the content from a few log files on several different servers doesn't seem to make its way from the logging server running filebeat to elasticsearch. I suspect filebeat. I am running the 1.3.1 snapshot, which fixed a startup issue, and this may be related, as I don't recall seeing the issue prior to filebeat 1.3. To me, the telling sign is that the missing log file is listed in /.filebeat but the listed "offset" is not equal to the actual size of the file. In one example, the file is 352565 bytes and the offset currently is "offset":314160. There are no errors in the logstash or elasticsearch logs.

Here are some details about my setup. Each server is running several instances of the application. Each instance of the application generates its own log file, and all of the logs go to the same directory. The log files generally grow to a few hundred K throughout the day. Several weeks of log files can accumulate on the servers -- one server I am looking at has 367 files in the directory. Nothing is unique about the "missing" files relative to the ones that are processed.

Filebeat YAML file (sanitized):

output:
  logstash:
    enabled: true
    hosts:
      - elk.ourcompany.com:5044
    tls:
      certificate_authorities:
        - /etc/pki/tls/certs/filebeat.crt
    timeout: 15

filebeat:
  prospectors:
    -
      paths:
        - /var/log/corefile.log
        - "/var/log/application/*.log"
      document_type: application-fb
      ignore_older: 24h
    -
      paths:
        - "/var/log/transaction/*.xxx"
        - "/var/log/transaction/*.yyy"
        - "/var/log/transaction/*.zzz"
      document_type: transaction-fb
      ignore_older: 24h

Any ideas?


(ruflin) #2

What is the total number of files you have in your directories? Could it be that you are hitting an open file handle limit? Any errors in the filebeat log? Could you share the log?

Is there any log rotation in place? Based on the config file above I assume you are not working with any network shares?

As you have set ignore_older to 24h, I'm also curious what the modification dates are of the files whose data was not shipped.

Any chance to test with filebeat-5.0 snapshots to see if you have the same issues? https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/


(Bud B) #3

@ruflin, Last night I decided to restart filebeat on the servers.

Looking back at the startup, there were about 300K errors related to too many open files:

70K of the errors were related to the registry:

registrar.go:109: Writing of registry returned error: open /.filebeat.new: too many open files. Continuing..

or

registrar.go:151: Failed to create tempfile (/.filebeat.new) for writing: open /.filebeat.new: too many open files

The other 240K errors were related to the log files:

log.go:105: Stop Harvesting. Unexpected file opening error: open /var/log/transaction/2015-03-19-051140-aaaaaaa.xxx: too many open files
log.go:233: Failed opening /var/log/transaction/2015-09-17-131704-bbbbbbb.xxx: open /var/log/transaction/2015-09-17-131704-bbbbbbb.xxx: too many open files

So the open file limit is 1024 and there are more than 1024 files in the directories -- however, most of the files are older than 24 hours and should be ignored, right?

-- Bud


(ruflin) #4

I think we can nail down your problem to too many open files then. Thanks for checking.
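You can confirm the limit filebeat inherits from the shell or init script that starts it (a quick diagnostic sketch; assumes the default Ubuntu 14.04 behaviour where the service inherits the starting shell's limits):

```shell
# Soft limit: what the process actually gets; hard limit: the ceiling
# it could raise itself to. The Ubuntu 14.04 default soft limit is 1024.
ulimit -Sn
ulimit -Hn

# Count how many files one of the prospector globs currently matches;
# anything near or above the soft limit is trouble for filebeat 1.x.
ls /var/log/transaction/*.xxx 2>/dev/null | wc -l
```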

The problem is that not only the harvesters need file descriptors; so do, for example, the output and the writing of the registry file. TBH, as soon as filebeat 1.x runs out of file descriptors even once, things can go wrong, because the state also depends on the registry file (which was not written), so the state gets out of sync too.

As the internal state is separated from the written state in 5.x, this should be less of an issue, but behaviour can still get shaky as soon as the max open files limit is reached. To prevent this kind of issue in environments with a low max open files limit, we introduced harvester_limit: https://www.elastic.co/guide/en/beats/filebeat/5.0/configuration-filebeat-options.html#harvester-limit

As you wrote that only a few files get updated in parallel, I would recommend setting harvester_limit to a low value like 50-100, which should be more than enough. You are right that, because of ignore_older, most of the files should already have been ignored.

It would be nice if you could test the following two things:

  • Filebeat 5.0 with the same config options you had above to see if the problem persists
  • In case you still hit the issue, try out the harvester_limit option to see if this solves the issue.
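In 5.x the option sits at the prospector level; a rough sketch against the transaction prospector from the first post (option names per the linked docs, value to taste):

```yaml
filebeat:
  prospectors:
    -
      paths:
        - "/var/log/transaction/*.xxx"
        - "/var/log/transaction/*.yyy"
        - "/var/log/transaction/*.zzz"
      document_type: transaction-fb
      ignore_older: 24h
      # Cap the number of harvesters started in parallel for this
      # prospector so the open-file limit cannot be exhausted
      harvester_limit: 100
```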

(Bud B) #5

@ruflin, Can I just drop in Filebeat 5.0.0-alpha5 and it will work with my existing logstash 2.3.4 and elasticsearch 2.3.5? If so, I'll try to deploy that tonight. -- Bud


(Bud B) #6

@ruflin, So I put Filebeat 5.0.0-alpha5 on a test server with a lot of old log files. I stopped the 1.3.1 version and started filebeat with:

/usr/share/filebeat/bin/filebeat -c /etc/filebeat/filebeat.yml -e

so I could see what is going on. I'm getting thousands of logs that look like:

filebeat.log:2016-09-22T11:49:27-05:00 ERR File is falling under ignore_older before harvesting is finished. Adjust your close_* settings: /var/log/transaction/2016-04-08-155323-aaaaaaa.xxx

Each file is listed multiple times with the same message.

It doesn't seem to make any progress -- nothing is sent to logstash (checked via tcpdump)...

Seems to me it is a startup issue where it isn't respecting the ignore_older: 24h

-- Bud


(ruflin) #7

Can you please test the beta1 release? https://www.elastic.co/blog/elastic-stack-release-5-0-0-beta1 There was a bug in alpha5 which could have caused the above issue but was fixed for the beta1 release.

Did you clean up the registry file before you restarted filebeat?


(Bud B) #8

@ruflin, OK. Looks much cleaner at startup now, but I think my TLS/SSL config is broken. It seems like I can't connect to logstash:

2016-09-23T09:39:54-05:00 ERR Failed to publish events caused by: EOF
2016-09-23T09:39:54-05:00 INFO Error publishing events (retrying): EOF
2016-09-23T09:39:55-05:00 ERR Failed to publish events caused by: write tcp 192.168.154.41:45096->192.168.153.111:5044: write: broken pipe
2016-09-23T09:39:55-05:00 INFO Error publishing events (retrying): write tcp 192.168.154.41:45096->192.168.153.111:5044: write: broken pipe
2016-09-23T09:39:56-05:00 ERR Failed to publish events caused by: EOF
2016-09-23T09:39:56-05:00 INFO Error publishing events (retrying): EOF

I tried changing "tls" to "ssl" in the .yml file. Any pointers for sorting this out? -- Bud


(Bud B) #9

@ruflin Or is filebeat 5.0 beta1 incompatible with logstash 2.3.4?

I'm using self signed certificates. My beats input looks like this:

input {
    beats {
        port => 5044
        ssl => true
        ssl_certificate => "/etc/pki/tls/certs/filebeat.crt"
        ssl_key => "/etc/pki/tls/private/filebeat.key"
    }
}

I created the certificate with:

sudo openssl req -subj '/CN=elk.mycompany.com/' -x509 -days 3650 -batch -nodes -newkey rsa:2048 -keyout private/filebeat.key -out certs/filebeat.crt

No errors being logged on the logstash server side...
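One sanity check worth running with self-signed certs is that the certificate and key actually pair up; the two digests below must match (a diagnostic sketch, paths as above):

```shell
# Both commands must print the same digest; a mismatch means the
# certificate and private key do not belong together.
openssl x509 -noout -modulus -in certs/filebeat.crt | openssl md5
openssl rsa  -noout -modulus -in private/filebeat.key | openssl md5
```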

-- Bud


(Steffen Siering) #10

We don't test filebeat beta1 with logstash 2.3.4, but I'd assume they should still work. Can you try without TLS/SSL first so we can check that the network is OK?

In beta1 changing tls.certificate_authorities to ssl.certificate_authorities should be enough.
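Applied to the output section from the first post, the renamed block would look roughly like this (a sketch; only the tls key changes to ssl):

```yaml
output:
  logstash:
    hosts:
      - elk.ourcompany.com:5044
    # "tls" in 1.x becomes "ssl" in beta1; the nested options keep
    # their names
    ssl:
      certificate_authorities:
        - /etc/pki/tls/certs/filebeat.crt
    timeout: 15
```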


(Bud B) #11

@steffens, @ruflin Actually, I think I have filebeat 5.0 beta1 working fine with logstash 2.4 now. I am still getting the constant info/err logs though:

2016-09-27T17:01:10-05:00 ERR Failed to publish events caused by: EOF
2016-09-27T17:01:10-05:00 INFO Error publishing events (retrying): EOF
2016-09-27T17:01:20-05:00 ERR Failed to publish events caused by: EOF
2016-09-27T17:01:20-05:00 INFO Error publishing events (retrying): EOF
2016-09-27T17:01:30-05:00 ERR Failed to publish events caused by: EOF
2016-09-27T17:01:30-05:00 INFO Error publishing events (retrying): EOF
2016-09-27T17:01:35-05:00 ERR Failed to publish events caused by: EOF
2016-09-27T17:01:35-05:00 INFO Error publishing events (retrying): EOF

I thought they were SSL-related errors, but I ran in debug mode and they seem to correspond to backing off polling currently open/active log files:

2016-09-27T16:58:48-05:00 DBG  End of file reached: /var/log/.../2016-09-27-143216-....log; Backoff now.
2016-09-27T16:58:48-05:00 DBG  End of file reached: /var/log/.../2016-09-27-143217-....xxx; Backoff now.

Is this possibly a bug? It is very chatty if not...


(Bud B) #12

Definitely seems to be related to actively harvested files. As soon as I create a new log, I start seeing those messages:

2016-09-27T20:47:40-05:00 INFO No non-zero metrics in the last 30s
2016-09-27T20:48:10-05:00 INFO No non-zero metrics in the last 30s
2016-09-27T20:48:40-05:00 INFO No non-zero metrics in the last 30s
2016-09-27T20:49:02-05:00 INFO Harvester started for file: /var/log/application/2016-09-28-014859-aaa.log
2016-09-27T20:49:02-05:00 INFO Harvester started for file: /var/log/transaction/2016-09-28-014900-aaa.xxx
2016-09-27T20:49:06-05:00 ERR Failed to publish events caused by: EOF
2016-09-27T20:49:06-05:00 INFO Error publishing events (retrying): EOF
2016-09-27T20:49:07-05:00 ERR Connecting error publishing events (retrying): EOF
2016-09-27T20:49:09-05:00 ERR Failed to publish events caused by: EOF
2016-09-27T20:49:09-05:00 INFO Error publishing events (retrying): EOF
2016-09-27T20:49:10-05:00 INFO Non-zero metrics in the last 30s: publish.events=913 filebeat.harvester.open_files=2 libbeat.logstash.published_and_acked_events=911 registrar.states.update=913 libbeat.logstash.publish.read_bytes=6221 libbeat.logstash.published_but_not_acked_events=1384 registar.states.current=2 registrar.writes=1 filebeat.harvester.started=2 libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.read_errors=1 filebeat.harvester.running=2 libbeat.logstash.publish.write_bytes=62327 libbeat.publisher.published_events=911

The logs are written from a C application using fprintf(). Oh, and we don't fflush() after each line... -- Bud


(Bud B) #13

@steffens, @ruflin, I don't really know Go or exactly how filebeat works internally. I assume the error is being logged here:

Is it possible that, due to buffering, "data" has a partial line when publishWindowed() is called, and that's why it fails with EOF -- it really hits the end of the buffer before it finds a newline?


(Steffen Siering) #14

If you get an EOF, it's due to logstash closing the connection. It has nothing to do with buffering in filebeat or partial lines (no partial line is forwarded to outputs).

A bug has been discovered in the logstash input plugin: it closes connections while events are still being processed and logstash is taking too long: https://github.com/logstash-plugins/logstash-input-beats/pull/142 . A fixed plugin version has been published for LS 5.0-beta1, but it seems not for 2.4 yet. You can try increasing client_inactivity_timeout to a very big value.
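In the beats input from post #9, that suggestion would look something like this (a sketch; 900 seconds is an arbitrary generous value):

```
input {
    beats {
        port => 5044
        ssl => true
        ssl_certificate => "/etc/pki/tls/certs/filebeat.crt"
        ssl_key => "/etc/pki/tls/private/filebeat.key"
        # keep idle connections open well past any processing pause
        client_inactivity_timeout => 900
    }
}
```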


(Steffen Siering) #15

I have been told the most recently published logstash-input-beats plugin should work with logstash 2.4. I'd recommend updating the plugin.


(Bud B) #16

This looks promising. I did some preliminary testing with logstash 2.4 and the updated beats plugin against filebeat 5.0 beta1 and am not seeing the EOF error. Thanks!!! -- Bud


(Bud B) #17

@steffens, So I rolled out the updated beats input plugin for logstash and filebeat 5.0 beta1 across my servers (50ish), and for the most part it is working well so far. I have noticed a few "Failed to publish events caused by: EOF" in the logs. One server in particular seemed to be generating them constantly:

2016-10-03T08:19:14-05:00 ERR Failed to publish events caused by: EOF
2016-10-03T08:19:14-05:00 INFO Error publishing events (retrying): EOF
2016-10-03T08:19:19-05:00 ERR Failed to publish events caused by: EOF
2016-10-03T08:19:19-05:00 INFO Error publishing events (retrying): EOF
2016-10-03T08:19:24-05:00 ERR Failed to publish events caused by: EOF
2016-10-03T08:19:24-05:00 INFO Error publishing events (retrying): EOF
2016-10-03T08:19:44-05:00 ERR Failed to publish events caused by: EOF
2016-10-03T08:19:44-05:00 INFO Error publishing events (retrying): EOF
2016-10-03T08:19:49-05:00 ERR Failed to publish events caused by: EOF
2016-10-03T08:19:49-05:00 INFO Error publishing events (retrying): EOF

The one thing that might be somewhat unusual about this server is that it is logging periodic heartbeat messages every 30ish seconds.


(Steffen Siering) #18

Have you also increased client_inactivity_timeout to something like 15 minutes? It looks like all the EOF timestamps are multiples of 5 seconds. Checking the code, a close signal is generated every client_inactivity_timeout regardless of whether logstash is processing data or not. With the timer active all the time, there is a chance of a signal being generated right after a batch has been processed. That is, the longer logstash takes to process data (or the more the outputs in logstash block), the higher the chance that client_inactivity_timeout will eventually trigger.

In general EOF is not that bad, as filebeat will automatically reconnect if this happens.

One filebeat or one logstash server?


(Bud B) #19

I set client_inactivity_timeout to 60, which seems like a good value when heartbeating every 30 seconds. Much better. My only comment now is that it is a really misleading message, if I understand it right:

2016-10-05T16:00:39-05:00 ERR Failed to publish events caused by: EOF
2016-10-05T16:00:39-05:00 INFO Error publishing events (retrying): EOF

The first message probably should not be an ERR, and it really has nothing to do with the end of a file: it is the result of logstash closing the socket because the filebeat client was inactive. EOF makes it sound like there was an issue reading the source file, not an issue communicating with the server. The second message is somewhat redundant, and it is odd that it is an "INFO Error". Makes it sound worse than it is. Should I open a ticket?


(system) #20

This topic was automatically closed after 21 days. New replies are no longer allowed.