Filebeat missed one complete log file in log rotation

We are using Filebeat version 1.3.1 for our system and following is it's config :

filebeat:
prospectors:
paths:
- "/path/to/file.log*"
fields_under_root: true
document_type: "event-type"
fields:
host: "host-xyz"
output:
logstash:
hosts: ["logstash-server:6789"]
tls:
certificate_authorities: ["/path/to/cert"]
logging:
to_files: true
files:
path: /var/log/filebeat
name: filebeat.log
rotateeverybytes: 104857600 # = 100MB
keepfiles: 7
level: warning

Rotated files are named like file.log.1, file.log.2, ....

Now, for most of the cases it works fine. But once in a while filebeat misses one whole file. Like a complete one and not even processing a single even for that rotated log file.

Any help/pointers in this regard?

Thanks,
Siddharth

Can you post a ls output and mark the ones that were not read, perhaps we can spot a pattern. Also, posting the filebeat logs (preferably at debug level) might be useful for us to understand what's going on.

Thanks for your reply @tudor . The problem is that the issue is erratic and happens suddenly sometime. After reading your reply I changed the log level to "info" from "warning" ("debug" would be too much for prod I thought). With this I realised that info logs are stored in file for some time and after sometime the file is empty and it does not store any log. That is why I was not able to prepare log file for you.

With a further investigation I found that we use qsrotate (http://mod-qos.sourceforge.net/qsrotate.1.html) along with filebeat to rotate filebeat logs. And this double rotation of logs (one by filebeat keepfile property and other by qsrotate) was probably preventing filebeat to log. Now I have switched off qsrotate and we are relying on only filebeat for its log rotation. And with this we are still monitoring our production environment for the main issue (skipping processing of a few log file). Till the time we are monitoring, a question for you. Do you think that qsrotate might have caused the main issue as well?

@tudor Able to see the error again. Following are the filebeat INFO level logs :

2017-03-27T18:27:58+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:28:03+02:00 INFO Events sent: 46
2017-03-27T18:28:03+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:29:41+02:00 INFO Error publishing events (retrying): read tcp 192.168.4.168:59005->192.168.4.128:5044: i/o timeout
2017-03-27T18:30:20+02:00 INFO send fail
2017-03-27T18:30:20+02:00 INFO backoff retry: 1s
2017-03-27T18:30:24+02:00 ERR SSL client failed to connect with: dial tcp 192.168.4.128:5044: getsockopt: no route to host
2017-03-27T18:30:24+02:00 INFO Connecting error publishing events (retrying): dial tcp 192.168.4.128:5044: getsockopt: no route to host
2017-03-27T18:30:24+02:00 INFO send fail
2017-03-27T18:30:24+02:00 INFO backoff retry: 2s
2017-03-27T18:30:27+02:00 ERR SSL client failed to connect with: dial tcp 192.168.4.128:5044: getsockopt: no route to host
2017-03-27T18:30:27+02:00 INFO Connecting error publishing events (retrying): dial tcp 192.168.4.128:5044: getsockopt: no route to host
2017-03-27T18:30:27+02:00 INFO send fail
2017-03-27T18:30:27+02:00 INFO backoff retry: 4s
2017-03-27T18:30:31+02:00 INFO Events sent: 49
2017-03-27T18:30:31+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:30:31+02:00 INFO Events sent: 3
2017-03-27T18:30:31+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:30:31+02:00 INFO Events sent: 386
2017-03-27T18:30:31+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:30:38+02:00 INFO Events sent: 416
2017-03-27T18:30:38+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:30:46+02:00 INFO Events sent: 222
2017-03-27T18:30:46+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:30:53+02:00 INFO Error publishing events (retrying): EOF
2017-03-27T18:30:53+02:00 INFO send fail
.
.
.
.
.
2017-03-27T18:35:21+02:00 INFO backoff retry: 1s
2017-03-27T18:35:22+02:00 INFO Events sent: 19
2017-03-27T18:35:22+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:35:28+02:00 INFO Error publishing events (retrying): EOF
2017-03-27T18:35:28+02:00 INFO send fail
2017-03-27T18:35:28+02:00 INFO backoff retry: 1s
2017-03-27T18:35:29+02:00 INFO Events sent: 102
2017-03-27T18:35:29+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:35:36+02:00 INFO Error publishing events (retrying): EOF
2017-03-27T18:35:36+02:00 INFO send fail
2017-03-27T18:35:36+02:00 INFO backoff retry: 1s
2017-03-27T18:35:37+02:00 INFO Events sent: 47
2017-03-27T18:35:37+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:35:43+02:00 INFO Error publishing events (retrying): write tcp 192.168.4.168:23665->192.168.4.128:5044: write: broken pipe
2017-03-27T18:35:43+02:00 INFO send fail
2017-03-27T18:35:43+02:00 INFO backoff retry: 1s
2017-03-27T18:35:44+02:00 INFO Events sent: 97
2017-03-27T18:35:44+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:35:51+02:00 INFO Events sent: 33
2017-03-27T18:35:51+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:35:58+02:00 INFO Error publishing events (retrying): write tcp 192.168.4.168:23697->192.168.4.128:5044: write: broken pipe
2017-03-27T18:35:58+02:00 INFO send fail
2017-03-27T18:35:58+02:00 INFO backoff retry: 1s
2017-03-27T18:35:59+02:00 INFO Events sent: 87
2017-03-27T18:35:59+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:36:06+02:00 INFO Error publishing events (retrying): EOF
2017-03-27T18:36:06+02:00 INFO send fail
2017-03-27T18:36:06+02:00 INFO backoff retry: 1s
2017-03-27T18:36:07+02:00 INFO Events sent: 62
2017-03-27T18:36:07+02:00 INFO Registry file updated. 6 states written.
2017-03-27T18:36:13+02:00 INFO Error publishing events (retrying): EOF
2017-03-27T18:36:13+02:00 INFO send fail
.
.
.
.
.
2017-03-28T00:48:18+02:00 INFO send fail
2017-03-28T00:48:18+02:00 INFO backoff retry: 1s
2017-03-28T00:48:19+02:00 INFO Events sent: 5
2017-03-28T00:48:19+02:00 INFO Registry file updated. 6 states written.
2017-03-28T00:48:26+02:00 INFO Events sent: 3
2017-03-28T00:48:26+02:00 INFO Registry file updated. 6 states written.
2017-03-28T00:48:33+02:00 INFO Error publishing events (retrying): write tcp 192.168.4.168:52143->192.168.4.128:5044: write: broken pipe
2017-03-28T00:48:33+02:00 INFO send fail
2017-03-28T00:48:33+02:00 INFO backoff retry: 1s
2017-03-28T00:48:34+02:00 INFO Events sent: 14
2017-03-28T00:48:34+02:00 INFO Registry file updated. 6 states written.
2017-03-28T00:48:38+02:00 INFO Error publishing events (retrying): EOF
2017-03-28T00:48:38+02:00 INFO send fail
2017-03-28T00:48:38+02:00 INFO backoff retry: 1s
2017-03-28T00:48:39+02:00 INFO Events sent: 2
2017-03-28T00:48:39+02:00 INFO Registry file updated. 6 states written.
2017-03-28T00:48:46+02:00 INFO Error publishing events (retrying): EOF
2017-03-28T00:49:26+02:00 INFO send fail
2017-03-28T00:49:26+02:00 INFO backoff retry: 1s
2017-03-28T00:49:34+02:00 ERR SSL client failed to connect with: read tcp 192.168.4.168:52199->192.168.4.128:5044: read: connection reset by peer
2017-03-28T00:49:34+02:00 INFO Connecting error publishing events (retrying): read tcp 192.168.4.168:52199->192.168.4.128:5044: read: connection reset by peer
2017-03-28T00:49:34+02:00 INFO send fail
2017-03-28T00:49:34+02:00 INFO backoff retry: 2s
2017-03-28T00:49:36+02:00 INFO Events sent: 5
2017-03-28T00:49:36+02:00 INFO Registry file updated. 6 states written.
2017-03-28T00:49:41+02:00 INFO Events sent: 57
2017-03-28T00:49:41+02:00 INFO Registry file updated. 6 states written.
2017-03-28T00:49:48+02:00 INFO Events sent: 10
2017-03-28T00:49:48+02:00 INFO Registry file updated. 6 states written.
2017-03-28T00:49:56+02:00 INFO Error publishing events (retrying): EOF
2017-03-28T00:49:56+02:00 INFO send fail
2017-03-28T00:49:56+02:00 INFO backoff retry: 1s
2017-03-28T00:49:57+02:00 INFO Events sent: 4

I could see a couple of errors here. Any pointers what could we be doing wrong?

Any hints @ruflin / @tudor ?

The output errors should not lead to data loss as filebeat is keeping the file open until it is completely sent.

In your log file above, I didn't see anything related to filebeat potentially missing a file. Could you update to filebeat 5.x and see if you still see the issue? The file handling is quite different in 5.x and fixed a lot of previous issues.

About the beats logging: That is also something we should improve on our side and are working on.

Thank @ruflin for your reply.

You told The output errors should not lead to data loss as filebeat is keeping the file open until it is completely sent. But what I observed is that a complete file is missed and not a partial file. So that would mean that a harvestor never opened a particular file for processing, which was a bit scary.

Any suggestions, what could we do with current setup (i.e. current version of filebeat) since it is not easy for us to suddenly upgrade in production environment like that.

How fast are you logs rotating? Means how long does it one log file to get rotated over on peak time. My assumption is that you hit sometimes a race condition. One option to make it less likely is to decrease scan_frequency and for example only have - "/path/to/file.log in the pattern. We normally not recommend to do this as when you restart filebeat, it can still pick up old files. But this reduces the risk of race condition again as less files are tracked. Also it will still finish the file on renaming.

The biggest problem in 1.x is that there are no clean_* options so the registry file grows over time which make it likelier for you to hit an inode reuse. You could write a script to manually clean it up, but I don't recommend it.

So I still strongly recommend to upgrade even though I cannot be 100% sure by above assumption is correct (nothing like this in the log). Upgrade to 5.x should be rather simple, but yes test it before going into production :wink:

Thanks @ruflin. Upgrade did fix our issue. We finally found that the issue we were running into was https://github.com/elastic/beats/issues/1974 . Unfortunately, migrating from 1.3 to 5.3 also came out to be a bit painful. We missed https://www.elastic.co/guide/en/beats/libbeat/5.0/breaking-changes-5.0.html where the config were changed from tls to ssl for filebeat config. But the issue is that filebeat -configtest also does not give error with tls configuration (I think it should be filed as an issue) and we took some time to figure this out also.

But finally everything works fine now. Thanks a lot for your help :slight_smile:

Cheers,
Siddharth

Glad it's working for you now. About -configtest: This one is quite tricky as this mainly does validate the yaml file an some of the config options. But it does not check if there are "too many" config options ... But we are aware of the issue and try to improve it.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.