Filebeat is not closing the file descriptor once the file is deleted/renamed

Filebeat is not closing the file descriptor when a file is removed/rotated with version 5.5, and because the deleted files stay open, the disk usage fills up:
14327 0 lrwx------ 1 xxx xxx 64 Oct 13 13:23 /proc/1079/fd/7 -> /tmp/ffi4PXrM7\ (deleted)

15953 0 l-wx------ 1 xxx xxx 64 Oct 13 13:23 /proc/1265/fd/3 -> /var/log/temp.log\ (deleted)
My filebeat configuration looks like this:
filebeat.prospectors:

- document_type: log
  encoding: utf-8
  fields: {app_name: xxx, service_name: xxx}
  fields_under_root: true
  input_type: log
  paths: [/var/log/temp.log]
  close_inactive: 1m
  close_renamed: true
  close_removed: true
  clean_older: 1m

Please advise on this.

The filebeat version currently in use is filebeat-5.5.2.

For reference, the discussion started here: https://github.com/elastic/beats/issues/5391#issuecomment-336795894

So if filebeat queues up, it means the receiving end is probably not fast enough. Can you share your filebeat log and general setup? Please also share the full filebeat config and make sure the formatting stays correct (use 3 backticks before and after) or paste it into a gist and link it here.

Hi Nicolas,

Please find the gist link below:

Thanks,
Swetha.

Hi Nicolas,

Could you please respond? We are blocked on this: because a huge number of file descriptors are left hanging for deleted files, the disk is getting full and internal communication inside the node is getting lost.

Currently, once we restart filebeat, the disk space is recovered, but I understand this is not the right approach to follow.

Awaiting response.
Swetha

I have the same problem:
lsof | grep /data | grep delete

filebeat 6452 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)
filebeat 6452 elog 20r REG 253,2 104866393 6687124 /data/logs/crabwf/scripts/script.jsonl10875984015936077.tmp (deleted)
filebeat 6452 6453 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 6453 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 6453 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 6453 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 6453 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)
filebeat 6452 6453 elog 20r REG 253,2 104866393 6687124 /data/logs/crabwf/scripts/script.jsonl10875984015936077.tmp (deleted)
filebeat 6452 6454 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 6454 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 6454 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 6454 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 6454 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)
filebeat 6452 6454 elog 20r REG 253,2 104866393 6687124 /data/logs/crabwf/scripts/script.jsonl10875984015936077.tmp (deleted)
filebeat 6452 6455 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 6455 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 6455 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 6455 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 6455 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)
filebeat 6452 6455 elog 20r REG 253,2 104866393 6687124 /data/logs/crabwf/scripts/script.jsonl10875984015936077.tmp (deleted)
filebeat 6452 6456 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 6456 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 6456 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 6456 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 6456 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)
filebeat 6452 6456 elog 20r REG 253,2 104866393 6687124 /data/logs/crabwf/scripts/script.jsonl10875984015936077.tmp (deleted)
filebeat 6452 6457 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 6457 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 6457 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 6457 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 6457 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)

Maybe this is due to these errors:
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: read tcp xx.xx.82.6:18218->xx.xx.80.58:5044: i/o timeout
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): read tcp xx.xx.82.6:18218->xx.xx.80.58:5044: i/o timeout
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: write tcp xx.xx.82.6:21018->xx.xx.80.57:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): write tcp xx.xx.82.6:21018->xx.xx.80.57:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: write tcp xx.xx.82.6:61812->xx.xx.80.65:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): write tcp xx.xx.82.6:61812->xx.xx.80.65:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: write tcp xx.xx.82.6:30589->xx.xx.80.64:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): write tcp xx.xx.82.6:30589->xx.xx.80.64:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: write tcp xx.xx.82.6:37377->xx.xx.80.63:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): write tcp xx.xx.82.6:37377->xx.xx.80.63:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: write tcp xx.xx.82.6:38383->xx.xx.80.59:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): write tcp xx.xx.82.6:38383->xx.xx.80.59:5044: write: connection reset by peer

Hi Dmitri,

As Nicolas mentioned above, it looks like the log file content is not transferred completely because there is a connection failure from filebeat to logstash, and in the meantime the file is rotated and removed.

In that situation the file descriptor could be left open.

Thanks,
Swetha. M

Hi Swetha,

Are we still waiting for Nicolas?

Thanks,
Dmitri

Yes, I am waiting for his response.

Currently, to work around this issue, we are restarting filebeat.

We are looking forward to a proper solution for this issue.

I have the same issue. Old logs are renamed each hour, then deleted by a script after an hour of inactivity, but filebeat still keeps all or many of them open. I can't reproduce it in the test lab; it only happens in production. Maybe it's related to the intensity of logging.
Config:

filebeat.prospectors:

- input_type: log

  paths:
    - /var/log/batch/Batch*-running.log

  close_inactive: 5m
  close_renamed: true
  clean_removed: true

close_removed is enabled by default (according to the official documentation).

filebeat-5.6.3-1.x86_64
Red Hat Enterprise Linux Server release 6.7

There are three almost identical servers with similar services, but only one has this issue.
And only a service restart or the close_timeout option really helps.

"close_timeout: Only use this option if you understand that data loss is a potential side effect." - If you can not lose data?

My logs are recreated (renamed) every hour and are not written to after the rename, so 2 hours is good for me. But of course I would prefer filebeat to work correctly instead of relying on restarts or the timeout.
It is strange: in my configuration filebeat fails to close these files three times over. It should close a file on rename (it didn't), then after 5 minutes of inactivity following the rename (it didn't), and then when the file is deleted 1 hour after the rename (it also didn't).
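
For reference, a minimal sketch of what I mean, on top of the config I posted above (the 2h value only reflects our rotation scheme, it is not a recommendation):

filebeat.prospectors:

- input_type: log
  paths:
    - /var/log/batch/Batch*-running.log
  close_inactive: 5m
  close_renamed: true
  clean_removed: true
  # last-resort fallback: force-close the handler after 2 hours,
  # accepting that lines not yet confirmed by Logstash may be lost
  close_timeout: 2h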

Back from holidays :slight_smile: Let's see what we can do here.

In general FB has an at-least-once guarantee for sending files. So if not all lines of a file have been confirmed by LS, the file is kept open. This can be "stopped" by close_timeout in the newer FB versions. If the file is deleted before FB picks it up again, some of the log lines will be lost.

Having these send errors in the logs is a very good indication that the connection to LS should be checked, and the LS logs examined for why this is happening. There are quite a few other threads with the same issue.

If these errors don't show up, we need to dig deeper into the issue. Could those of you who haven't done so yet also share the output part of your LS configuration?

@swethamahesh There seem to be some entries in your FB config which are not in the right place or not indented correctly. It should not be related to this problem, but I thought it was worth mentioning.

output {
  sqs {
    access_key_id => "XXXXXXXXXXXXXXXXXX"
    secret_access_key => "YYYYYYYYYYYYYYYYYY"
    region => "us-east-1"
    queue => "sqs_elk"
    codec => "json"
  }
}

We have three almost identical systems with the same software and processes, and only one has this issue. The only thing that could point to the reason is the following filebeat and logstash log content:

Logstash:

{:timestamp=>"2017-11-06T15:19:49.287000+0000", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Beats input", :exception=>LogStash::Inputs::Beats::InsertingTo
QueueTakeTooLong, :level=>:warn}
{:timestamp=>"2017-11-06T15:19:49.307000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the
current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::HalfOpenBreaker, :level=>:war
n}
{:timestamp=>"2017-11-06T15:19:50.361000+0000", :message=>"CircuitBreaker::Open", :name=>"Beats input", :level=>:warn}
{:timestamp=>"2017-11-06T15:19:50.362000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the
current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::OpenBreaker, :level=>:warn}
{:timestamp=>"2017-11-06T15:19:50.362000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :l
evel=>:warn}

And at the same time in filebeat logs:

2017-11-06T15:19:49Z ERR Failed to publish events caused by: EOF
2017-11-06T15:19:49Z INFO Error publishing events (retrying): EOF
2017-11-06T15:20:20Z ERR Failed to publish events caused by: read tcp 127.0.0.1:33444->127.0.0.1:5044: i/o timeout
2017-11-06T15:20:20Z INFO Error publishing events (retrying): read tcp 127.0.0.1:33444->127.0.0.1:5044: i/o timeout

It looks like sending of those files never resumes and FB keeps them open forever.

@steffens Could you have a look at the above?

The logstash logs look like they are from a very old Logstash version. Logstash is complaining about the internal pipeline being stalled -> it closes connections. The circuit breaker allows new connections only after some timeout.

Consider updating your Logstash version.

Try to figure out why logstash is blocking: an unresponsive output? Some grok pattern going crazy?

You're right, it is version 2. We had issues with version 5 there, but that was before we installed filebeat. OK, we'll discuss an LS upgrade with the team to check that.

We moved to LS 5.6 (the latest version from the yum repo). Now we have no ERR/WARN logs in LS or FB at all, but the situation is the same. Our logging scheme is as follows:

Java processes write their logs to processname-running.log; every hour those logs are renamed (same inode, no new output after that) and the process recreates the same processname-running.log (new inode).
A cron script removes any log that is older than 61 minutes. In general, every log (the same inode) lives for 2 hours.

But after the removal it looks like all of those logs (or almost all) are kept open by FB (we use lsof to confirm it). So what else can we do?
As I said before, we have three identical servers with the same FB+LS configs and versions, but the issue exists only on one of them, and it exists permanently.
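
For completeness, this is roughly how I expect the options from our config to map onto that lifecycle (just a sketch of my understanding, not confirmed behaviour; the close_timeout line is the workaround discussed above, which we would prefer not to need):

filebeat.prospectors:

- input_type: log
  paths:
    - /var/log/batch/Batch*-running.log
  close_renamed: true    # the hourly rename should close the old inode
  close_inactive: 5m     # no new output after the rename, so this should close it ~5 minutes later
  clean_removed: true    # the cron delete ~61 minutes after the rename should drop the registry state
                         # (close_removed is on by default and should close the handler at that point)
  # close_timeout: 2h    # last-resort workaround: force the close after the 2-hour file lifetime,
  #                      # at the cost of possibly losing unsent lines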