Filebeat is not closing the file descriptor once the file is deleted/renamed

Filebeat is not closing the file descriptor when a file is removed/rotated with version 5.5, and because the deleted files stay open, the disk usage fills up:
14327 0 lrwx------ 1 xxx xxx 64 Oct 13 13:23 /proc/1079/fd/7 -> /tmp/ffi4PXrM7\ (deleted)

15953 0 l-wx------ 1 xxx xxx 64 Oct 13 13:23 /proc/1265/fd/3 -> /var/log/temp.log\ (deleted)
My filebeat configuration looks like this:
filebeat.prospectors:

- document_type: log
  encoding: utf-8
  fields: {app_name: xxx, service_name: xxx}
  fields_under_root: true
  input_type: log
  paths: [/var/log/temp.log]
  close_inactive: 1m
  close_renamed: true
  close_removed: true
  clean_older: 1m

Please advise on this.

The filebeat version currently in use is filebeat-5.5.2.

For reference, the discussion started here: https://github.com/elastic/beats/issues/5391#issuecomment-336795894

So if filebeat queues up, it means the receiving end is probably not fast enough. Can you share your filebeat log and general setup? Please also share the full filebeat config and make sure the formatting stays correct (use 3 backticks before and after) or paste it into a gist and link it here.

Hi Nicolas,

Please find the gist link below:

Thanks,
Swetha.

Hi Nicolas,

Could you please respond? We are blocked on this: because a huge number of file descriptors are left hanging for deleted files, the disk is getting full and internal communication inside the node is getting lost.

Currently, once we restart filebeat, the disk space is recovered, but I understand this is not the right approach to follow.

Awaiting response.
Swetha

I have the same problem:
lsof | grep /data | grep delete

filebeat 6452 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)
filebeat 6452 elog 20r REG 253,2 104866393 6687124 /data/logs/crabwf/scripts/script.jsonl10875984015936077.tmp (deleted)
filebeat 6452 6453 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 6453 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 6453 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 6453 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 6453 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)
filebeat 6452 6453 elog 20r REG 253,2 104866393 6687124 /data/logs/crabwf/scripts/script.jsonl10875984015936077.tmp (deleted)
filebeat 6452 6454 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 6454 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 6454 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 6454 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 6454 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)
filebeat 6452 6454 elog 20r REG 253,2 104866393 6687124 /data/logs/crabwf/scripts/script.jsonl10875984015936077.tmp (deleted)
filebeat 6452 6455 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 6455 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 6455 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 6455 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 6455 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)
filebeat 6452 6455 elog 20r REG 253,2 104866393 6687124 /data/logs/crabwf/scripts/script.jsonl10875984015936077.tmp (deleted)
filebeat 6452 6456 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 6456 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 6456 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 6456 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 6456 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)
filebeat 6452 6456 elog 20r REG 253,2 104866393 6687124 /data/logs/crabwf/scripts/script.jsonl10875984015936077.tmp (deleted)
filebeat 6452 6457 elog 5r REG 253,2 104862910 6686924 /data/logs/crabwf/scripts/script.jsonl10874734891405433.tmp (deleted)
filebeat 6452 6457 elog 16r REG 253,2 104864440 6687067 /data/logs/crabwf/scripts/script.jsonl10874936686553246.tmp (deleted)
filebeat 6452 6457 elog 17r REG 253,2 104857899 6687122 /data/logs/crabwf/scripts/script.jsonl10875961843456607.tmp (deleted)
filebeat 6452 6457 elog 18r REG 253,2 104859377 6687082 /data/logs/crabwf/scripts/script.jsonl10875163147399006.tmp (deleted)
filebeat 6452 6457 elog 19r REG 253,2 104865067 6687083 /data/logs/crabwf/scripts/script.jsonl10875188623134531.tmp (deleted)

Maybe this is due to these errors:
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: read tcp xx.xx.82.6:18218->xx.xx.80.58:5044: i/o timeout
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): read tcp xx.xx.82.6:18218->xx.xx.80.58:5044: i/o timeout
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: write tcp xx.xx.82.6:21018->xx.xx.80.57:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): write tcp xx.xx.82.6:21018->xx.xx.80.57:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: write tcp xx.xx.82.6:61812->xx.xx.80.65:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): write tcp xx.xx.82.6:61812->xx.xx.80.65:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: write tcp xx.xx.82.6:30589->xx.xx.80.64:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): write tcp xx.xx.82.6:30589->xx.xx.80.64:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: write tcp xx.xx.82.6:37377->xx.xx.80.63:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): write tcp xx.xx.82.6:37377->xx.xx.80.63:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 ERR Failed to publish events caused by: write tcp xx.xx.82.6:38383->xx.xx.80.59:5044: write: connection reset by peer
2017-10-24T11:21:00+03:00 INFO Error publishing events (retrying): write tcp xx.xx.82.6:38383->xx.xx.80.59:5044: write: connection reset by peer

Hi Dmitri,

As Nicolas mentioned above, it looks like the log file content is not transferred completely because there is a connection failure from filebeat to logstash, and in the meantime the file is rotated and removed.

In that situation the file descriptor could be left open.

Thanks,
Swetha. M

Hi Swetha,

Are we still waiting for Nicolas?

Thanks,
Dmitri

Yes, I am waiting for his response.

Currently, to work around this issue, we are restarting filebeat.

We are looking forward to a proper solution for this issue.

I have the same issue. Old logs are renamed each hour, then deleted by a script after an hour of inactivity, but filebeat still keeps all or many of them open. I can't reproduce it in the test lab; it only happens in production. Maybe it's related to the intensity of logging.
Config:

filebeat.prospectors:

- input_type: log

  paths:
    - /var/log/batch/Batch*-running.log

  close_inactive: 5m
  close_renamed: true
  clean_removed: true

close_removed is enabled by default (according to the official documentation).

filebeat-5.6.3-1.x86_64
Red Hat Enterprise Linux Server release 6.7

There are three almost identical servers with similar services, but only one has this issue.
And only a service restart or the close_timeout option really helps.

"close_timeout: Only use this option if you understand that data loss is a potential side effect." - If you can not lose data?

My logs are recreated (renamed) every hour and are not written to after the rename, so 2 hours is good for me. But of course I would prefer filebeat to work correctly instead of relying on restarts or the timeout.
It is strange: in my configuration filebeat fails to close these files three times over. It should close a file on rename (it didn't), then after 5 minutes of inactivity following the rename (it didn't), and then when the file is deleted 1 hour after the rename (it also didn't).
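
For reference, a minimal sketch of what I mean, on top of the config I posted above (the 2h value only reflects our rotation scheme, it is not a recommendation):

filebeat.prospectors:

- input_type: log
  paths:
    - /var/log/batch/Batch*-running.log
  close_inactive: 5m
  close_renamed: true
  clean_removed: true
  # last-resort fallback: force-close the handler after 2 hours,
  # accepting that lines not yet confirmed by Logstash may be lost
  close_timeout: 2h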

Back from holidays :slight_smile: Let's see what we can do here.

In general FB has an at-least-once guarantee for sending files. So if not all lines of a file have been confirmed by LS, the file is kept open. This can be "stopped" by close_timeout in the newer FB versions. If the file is deleted before FB picks it up again, some of the log lines will be lost.

Having these send errors in the logs is a very good indication that the connection to LS should be checked, and the LS logs examined for why this is happening. There are quite a few other threads with the same issue.

If these errors don't show up, we need to dig deeper into the issue. Could those of you who haven't done so yet also share the output part of your LS configuration?

@swethamahesh There seem to be some entries in your FB config which are not in the right place or not indented correctly. It should not be related to this problem, but I thought it was worth mentioning.

output {
  sqs {
    access_key_id => "XXXXXXXXXXXXXXXXXX"
    secret_access_key => "YYYYYYYYYYYYYYYYYY"
    region => "us-east-1"
    queue => "sqs_elk"
    codec => "json"
  }
}

We have three almost identical systems with the same software and processes, and only one has this issue. The only thing that could point to the reason is the following filebeat and logstash log content:

Logstash:

{:timestamp=>"2017-11-06T15:19:49.287000+0000", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Beats input", :exception=>LogStash::Inputs::Beats::InsertingTo
QueueTakeTooLong, :level=>:warn}
{:timestamp=>"2017-11-06T15:19:49.307000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the
current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::HalfOpenBreaker, :level=>:war
n}
{:timestamp=>"2017-11-06T15:19:50.361000+0000", :message=>"CircuitBreaker::Open", :name=>"Beats input", :level=>:warn}
{:timestamp=>"2017-11-06T15:19:50.362000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the
current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::OpenBreaker, :level=>:warn}
{:timestamp=>"2017-11-06T15:19:50.362000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :l
evel=>:warn}

And at the same time in filebeat logs:

2017-11-06T15:19:49Z ERR Failed to publish events caused by: EOF
2017-11-06T15:19:49Z INFO Error publishing events (retrying): EOF
2017-11-06T15:20:20Z ERR Failed to publish events caused by: read tcp 127.0.0.1:33444->127.0.0.1:5044: i/o timeout
2017-11-06T15:20:20Z INFO Error publishing events (retrying): read tcp 127.0.0.1:33444->127.0.0.1:5044: i/o timeout

It looks like sending of those files never resumes and FB keeps them open forever.

@steffens Could you have a look at the above?

The logstash logs look like they are from a very old Logstash version. Logstash is complaining about the internal pipeline being stalled -> it closes connections. The circuit breaker allows new connections only after some timeout.

Consider updating your Logstash version.

Try to figure out why logstash is blocking: an unresponsive output? Some grok pattern going crazy?

You're right, it is version 2. We had issues with version 5 there, but that was before we installed filebeat. OK, we'll discuss an LS upgrade with the team to check that.

We moved to LS 5.6 (the latest version from the yum repo). Now we have no ERR/WARN logs in LS or FB at all, but the situation is the same. Our logging scheme is as follows:

Java processes write their logs to processname-running.log; every hour those logs are renamed (same inode, no new output after that) and the process recreates the same processname-running.log (new inode).
A cron script removes any log that is older than 61 minutes. In general, every log (the same inode) lives for 2 hours.

But after the removal it looks like all of those logs (or almost all) are kept open by FB (we use lsof to confirm it). So what else can we do?
As I said before, we have three identical servers with the same FB+LS configs and versions, but the issue exists only on one of them, and it exists permanently.
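
For completeness, this is roughly how I expect the options from our config to map onto that lifecycle (just a sketch of my understanding, not confirmed behaviour; the close_timeout line is the workaround discussed above, which we would prefer not to need):

filebeat.prospectors:

- input_type: log
  paths:
    - /var/log/batch/Batch*-running.log
  close_renamed: true    # the hourly rename should close the old inode
  close_inactive: 5m     # no new output after the rename, so this should close it ~5 minutes later
  clean_removed: true    # the cron delete ~61 minutes after the rename should drop the registry state
                         # (close_removed is on by default and should close the handler at that point)
  # close_timeout: 2h    # last-resort workaround: force the close after the 2-hour file lifetime,
  #                      # at the cost of possibly losing unsent lines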