Filebeat holds file handles during Elasticsearch downtime -> monitored system failed due to full disk

Hi,

in our dev system we encountered the following issue:
Our Elastic Stack was unavailable because its disk was full, so Elasticsearch automatically set the indices/shards to read-only. As a consequence, Redis filled up and stopped accepting events once its memory limit was reached.

Filebeat kept trying to read log lines and send them to Redis. In general we do not want to lose any log data just because Elasticsearch is temporarily unreachable.

Now here comes the main issue:
On the log-producing system (Linux) the filesystem filled up. Old logs were deleted, but the space was not released because Filebeat still had open file handles on those files. As a consequence, the monitored server's service failed.
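The mechanism can be demonstrated without Filebeat at all: on Linux, unlinking a file only removes the directory entry, and the blocks stay allocated until the last open file descriptor is closed. A minimal sketch (file path and size are arbitrary; `tail -f` stands in for Filebeat's harvester):

```shell
# Create a file and hold an fd open on it, like a harvester would.
dd if=/dev/zero of=/tmp/handle_demo.log bs=1M count=50 2>/dev/null
tail -f /tmp/handle_demo.log >/dev/null &
TAIL_PID=$!
sleep 1

rm /tmp/handle_demo.log   # file disappears from the directory...

# ...but the fd survives, and /proc shows it marked "(deleted)".
STILL_OPEN=$(readlink /proc/$TAIL_PID/fd/* | grep -c deleted)
echo "deleted-but-open fds: $STILL_OPEN"

kill $TAIL_PID            # only now does the kernel release the blocks
```

As long as such a descriptor exists, `df` keeps reporting the space as used, which is exactly what filled the disk here.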

We solved this by restarting Filebeat, but I need a general solution for production:
How can I configure Filebeat so that logs (if available) are still sent to its configured output while the Elastic backend is unavailable, but so that Filebeat releases a logfile's handle when a script or an admin deletes the file?

Thanks a lot, Andreas

Is this the option I am looking for?

  # When enabling this option, a file handler is closed immediately in case a file can't be found
  # any more. In case the file shows up again later, harvesting will continue at the last known position
  # after scan_frequency.
  #close_removed: true

What is the default for this value? I didn't touch it.

I tried to reproduce this on a dev VM with the following steps:

  • configured filebeat:
to send to an unreachable Redis output
    • to read some gigabytes of local logfiles
  • start filebeat
  • check free disk space
  • check lsof of filebeat
  • delete logfiles
  • check lsof of filebeat
  • check free disk space

Reproducing is easy.
lsof shows entries like this (after I deleted the files):

filebeat   1795      filebeat    6r      REG              253,0 2382651371     655368 /plx/jboss/test7/instance-poliks/log/monitoring.json.log.2019-11-16 (deleted)
filebeat   1795      filebeat    7r      REG              253,0 2521340487     655369 /plx/jboss/test7/instance-poliks/log/monitoring.json.log.2019-11-17 (deleted)
filebeat   1795      filebeat    8r      REG              253,0 1364593607     655367 /plx/jboss/test7/instance-poliks/log/monitoring.json.log (deleted)
filebeat   1795 1796 filebeat    6r      REG              253,0 2382651371     655368 /plx/jboss/test7/instance-poliks/log/monitoring.json.log.2019-11-16 (deleted)
filebeat   1795 1796 filebeat    7r      REG              253,0 2521340487     655369 /plx/jboss/test7/instance-poliks/log/monitoring.json.log.2019-11-17 (deleted)
filebeat   1795 1796 filebeat    8r      REG              253,0 1364593607     655367 /plx/jboss/test7/instance-poliks/log/monitoring.json.log (deleted)
filebeat   1795 1797 filebeat    6r      REG              253,0 2382651371     655368 /plx/jboss/test7/instance-poliks/log/monitoring.json.log.2019-11-16 (deleted)
filebeat   1795 1797 filebeat    7r      REG              253,0 2521340487     655369 /plx/jboss/test7/instance-poliks/log/monitoring.json.log.2019-11-17 (deleted)

df -h .                                                    
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root   15G  7.4G  6.6G  53% /

I changed my input config to this (added close_removed: true at the end) and retried:

[filebeat@filebeat ~]$ cat /etc/filebeat/inputs.d/generic_json.yml
# json non root logs
- type: log

  enabled: true

  paths:
    - /plx/jboss/*/*/log/monitoring.json.log*

  encoding: windows-1252

  fields:
    logType: generic-json
    log.format: json
  fields_under_root: true
  max_bytes: 90000000

  #ignore_older: 0
  #scan_frequency: 1s

  close_removed: true

But the issue persists: 15 minutes later, lsof still shows the deleted files as open by Filebeat.

What is the issue? What can I do?

filebeat log shows:

2020-03-24T11:00:23.657+0100    INFO    [monitoring]    log/log.go:145  Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":1820,"time":{"ms":13}},"total":{"ticks":6660,"time":{"ms":28},"value":6660},"user":{"ticks":4840,"time":{"ms":15}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":9},"info":{"ephemeral_id":"b94ae394-89ad-46c7-9737-4b9ce9052668","uptime":{"ms":1650170}},"memstats":{"gc_next":84354352,"memory_alloc":43153184,"memory_total":303585624},"runtime":{"goroutines":64}},"filebeat":{"harvester":{"open_files":3,"running":3}},"libbeat":{"config":{"module":{"running":0},"reloads":1},"pipeline":{"clients":6,"events":{"active":4117}}},"registrar":{"states":{"current":6}},"system":{"load":{"1":0.57,"15":0.42,"5":0.58,"norm":{"1":0.19,"15":0.14,"5":0.1933}}}}}}
2020-03-24T11:00:26.663+0100    ERROR   pipeline/output.go:100  Failed to connect to redis(tcp://elastic-redis.internal.de:16380): dial tcp 172.22.19.164:16380: i/o timeout
2020-03-24T11:00:26.663+0100    INFO    pipeline/output.go:93   Attempting to reconnect to redis(tcp://elastic-redis.internal.de:16380) with 36 reconnect attempt(s)
2020-03-24T11:00:53.674+0100    INFO    [monitoring]    log/log.go:145  Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":1850,"time":{"ms":33}},"total":{"ticks":6710,"time":{"ms":52},"value":6710},"user":{"ticks":4860,"time":{"ms":19}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":9},"info":{"ephemeral_id":"b94ae394-89ad-46c7-9737-4b9ce9052668","uptime":{"ms":1680171}},"memstats":{"gc_next":84354352,"memory_alloc":44123328,"memory_total":304555768,"rss":-24576},"runtime":{"goroutines":64}},"filebeat":{"harvester":{"open_files":3,"running":3}},"libbeat":{"config":{"module":{"running":0},"reloads":1},"pipeline":{"clients":6,"events":{"active":4117,"retry":2048}}},"registrar":{"states":{"current":6}},"system":{"load":{"1":0.6,"15":0.43,"5":0.59,"norm":{"1":0.2,"15":0.1433,"5":0.1967}}}}}}

So what is the issue? What can I do so that Filebeat closes the file handle when the file is removed?

PS: Filebeat runs as an unprivileged user, but when I log in as that user and run lsof, it does see its own open files with the (deleted) marking.

I have now found this:
https://www.elastic.co/guide/en/beats/filebeat/current/faq-deleted-files-are-not-freed.html

This seems to help.
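For reference, the FAQ's suggestion boils down to bounding every harvester's lifetime with `close_timeout`, which closes the file handle after a fixed duration even while the output is blocked (`close_removed` alone cannot fire while the harvester is stuck waiting on the output). A sketch against my input above; the `5m` value is an assumption to be tuned, and note the FAQ's caveat that lines not yet shipped when a closed file is deleted can be lost:

```yaml
- type: log
  enabled: true
  paths:
    - /plx/jboss/*/*/log/monitoring.json.log*

  close_removed: true   # default anyway; only effective once the harvester can proceed
  # Force-close each harvester after 5 minutes, even if the output is blocked.
  # If the file still exists on the next scan, reading resumes from the
  # registry offset -- but data appended to an already-deleted file is gone.
  close_timeout: 5m
```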