Filebeat crashing repeatedly with out-of-memory error on first use


(Richardde Jong) #1

Hi,

I have a fresh ELK stack based on the containers from Elastic. I am using Filebeat to send logs to it. I have Filebeat monitor all files in /var/log. When it first tries to read all those files and send them to Logstash, it crashes repeatedly and uses enormous amounts of memory (6-10G depending on what is available).

It looks like it tries to read all file content into memory and only then starts sending it to Logstash.
Each run loads some data into Logstash, but it doesn't keep track of its progress, so the next run inserts the same messages again.

Top output just before it crashed:

top - 14:37:23 up  1:06,  3 users,  load average: 1.82, 0.69, 0.57
Tasks: 193 total,   5 running, 188 sleeping,   0 stopped,   0 zombie
%Cpu(s): 37.1 us, 42.0 sy,  0.0 ni,  0.3 id, 17.9 wa,  0.0 hi,  2.7 si,  0.0 st
KiB Mem : 16268948 total,   135560 free, 16012860 used,   120528 buff/cache
KiB Swap:  1679356 total,   432792 free,  1246564 used.    87600 avail Mem
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
25995 root      20   0 8765300 6.000g    128 S 110.4 38.7   0:29.20 filebeat

When running as a service (so under filebeat-god), it keeps restarting after the out-of-memory error, which leads to lots of duplicates in Elasticsearch and means it never finishes.

My config:

filebeat.prospectors:

- input_type: log
  paths:
    - /var/log/*
  exclude_files: [".gz$"]

- input_type: log
  paths:
    - /var/log/httpd/*
  exclude_files: [".gz$"]

- input_type: log
  paths:
    - /home/jboss/wildfly/standalone/log/server*
  exclude_files: [".gz$"]
  multiline.pattern: '^[[:space:]]+|^Caused by:|^$'
  multiline.negate: false
  multiline.match: before

- input_type: log
  paths:
    - /home/jboss/jboss-as/standalone/log/server*
  exclude_files: [".gz$"]
  multiline.pattern: '^[0-9]{2}\:' # Lines starting with two digits followed by a colon, e.g. "10:"
  multiline.negate: true
  multiline.match: after

filebeat.spool_size: 2048
filebeat.publish_async: true

output.logstash:
  hosts: ["logging:5044"]

logging.level: info
logging.to_files: true
logging.files:
  rotateeverybytes: 104857600 # = 100MB
  keepfiles: 7

Running it with debug output shows the following:

[root@logging ~]# /usr/share/filebeat/bin/filebeat -e -v -d "*" -c /etc/filebeat/filebeat.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data /var/lib/filebeat -path.logs /var/log/filebeat
2016/11/28 13:36:55.116950 beat.go:264: INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2016/11/28 13:36:55.116988 beat.go:174: INFO Setup Beat: filebeat; Version: 5.0.1
2016/11/28 13:36:55.117000 processor.go:43: DBG  Processors:
2016/11/28 13:36:55.117010 beat.go:180: DBG  Initializing output plugins
2016/11/28 13:36:55.117021 logp.go:219: INFO Metrics logging every 30s

<snip>

2016/11/28 13:36:55.127228 prospector_log.go:79: DBG  Prospector states cleaned up. Before: 42, After: 42
2016/11/28 13:36:55.171552 spooler.go:118: DBG  Flushing spooler because spooler full. Events flushed: 2048
2016/11/28 13:36:55.184512 client.go:184: DBG  Publish: {
  "@timestamp": "2016-11-28T13:36:55.126Z",
  "beat": {
    "hostname": "logging.infra.devcloud",
    "name": "logging.infra.devcloud",
    "version": "5.0.1"
  },
  "input_type": "log",
  "message": "Nov 28 14:15:02 template01 polkitd[877]: Registered Authentication Agent for unix-process:25329:262281 (system bus name :1.59 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8)",
  "offset": 35686,
  "source": "/var/log/secure",
  "type": "log"
}

<snip 2046 entries>

2016/11/28 13:37:26.590023 client.go:184: DBG  Publish: {
  "@timestamp": "2016-11-28T13:37:00.699Z",
  "beat": {
    "hostname": "logging.infra.devcloud",
    "name": "logging.infra.devcloud",
    "version": "5.0.1"
  },
  "input_type": "log",
  "message": "Sep 15 01:36:47 localhost dbus[634]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'",
  "offset": 2374720,
  "source": "/var/log/messages",
  "type": "log"
}
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x980692, 0x16)
        /usr/local/go/src/runtime/panic.go:566 +0x95

I am using the latest versions:
[root@logging ~]# /usr/share/filebeat/bin/filebeat -version
filebeat version 5.0.1 (amd64), libbeat 5.0.1
[root@logging ~]# docker exec logstash env|grep LOGSTASH_VERSION
LOGSTASH_VERSION=1:5.0.1-1


(Steffen Siering) #2

does it work if you set:

filebeat.publish_async: false

?


(Richardde Jong) #3

Hi Steffen,

I'm afraid that does not help. It crashed again after a few seconds. Last log output:

2016/11/28 16:45:57.577997 sync.go:107: DBG  Try to publish 2048 events to logstash with window size 619
2016/11/28 16:45:57.962668 window.go:51: DBG  update current window size: 619
2016/11/28 16:45:57.962696 window.go:55: DBG  set to max ok window size: 825
2016/11/28 16:45:57.962718 sync.go:78: DBG  619 events out of 2048 events sent to logstash. Continue sending
2016/11/28 16:45:57.962727 sync.go:107: DBG  Try to publish 1429 events to logstash with window size 825
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x980692, 0x16)
        /usr/local/go/src/runtime/panic.go:566 +0x95
runtime.sysMap(0xc586780000, 0xb38f0000, 0x7fce8bc3cf00, 0xcc8418)
        /usr/local/go/src/runtime/mem_linux.go:219 +0x1d0
runtime.(*mheap).sysAlloc(0xcaec20, 0xb38f0000, 0x7fce00000001)

(Richardde Jong) #4

It seems I may have found the culprit. I was hoping to ship all files in /var/log, but Filebeat was choking on the "lastlog" file, which turns out to be a "sparse file"; see also http://www.noah.org/wiki/Lastlog_is_gigantic:

[root@logging log]# ls -la /var/log/lastlog
-rw-r--r--. 1 root root 159607201460 Nov 28 20:27 /var/log/lastlog
[root@logging log]# ls -lah /var/log/lastlog
-rw-r--r--. 1 root root 149G Nov 28 20:27 /var/log/lastlog
[root@logging log]# du -sh /var/log/lastlog
32K     /var/log/lastlog
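
For anyone hitting the same wall: sparse files like this can be spotted up front by comparing apparent size with actual disk usage. A small sketch using GNU find's `%S` (sparseness ratio) format directive, assuming GNU findutils; the 0.5 threshold is my own arbitrary cutoff:

```shell
# Print files whose disk usage is far below their apparent size (likely sparse).
# %S = sparseness ratio (disk blocks / apparent size); near 0 means very sparse.
find /var/log -type f -printf '%S\t%s\t%p\n' | awk '$1 < 0.5 { print $3, "apparent bytes:", $2 }'
```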

I now excluded it in the prospector and Filebeat can proceed.
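
For reference, the exclusion in my first prospector now looks roughly like this (the exact pattern is my own choice, not an official recommendation):

```yaml
- input_type: log
  paths:
    - /var/log/*
  exclude_files: [".gz$", "lastlog$"]
```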

Are there any more or less complete Filebeat configuration examples that cover how to best monitor a standard CentOS 7 server? So far I've only found bits and pieces, and I am still struggling with interpreting standard syslog files, auditd files, etc. On the other hand, all this struggling improves my understanding of the tools...


(Steffen Siering) #5

Filebeat parses a file line by line and tries not to keep too much data in main memory.

Growing memory usage can be triggered by a combination of settings, such as multiline + publish_async + big queue sizes.

I guess in your case too many log lines piled up in main memory due to publish_async. Since the other log files are not updated that often, removing lastlog seems to solve the issue, but it may not really be solved, just mitigated by not having this many log lines that need to be sent. It sounds like lastlog gets updated too often.

Still failing with filebeat.publish_async: false might indicate a problem with your multiline filters piling up too many lines.
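
If it helps, memory usage can also be bounded more explicitly. A sketch of the relevant knobs in Filebeat 5.x (the values here are illustrative, not tuned recommendations):

```yaml
filebeat.spool_size: 1024        # fewer events buffered before flushing
filebeat.publish_async: false    # apply backpressure instead of buffering ahead

# per multiline prospector:
multiline.max_lines: 500         # cap lines merged into a single event
multiline.timeout: 5s            # flush a pending multiline event after 5s
```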


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.