Filebeat stopped sending logs in real time

I briefly mentioned this problem to Carlos Pérez at the OSSummit this week and he suggested I post it here.

For the past couple of weeks we have had a problem on a particular host (a VM):

  1. Only part of the logs shows up in Kibana
  2. The log entries that do make it through are all sent at the same time and share the same timestamp (they look like accumulated logs)

The log source is an Nginx log from a well-visited site. There are multiple log entries per second, yet the graph shows gaps.

This happens on the latest 5.6 release (5.6.12, if I remember correctly) as well as on the newest 6.x version. I also went back to 5.6.9 to see whether there was a regression, always with the same result.

Logstash (where the logs are sent to) and Elasticsearch are working fine; as you can see, the other log entries are correctly ingested and graphed (see the dashboard screenshot above).

I enabled debug logging in Filebeat and could only see that the logs were sent in accumulated batches at (seemingly) random times, which would explain the gaps but not the missing log entries.

On another host that also ships Nginx logs with Filebeat, the data arrives almost in real time, as expected.

The following graph shows how much data is missing now:

The graph shows the number of log entries (aggregated per day) over the past 60 days. As you can see, there is a clear drop in the count since September 11. And no, the reason is not fewer visitors; the site still gets the same kind of traffic.

Can anyone help explain what the reason could be? Are there different settings I could try to debug this issue?

This looks like weird behavior; let's gather a bit of information first:

  • What version of Logstash are you running?
  • What version of the beats input is available in LS? You can get that by running: bin/logstash-plugin list --verbose beats
  • Do you see any errors in the Filebeat logs or even in the Logstash logs around these times, more precisely connection errors or timeouts?
  • You said some Filebeat instances are fine and some are not?
  • Could you share the Filebeat configuration?

I enabled debug logging in Filebeat and could only see that the logs were sent in accumulated batches at (seemingly) random times, which would explain the gaps but not the missing log entries.

Can you describe how you got that information?

BTW Carlos is @exekias on the forum :slight_smile:

What version of Logstash are you running?

Logstash 5.6.10

What version of the beats input is available in LS

The beats input plugin was an older version, but I had already updated it before starting this thread to find out whether it was the cause. The current version is 5.1.6. The effect is the same with both the old and the updated plugin.
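
For completeness, the update itself was nothing special; roughly the standard plugin commands (shown as a sketch, the exact invocation is from memory):

bin/logstash-plugin update logstash-input-beats
bin/logstash-plugin list --verbose beats    # verify the installed version afterwards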

Do you see any errors in the Filebeat logs or even in the Logstash logs around these times, more precisely connection errors or timeouts?

That was the first thing I looked at. No, there aren't any errors logged.

You said some Filebeat instances are fine and some are not?

I might have to correct this. It looks like this issue happens generally on the filebeat-* index. I just noticed this:

This is a view of the whole filebeat-* index over the last 15 minutes. Note that the latest entries date back a couple of minutes. It also looks like there are (mostly) no more than ~2000 entries per 30 seconds.

Could you share the Filebeat configuration?

As I have currently reverted back to 5.6.5 (to see whether the issue started with an [automatic] update of the filebeat package), the current config is:

#=========================== Filebeat prospectors =============================

filebeat.prospectors:
## BEGIN -- HAProxy logging configured by Ansible
#- input_type: log
#  paths:
#    - /var/log/haproxy.log
#  document_type: haproxy
## END -- HAProxy logging configured by Ansible
# BEGIN -- Nginx logging configured by Ansible
- input_type: log
  paths:
    - /var/log/nginx/*.log
  document_type: nginx-access
  enabled: true 
# END -- Nginx logging configured by Ansible

#================================ General =====================================

# Shipper Name
name: onl-paylb01-p

#================================ Outputs =====================================

# Logstash output
# BEGIN -- Logstash output configured by Ansible
output.logstash:
  hosts: ["logstash:5044"]
# END -- Logstash output configured by Ansible

#================================ Logging =====================================

logging.level: info

The config hasn't changed, by the way; it has been like this for a couple of months. We only noticed these gaps appearing in the Nginx access graphs on our monitoring screen.
Obviously I adapted the config when I tested the newer Filebeat 6.x version.

@Napsty I've looked at the graph again with the count of events; we are going from 10 million to fewer than 2 million, which is indeed a huge drop.

  • What does your file rotation strategy look like? I am wondering whether Filebeat falls behind and, when rotation happens, never finishes reading the file.

  • Are you sending everything to the same Logstash? I am asking because I only see one host in the config.

Could you add the following to the config? This will give us a bit more detail about prospector/harvester events.

logging.selectors: [prospector, harvester]
logging.level: debug

@pierhugues Thanks for your responses so far.

What does your file rotation strategy look like?

Daily rotation, rotate 14, compress, delaycompress, postrotate: service nginx reload. It's a pretty straightforward logrotate setup, actually.
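
As a sketch, the stanza looks roughly like this (reconstructed from the options above, not copied verbatim from the server):

/var/log/nginx/*.log {
    daily
    rotate 14
    compress
    delaycompress
    postrotate
        service nginx reload
    endscript
}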

whether Filebeat falls behind and, when rotation happens, never finishes reading the file

Interestingly, I found this:

# lsof +L1
COMMAND    PID     USER   FD   TYPE DEVICE   SIZE/OFF NLINK   NODE NAME
nginx    15775 www-data   46u   REG  252,1     102400     0    326 /var/lib/nginx/proxy/5/40/0020098405 (deleted)
nginx    15775 www-data  180u   REG  252,1     102400     0    283 /var/lib/nginx/proxy/3/77/0020099773 (deleted)
nginx    15775 www-data  436u   REG  252,1      36864     0    268 /var/lib/nginx/proxy/9/31/0020099319 (deleted)
nginx    15776 www-data  298u   REG  252,1      36864     0 130528 /var/lib/nginx/proxy/4/75/0020100754 (deleted)
nginx    15776 www-data  338u   REG  252,1     167936     0 130637 /var/lib/nginx/proxy/4/98/0020100984 (deleted)
nginx    15777 www-data  471u   REG  252,1     102400     0    354 /var/lib/nginx/proxy/8/26/0020101268 (deleted)
nginx    15778 www-data  601u   REG  252,1      65536     0 130621 /var/lib/nginx/proxy/9/87/0020097879 (deleted)
filebeat 26811     root    4r   REG  252,1 1187111843     0   1070 /var/log/nginx/www.example.com.access.log.1 (deleted)
filebeat 26811     root    7r   REG  252,1 1088771185     0   1693 /var/log/nginx/www.example.com.access.log.1 (deleted)
filebeat 26811     root    8r   REG  252,1 1316786306     0   1641 /var/log/nginx/www.example.com.access.log.1 (deleted)
filebeat 26811     root    9r   REG  252,1 1121149283     0   1499 /var/log/nginx/www.example.com.access.log.1 (deleted)
filebeat 26811     root   10r   REG  252,1 1150732948     0   1378 /var/log/nginx/www.example.com.access.log.1 (deleted)
filebeat 26811     root   11r   REG  252,1 1279809331     0   1761 /var/log/nginx/www.example.com.access.log.1 (deleted)
filebeat 26811     root   12r   REG  252,1 1179537745     0   1789 /var/log/nginx/www.example.com.access.log.1 (deleted)
filebeat 26811     root   13r   REG  252,1 1238494980     0   1663 /var/log/nginx/www.example.com.access.log.1 (deleted)
filebeat 26811     root   14r   REG  252,1 1177592344     0   1661 /var/log/nginx/www.example.com.access.log.1 (deleted)
filebeat 26811     root   15r   REG  252,1 1120426538     0   1788 /var/log/nginx/www.example.com.access.log.1 (deleted)
filebeat 26811     root   16r   REG  252,1 1355659737     0   1741 /var/log/nginx/www.example.com.access.log.1 (deleted)
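
So several Filebeat harvesters are still holding the rotated (and by now deleted) access.log.1 files open. As far as I understand the Filebeat docs, the prospector options below control when such handles are released; the values are purely illustrative, not taken from our config:

  # illustrative close/clean settings for the nginx prospector (example values, not our actual config)
  close_inactive: 5m      # release a handle after the file has been idle this long
  close_renamed: false    # keep reading a file after logrotate renames it to *.log.1
  close_removed: true     # close the handle once the file is deleted
  clean_removed: true     # drop registry state for files that have disappeared from disk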

Are you sending everything to the same logstash?

Yes, at least for the Filebeat collected logs.

Could you add the following to config

Added and restarted Filebeat.

How many filebeat instances are sending to the same Logstash?

How many filebeat instances are sending to the same Logstash?

I just checked the connections to the beats listener on that particular Logstash instance using netstat, and it's a total of 12 Filebeat instances.
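
(Roughly the kind of check I mean; this assumes the beats input listens on the default port 5044 and IPv4 addresses:)

# on the Logstash host: count distinct hosts with established connections to the beats input
netstat -tn | awk '$4 ~ /:5044$/ && $6 == "ESTABLISHED" {split($5, a, ":"); print a[1]}' | sort -u | wc -l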

Edit: I also checked whether Logstash is under heavy load and telling Filebeat to back off, but that doesn't seem to be the case:

OK - Logstash seems to be doing fine.
OK: CPU usage in percent: 18
OK: Config reload syntax check
OK: Inflight events: -130
OK: Heap usage at 15.00% (166800904 out of 1065025536 bytes in use)
OK: Open file descriptors at 0.68%. (112 out of 16384 file descriptors are open)

(Yes, we're monitoring Logstash)

@pierhugues @exekias Any ideas? Is there something I should look for in the debug logs?
I can't find a single error in them.

Update, January 7th 2019: Together with Elastic support, the problem was identified in the Logstash collector. It turned out that the VM Logstash runs on was simply too weak (1 vCPU). After increasing the number of vCPUs to 4, the numbers/events are back to normal.
This was only found through trial and error with different settings on both Logstash and Filebeat. Neither Logstash nor the OS of that VM (Ubuntu 16.04) logged any bottleneck issues. Even load/CPU usage was never at 100%.
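
For anyone hitting something similar: the Logstash monitoring API is a quick way to sanity-check the collector itself (assuming the default API port 9600; adjust host/port if you changed them):

curl -s 'http://localhost:9600/_node/stats/process?pretty'   # process CPU percentage and open file descriptors
curl -s 'http://localhost:9600/_node/stats/events?pretty'    # events in/filtered/out and processing duration

In our case the numbers we did monitor never pointed at an obvious bottleneck either, which is why it took trial and error to find the undersized VM.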
