Filebeat performance stall sometimes

Hello anyone!
I have a some problems with a Filebeat. I collecting logs from a 2 files and send it to a Logstash. Events rate decreasing after a few hours of work. Events rate becomes normal after restart filebeat or connection to Logstash was reset. Normal event rate is a 250 e/s, stalled - ~50 e/s.

Environment:

  • CentOS 8
  • Filebeat 7.6.0

Filebeat configuration:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/nginx/access.log*
  exclude_files: ['.gz$']
  fields_under_root: true
  fields:
    source_type: nginx_access
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
setup.template.settings:
  index.number_of_shards: 1
setup.kibana:
output.logstash:
  hosts: ["elk.local:5043"]
monitoring.enabled: true
monitoring.cluster_uuid: "a7Povx8cTK-IN_uQpDC8kA"
monitoring.elasticsearch:
  hosts: ["elk.local:9200"]

Event rate with comments:

Memory usage with comments:

Memory usage and event rate in a connection reset:

Hi @GhOsTMZ, welcome to discuss :slightly_smiling_face:

Do you see any error in filebeat or logstash logs?

Hello @jsoriano!
Unfortunately, logs are clear, only metrics in a log and no one error.

Could you share your filebeat configuration?

Filebeat configuration in a first message. But I can duplicate it here:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/nginx/access.log*
  exclude_files: ['.gz$']
  fields_under_root: true
  fields:
    source_type: nginx_access
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
setup.template.settings:
  index.number_of_shards: 1
setup.kibana:
output.logstash:
  hosts: ["elk.local:5043"]
monitoring.enabled: true
monitoring.cluster_uuid: "a7Povx8cTK-IN_uQpDC8kA"
monitoring.elasticsearch:
  hosts: ["elk.local:9200"]

In additional: it so strange, early filebeat worked properly not more than 6 hours. But it works more than 12 hours after last incident. I'll try to reproduce this behavior and get a syscalls by a strace. I hope it will be useful.

Filebeat worked properly only 1 hour after restarted. I collected syscalls for a normal and stalled working.

strace outputs: Normal, Stalled

Oops sorry :man_facepalming: thanks for sending it again, along with the strace outputs. Could you also send the debug logs of filebeat?

I have found this issue reported in Github that could lead to similar synthoms, it might be the same issue https://github.com/elastic/beats/issues/16335.

@GhOsTMZ what version of logstash are you using?

Version 7.4.2.

Isn't a problem but I should wait when issue will be reproduced.

I'll try to update logstash to actual version. Maybe issue will gone. Or no chance?

Hello,
Debug logs here. This logs covers too much period. Interesting period from 2020-08-03 09:30 to 2020-08-03 11:09. I can remove unactual lines. Let me know if you want it.

Also log file is filtered by egrep -h '^[0-9]+.*[A-Z]{4,}' filebeat{.1,} | grep -v 'Publish event'. I think you don't needs a events in this log file.

Thanks for sending the debug logs.

I don't see anything specially bad, but something called my attention. Starting on ~9:30, there are many batches sent with 2048 events:

2020-03-08T09:30:05.370Z	DEBUG	[logstash]	logstash/async.go:159	2048 events out of 2048 events sent to logstash host elk.local:5043. Continue sending

2048 is the default maximum batch size. Is it possible that starting on 9:30 your nginx server is generating more logs? It could be that something in the pipeline cannot cope with this load and gets saturated, affecting perfomance.

You could try to increase the max bulk size and the in-memory queue size and see if there is any improvement.

...
queue.mem:
  events: 16384
output.logstash:
  hosts: ["elk.local:5043"]
  bulk_max_size: 4096
...

If you are also monitoring logstash, please also check its metrics to see if it could be constrained of some resource. You may also need to increase the number of workers or max batch size in logstash.

If you are monitoring Elasticsearch, it could be also interesting to see the ingest rate during the periods when filebeat is having low performance.

Yes, it is. Sometimes (every 10 minutes) we have a request rate spikes (1500-4000k RPS). But not all times performance is affected. Before using filebeat we made some tests and got more than 6k event/sec performance in a default configuration.

So, I found interesting things in my debug logs:

Some stats
grep async.go beat.debug | sed -r 's/:[0-9]{2}\.[0-9]{3}Z//' | awk '{print $1}' | uniq -c
# Normal working
     15 2020-03-08T08:55
     25 2020-03-08T08:56
     35 2020-03-08T08:57
     34 2020-03-08T08:58
     37 2020-03-08T08:59
     42 2020-03-08T09:00
     26 2020-03-08T09:01
     33 2020-03-08T09:02
     29 2020-03-08T09:03
     26 2020-03-08T09:04
     36 2020-03-08T09:05
     25 2020-03-08T09:06
     35 2020-03-08T09:07
     30 2020-03-08T09:08
     37 2020-03-08T09:09
     36 2020-03-08T09:10
     32 2020-03-08T09:11
     24 2020-03-08T09:12
     31 2020-03-08T09:13
     29 2020-03-08T09:14
     32 2020-03-08T09:15
     28 2020-03-08T09:16
     33 2020-03-08T09:17
     30 2020-03-08T09:18
     27 2020-03-08T09:19
     39 2020-03-08T09:20
     22 2020-03-08T09:21
     30 2020-03-08T09:22
     28 2020-03-08T09:23
     32 2020-03-08T09:24
     35 2020-03-08T09:25
     27 2020-03-08T09:26
     34 2020-03-08T09:27
     37 2020-03-08T09:28
     37 2020-03-08T09:29
# Look here
     10 2020-03-08T09:30
      4 2020-03-08T09:31
      3 2020-03-08T09:32
      2 2020-03-08T09:33
      3 2020-03-08T09:34
      1 2020-03-08T09:35
      3 2020-03-08T09:36
      2 2020-03-08T09:37
      1 2020-03-08T09:38
      3 2020-03-08T09:39
      2 2020-03-08T09:40
      2 2020-03-08T09:41

It's a normal behavior?

Yes, I monitoring logstash too. And I don't see anomalies in a graphs. But I'll try to increase workers count.

PS: thank you for helping!

Unfortunately, no changes after using this. I think my final solution is restarting filebeat every hour...

Curious thing. I tried to compare nginx log size with a filebeat registy. I used this command for it:

while true; do clear; (LANG=C date; cat /var/lib/filebeat/registry/filebeat/data.json | jq '.[] as $src | ($src.source, $src.offset)' | tr -d '"' | awk 'BEGIN{cnt=-1}{if(NR%2==1){cnt++;fname[cnt]=$0}else{cmd="ls -la " fname[cnt] " | cut -f5 -d\" \""; while((cmd | getline line)>0){fsize[cnt]=line}close(cmd);offset[cnt]=$0}}END{for(i=0;i<=cnt;i++){if(match(fname[i],".gz$")==0){printf("\t%s\n\t\tPosition: %d\n\t\tFile size: %s\n\t\tUnreaded: %d\n\n",fname[i],offset[i],fsize[i],fsize[i]-offset[i])}}}') | tee -a ~/filebeat_pos.log; sleep 1; done

When filebeat becomes stall I see this:

Fri Mar 27 14:48:58 GMT 2020                                                                                                                                                                                                                 
        /var/log/nginx/access.log-20200327                                                                                                                                                                                                   
                Position: 17186877614                                                                                                                                                                                                        
                File size: 17186877614                                                                                                                                                                                                       
                Unreaded: 0                                                                                                                                                                                                                  
                                                                                                                                                                                                                                             
        /var/log/nginx/access.log                                                                                                                                                                                                            
                Position: 5027051158                                                                                                                                                                                                         
                File size: 5071950384                                                                                                                                                                                                        
                Unreaded: 44899226

File size and position (offset in a filebeat registry) is equal (or difference is no more than 1M) in a normal working.

I still trying to resolve issue without restarting beats every hour and I found next curious thing. This command shows events sent and acked events by a logstash:

egrep 'async|acker' beat.debug | awk '{if($3=="[logstash]"){print "Sent: " $5 "/" $9}else{sub("}", "", $8); print "Acked: " $8 "\n"}}'

And what I watching there:

Output
# Normal working
Sent: 454/454
Acked: 454

Sent: 231/231
Acked: 231

Sent: 459/459
Acked: 459

Sent: 232/232
Acked: 232

Sent: 699/699
Acked: 699

# Issue started here
Sent: 2048/2048
Sent: 1806/1806
Sent: 242/242
Acked: 2048

Sent: 2048/2048
Acked: 1806

Sent: 1806/1806
Acked: 242

Sent: 242/242
Acked: 2048

Sent: 2048/2048
Acked: 1806

Sent: 1806/1806
Acked: 242

This is a logstash or beats issue?

PS: this logs from here

# Issue started here
Sent: 2048/2048
Sent: 1806/1806
Sent: 242/242
Acked: 2048

Yes, this is why I suggested to try to increase the bulk_max_size, that could be involved here, and suspiciously defaults to 2048. Did you see any difference in these values after increasing bulk_max_size?

There is a possible bug in the logstash output that can make a beat to continuously retry to send the same batch of events if any of the events is rejected. I created an issue for that some time ago, but we are not sure of the conditions when it happens https://github.com/elastic/beats/issues/11732. In any case if this is the issue we should see something about failed events in filebeat or logstash logs.

Would it be an option for you to try to send events directly from filebeat to Elasticsearch? Without logstash? So we can confirm if this is an issue with filebeat, or with logstash or the logstash output.

No changes. I tried to increase bulk_max_size to 4096 and no changes.

Unfortunately I cannot do it because I using ClickHouse instead of ES.

But I found a curious thing again. I disabled compression (I tried to capture traffic) in filebeat on one of my servers and problem is gone. 4 days without problems.

1 Like