Constant high cpu usage from Filebeat

(Gryzli) #1

Hello,

I'm using filebeat (6.4.3) to ship Apache logs (and some system ones) directly to ES.

Currently Filebeat is configured by using apache2 module and wildcard path setting, pointing to a directory with near 16 000 log files (not all of them are constant updating).

The event intensity is about 1000/s at max.

Filebeat is constantly using more than 100% cpu (more than 1 core) on a "Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz" cpu.

My current registry file is of 1,5MB size (don't know if that matters).

This is a sample snippet from my filebeat log :

===================

2019-02-12T14:18:39.945+0200 INFO [monitoring] log/log.go:141 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":102890,"time":{"ms":6569}},"total":{"ticks":997290,"time":{"ms":48492},"value":997290},"user":{"ticks":894400,"time":{"ms":41923}}},"info":{"ephemeral_id":"707a2420-9c67-4a21-b1cc-605a5aa74d78","uptime":{"ms":1450532}},"memstats":{"gc_next":70045040,"memory_alloc":41175592,"memory_total":83141794256,"rss":-7925760}},"filebeat":{"events":{"active":7768,"added":16776,"done":9008},"harvester":{"open_files":617,"running":701,"started":114},"input":{"log":{"files":{"renamed":48}}}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":10158,"active":73,"batches":272,"failed":14,"total":10245},"read":{"bytes":216823,"errors":3},"write":{"bytes":8552477}},"pipeline":{"clients":20,"events":{"active":4122,"filtered":6364,"published":10174,"retry":64,"total":16544},"queue":{"acked":6112}}},"registrar":{"states":{"current":7239,"update":9008},"writes":{"success":1130,"total":1130}},"system":{"load":{"1":9.57,"15":11.48,"5":10.86,"norm":{"1":0.3988,"15":0.4783,"5":0.4525}}}}}}

===================

I've also tried to do httprof profiling and I'm attaching the result of 30s profile (as a png).

(Nathan Skerl) #2

Just wanted to ref my own issue here, as I saw a very similar CPU pattern with a high number of current files. I see the same in your output. I was able to resolve my CPU issue by reducing the number of stale files in the registry, but I still don't understand why that helped.

(Gryzli) #3

Hi and thanks for your feedback !

Could you share, how did you managed to reduce the stale files in the registry ?

(Nathan Skerl) #4

We deleted them from the watched directory. That seemed to allow clean_removed to release them. The current count drops almost immediately after we purged them, and the CPU came down with it.

(Gryzli) #5

That's not applicable in my case, because these are Apache domains access logs, which are needed. The problem is that most of them are rarely updated or the update rate is pretty low, but there's no way to delete them.

(Gryzli) #6

Actually you are right, that the registry file is the main reason for the high cpu usage.

I tried to stop the apache2 module on a single server, where Filebeat was making the most CPU , and after doing it , there was almost no impact on the CPU (the registry stayed the same). Filebeat was consuming about 1300 cpu minutes per 24h on the same server.

Yesterday I decided to try and delete the registry file (while apache2 module is still disabled). After restarting , the CPU for 20 hours is approximately 13 cpu minutes (which is almost 100x times lower).

Isn't this reported as a bug ? Because for me such behavior is totally broken.

(Nathan Skerl) #7

I agree. I don't know if its a bug or just how filebeat works.

The documentation here suggests that after 5 minutes (default), the inactive files should be closed, but that doesn't mean filebeat stops tracking them:

If the closed file changes again, a new harvester is started and the latest changes will be picked up after scan_frequency has elapsed.

So, to derive if the "file changes again" it seems it has to inspect the file. I can see this happening in the debug logs:

|2019-02-14T08:27:57.485-0800|DEBUG|[prospector]|log/prospector.go:361|Check file for harvesting: d:\logs\iis\W3SVC3\u_ex19012920_x.log|
|2019-02-14T08:27:57.485-0800|DEBUG|[prospector]|log/prospector.go:447|Update existing file for harvesting: d:\logs\iis\W3SVC3\u_ex19012920_x.log, offset: 173310|
|2019-02-14T08:27:57.485-0800|DEBUG|[prospector]|log/prospector.go:501|File didn't change: d:\logs\iis\W3SVC3\u_ex19012920_x.log|
|2019-02-14T08:28:08.919-0800|DEBUG|[prospector]|log/prospector.go:361|Check file for harvesting: d:\logs\iis\W3SVC3\u_ex19012920_x.log|
|2019-02-14T08:28:08.919-0800|DEBUG|[prospector]|log/prospector.go:447|Update existing file for harvesting: d:\logs\iis\W3SVC3\u_ex19012920_x.log, offset: 173310|
|2019-02-14T08:28:08.919-0800|DEBUG|[prospector]|log/prospector.go:501|File didn't change: d:\logs\iis\W3SVC3\u_ex19012920_x.log|
|2019-02-14T08:28:19.931-0800|DEBUG|[prospector]|log/prospector.go:361|Check file for harvesting: d:\logs\iis\W3SVC3\u_ex19012920_x.log|
|2019-02-14T08:28:19.931-0800|DEBUG|[prospector]|log/prospector.go:447|Update existing file for harvesting: d:\logs\iis\W3SVC3\u_ex19012920_x.log, offset: 173310|
|2019-02-14T08:28:19.931-0800|DEBUG|[prospector]|log/prospector.go:501|File didn't change: d:\logs\iis\W3SVC3\u_ex19012920_x.log|
|2019-02-14T08:28:31.608-0800|DEBUG|[prospector]|log/prospector.go:361|Check file for harvesting: d:\logs\iis\W3SVC3\u_ex19012920_x.log|
|2019-02-14T08:28:31.608-0800|DEBUG|[prospector]|log/prospector.go:447|Update existing file for harvesting: d:\logs\iis\W3SVC3\u_ex19012920_x.log, offset: 173310|
|2019-02-14T08:28:31.608-0800|DEBUG|[prospector]|log/prospector.go:501|File didn't change: d:\logs\iis\W3SVC3\u_ex19012920_x.log|

That file hasn't been touched in > 2 weeks. However, every 10s (the default scan_frequency) filebeat is inspecting it. Even if it's a small operation, inspecting a large number of files every 10s could account for the cpu pattern we are seeing.

clean_inactive appears to remove the file from the registrar completely, but I dont like that if the file is updated after being cleaned it is read from the beginning. I guess that is useful if you know for sure the file wont be updated.

In my case I think deleting the file and relying on clean_removed is safer.

In your case, since you can't delete, we might be better off adjusting the scan_frequency to poll every minute or so instead of the default 10s. There is also max_procs but I have not experimented with that yet.

Let me know what you find out!

(J Teekaram Prasath) #8

Did you specified log rotateeverybyte and files to keep in filebeat.yml.

(system) closed #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.