We are using Filebeat as a DaemonSet on our Kubernetes nodes to collect all application logs and send them to Logstash and then on to Elasticsearch.
In our prod environment, some pods write a huge volume of events, around ~1000/sec. Filebeat is not able to ship all of the events from those pods to Logstash; we see that around 70% of the logs from those pods are missing.
In the Filebeat logs I also see that the file being harvested is closed once close_inactive (5m) is reached, even though new events are continuously being written to that file (log below). Filebeat then starts harvesting the same file again only after ~20 minutes, and regardless of new data being written it closes it again after 5 minutes.
Can someone please help me to resolve this issue and tune filebeat to get all logs?
We are using filebeat version 6.0.0
2020/12/08 01:51:17.027070 harvester.go:232: INFO File is inactive: /var/lib/docker/containers/4f889cc75bfda6549d5200317a787d0fca4bc1256f078a6/4f889cc75bfda6549d5200317a787d0fca4bc1256f078a6-json.log. Closing because close_inactive of 5m0s reached.
@jsoriano It might take some time for us to upgrade to the latest version, as our patterns might break. We are using the log field as json.message_key, and it looks like the latest version no longer supports that and it has been renamed to message. In the meantime, can you please help me identify the reason for this issue?
For the service whose logs we are missing, log rotation happens every 20-30 minutes (the original file is renamed to log.1, log.2 ... and so on once the 10 MB limit is reached on each individual file).
It seems to be a log-volume issue, because for this particular service there are no missing logs in lower environments; it happens only in production. The volume difference is roughly ~10k events/15 min in lower environments versus ~100k/15 min in prod. Filebeat is sending logs from the other pods on the same node fine and only has an issue with this one particular pod.
Do you think an upgrade would fix this issue, or do we need to increase some Filebeat settings such as queue.mem.events, bulk_max_size, resources, or something else?
It can help, a lot of things have changed since 6.0.0.
It'd be important to identify the bottleneck in this scenario. It may be a problem of backpressure: if Elasticsearch or Logstash are not able to process all these events, Filebeat will wait, and in the meantime files can be rotated.
Try increasing the close_timeout as mentioned by @sandeepkanabar if you see that files are being closed because of this.
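For example, assuming a 6.x prospector-style config, that would look roughly like this (the paths glob and the time values are only placeholders, adjust them to your rotation interval):
filebeat.prospectors:
  - type: log
    paths:
      - /var/lib/docker/containers/*/*-json.log
    # keep harvesters open longer than the rotation interval;
    # these values are examples, not a recommendation
    close_inactive: 2h
    close_timeout: 3h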
You may also try to increase the number of events in the mem queue, so Filebeat has a bigger buffer to handle peaks of logs.
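For the memory queue that would be something along these lines (the numbers are only an example, size them to the memory you give the Filebeat pod):
# the 6.x default queue is much smaller (4096 events); a bigger queue absorbs bursts
queue.mem:
  events: 65536
  flush.min_events: 2048
  flush.timeout: 1s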
If the problem is that files are being rotated before they can be sent, you can try the disk queue, which can give you an additional buffer when the problem is in the output. But this is only available since Filebeat 6.3, and it has had several changes and enhancements since then.
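In 6.x this is the (beta) file spool queue configured under queue.spool; the exact settings have changed in later releases, so treat this only as a sketch and check the reference for your version:
# spool events to disk so output backpressure does not stall the harvesters
queue.spool:
  file:
    path: "${path.data}/spool.dat"
    size: 512MiB
    page_size: 16KiB
  write:
    buffer_size: 10MiB
    flush.timeout: 1s
    flush.events: 1024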
Also check whether Filebeat has spare CPU; if it does, you may try to increase the number of workers of the Logstash output. But be careful with this: it won't help if Logstash or your Elasticsearch cluster cannot cope with the load.
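For the Logstash output that would be along these lines (placeholder host and numbers; only raise them if Logstash can actually keep up):
output.logstash:
  hosts: ["logstash:5044"]   # placeholder
  loadbalance: true
  worker: 4                  # parallel connections per host
  bulk_max_size: 4096        # events per batch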
@jsoriano @sandeepkanabar thanks for your responses. I debugged this issue a little more and here are my findings.
I removed all the other pods running on this node and kept only the pod having the issue. Below is the file rotation frequency. What I observe is that Filebeat only sends logs around the times the files were last modified, i.e. at 22:08, 21:25, etc., and even then it only sends a few events from the bottom of the file and ignores all the rest. Even those few are delayed by 10-15 minutes. I am not sure why Filebeat is ignoring all the other events.
I increased close_inactive to 120m but that didn't help. Filebeat has enough memory and CPU.
-rw-r----- 1 root root 2295951 Dec 10 22:37 abc123-json.log
-rw-r----- 1 root root 10000070 Dec 10 22:32 abc123-json.log.1
-rw-r----- 1 root root 10000542 Dec 10 22:08 abc123-json.log.2
-rw-r----- 1 root root 10000019 Dec 10 21:46 abc123-json.log.3
-rw-r----- 1 root root 10000423 Dec 10 21:25 abc123-json.log.4
-rw-r----- 1 root root 10000049 Dec 10 21:01 abc123-json.log.5
-rw-r----- 1 root root 10000151 Dec 10 20:38 abc123-json.log.6
-rw-r----- 1 root root 10000393 Dec 10 20:14 abc123-json.log.7
-rw-r----- 1 root root 10000107 Dec 10 19:50 abc123-json.log.8
-rw-r----- 1 root root 10000114 Dec 10 19:26 abc123-json.log.9
Did you try any of the other options? If Filebeat has enough memory and CPU, you could try increasing the number of events in the queue and the number of workers in the output and see if it helps.
Could you check whether the bottleneck is in Logstash or Elasticsearch?
In addition to Jaime's answer, could you try with the default Filebeat config: remove queue.mem etc., and in the Logstash output use the defaults as well, i.e. skip worker, bulk_max_size, and so on.
In effect, begin with the defaults; that should help narrow down the bottleneck @user2416.
@jsoriano @sandeepkanabar I removed the Logstash output and am writing directly to a file; it is not working when writing to a file either.
What I observed is that when I remove the multiline pattern from the Filebeat config it works fine; if I add the multiline pattern, the output becomes sporadic. Here is my multiline pattern and some sample logs.
We are not seeing any issue with logs that match the multiline pattern, but of the logs that do not match the multiline pattern, almost ~70% are missing.
They are not missing 100% of the time, but when the volume is huge Filebeat drops a bunch of logs.
Please let me know how I can fix this; we have both types of logs in the same path.
Can you try using multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}' and see if that helps?
Also, you can try adding the line json.add_error_key: true after json.keys_under_root: true. This will help you see whether there are any JSON errors for those skipped lines.
Additionally, check the filebeat log at /var/log/filebeat/filebeat.log or whichever path you've set.
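Putting those together, the relevant part of the prospector config would look roughly like this (the paths glob and the json.message_key: log line are based on what you described earlier, and the negate/match settings assume your events start with a timestamp, so adjust as needed):
filebeat.prospectors:
  - type: log
    paths:
      - /var/lib/docker/containers/*/*-json.log
    json.message_key: log
    json.keys_under_root: true
    json.add_error_key: true    # adds an error field when a line cannot be decoded as JSON
    multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    multiline.negate: true      # lines that do NOT start with a date...
    multiline.match: after      # ...are appended to the previous event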