I am ingesting firewall logs into Logstash via a TCP input and am running into an issue. Event logs (e.g. VPN) and UTM logs arrive normally, but when I enable traffic logs (which jumps the rate from 20 EPS to 3k EPS), Logstash CPU usage climbs to 100%, and from time to time it loses the connection to the source and drops some logs. So far all pipelines (3 in total) run with the memory queue and behave normally; the problems begin only after I enable the sending of traffic logs.
Today there is a machine dedicated to Logstash (8 vCPU, 600 GB disk, 32 GB memory). The Logstash settings are the installation defaults, except for the JVM heap, which I increased to 8 GB. I have some questions:
If I enable the persistent queue, will that remedy this problem? If so, is there a way to enable it only for specific pipelines instead of all of them?
Logstash CPU issues are in most cases related to filter configuration or an excessive number of errors in the output.
For example, if you have a grok filter with multiple patterns that fails for the majority of events, or if you are receiving a lot of 400 errors from your Elasticsearch endpoint, both can drive up Logstash's CPU usage.
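To illustrate the grok case (the patterns and field names below are made up for the example): a grok filter that tries several patterns in order wastes the most CPU when events fail every pattern, because each failure means a full, backtracking scan of the line. Anchoring patterns with `^` and listing the most frequently matching pattern first keeps failed matches cheap:

```
filter {
  grok {
    # Illustrative patterns only. Anchoring with ^ makes a failed match
    # bail out early instead of retrying at every position in the line;
    # the most common pattern goes first so most events match on try one.
    match => {
      "message" => [
        "^%{TIMESTAMP_ISO8601:ts} %{IPORHOST:src} %{GREEDYDATA:msg}",
        "^%{SYSLOGTIMESTAMP:ts} %{GREEDYDATA:msg}"
      ]
    }
    tag_on_failure => ["_grokparsefailure"]
  }
}
```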
Since you said this happens when you enable traffic logs on your firewall, I assume you have a dedicated pipeline receiving the firewall logs. Can you share your complete Logstash configuration? Also, can you share your pipelines.yml file?
I doubt it; persistent queues are for resilience, and they can sometimes even increase CPU usage because of the extra I/O needed to write to and read from disk.
Yes, you can configure PQ per pipeline.
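For example, in pipelines.yml you can override `queue.type` for a single pipeline; anything not set there falls back to the logstash.yml defaults (pipeline ids and paths below are illustrative):

```yaml
# pipelines.yml -- queue settings can be set per pipeline
- pipeline.id: firewall-traffic
  path.config: "/etc/logstash/conf.d/firewall-traffic.conf"
  queue.type: persisted        # PQ only for this pipeline
  queue.max_bytes: 4gb
- pipeline.id: vpn-events
  path.config: "/etc/logstash/conf.d/vpn-events.conf"
  queue.type: memory           # the others stay on the memory queue
```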
It depends on the use case. There are things you can improve in Logstash first, but sometimes a Kafka cluster is needed; I would not use Redis in this case.
I enrich the IP data with Ruby, identifying the organization's origin/destination based on subnets (around 3,000), in addition to enriching it with GeoIP data (internal IPs are also handled in Ruby).
I used Ruby because it seemed a simpler option for working with an extensive dictionary (more than 3,000 subnets) and multiple enrichments without needing several filters, but I'm open to suggestions.
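For reference, the core of that kind of lookup can be sketched in plain Ruby with the stdlib `IPAddr` class (the subnet table and field values here are invented; a real setup would load the ~3,000-entry dictionary from a file once, not define it inline):

```ruby
require 'ipaddr'

# Illustrative subnet dictionary -- in practice thousands of entries
# loaded once from a file, each mapping a CIDR to enrichment metadata.
SUBNETS = {
  '192.168.0.0/24' => { 'site' => 'HQ',     'zone' => 'internal' },
  '10.10.0.0/16'   => { 'site' => 'Branch', 'zone' => 'internal' }
}.map { |cidr, meta| [IPAddr.new(cidr), meta] }

# Linear scan over all subnets: O(n) IPAddr#include? calls per lookup.
# This per-event cost is exactly what adds up at 3k EPS.
def enrich(ip)
  addr = IPAddr.new(ip)
  _, meta = SUBNETS.find { |subnet, _| subnet.include?(addr) }
  meta
end

puts enrich('192.168.0.12').inspect
```

The scan works, but every event pays for a walk over the whole table, which is the cost the replies below are pointing at.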
You cannot do it directly with a translate filter. I would start by moving the JSON parsing to the init option of the ruby filter (making sure the scope of the parsed data is right).
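A sketch of that `init` pattern (the file path, JSON structure, and field names are assumptions): the dictionary is parsed once when the filter is initialized and kept in an instance variable, instead of being re-parsed on every event:

```
filter {
  ruby {
    # Parse the subnet dictionary once per worker at filter init,
    # not per event. Path and structure are illustrative.
    init => '
      require "json"
      require "ipaddr"
      @subnets = JSON.parse(File.read("/etc/logstash/subnets.json"))
                     .map { |cidr, meta| [IPAddr.new(cidr), meta] }
    '
    code => '
      ip = event.get("[source][ip]")
      if ip
        _, meta = @subnets.find { |subnet, _| subnet.include?(IPAddr.new(ip)) }
        meta.each { |k, v| event.set("[source][#{k}]", v) } if meta
      end
    '
  }
}
```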
You are going to have to do a lot of .include? calls either way, and that is expensive. If you can build a custom MMDB database, then you can do the lookup in a geoip filter, which does caching and checks recently seen subnets first. This is a major optimization, because there are often many entries for the same IP within a short section of the log.
If you can use the custom DB to tag a subnet id to the event then you could use a translate filter to enrich additional fields based on that id.
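A hedged sketch of that second step (field names and the dictionary file are hypothetical), assuming the custom-MMDB lookup has already written a subnet id into the event:

```
filter {
  translate {
    # Enrich additional fields keyed on the subnet id that the custom
    # MMDB lookup stored in the event. Field names are illustrative.
    source          => "[subnet][id]"
    target          => "[subnet][meta]"
    dictionary_path => "/etc/logstash/subnet_meta.yml"
    fallback        => "unknown"
  }
}
```

The translate filter keeps its dictionary in memory, so this second lookup is a cheap hash access rather than another subnet scan.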
Thanks for the suggestions. I have some questions: In the case of Ruby, I preferred to use subnets because creating a dictionary with every individual IP would be very costly.
In the case of the Ruby filter, it takes the value of src or dst (e.g. 192.168.0.12) and enriches it with the relevant information if that IP falls within the corresponding subnet (e.g. 192.168.0.0/24). With a custom MMDB database, can geoip work with this same logic?
Regarding the custom database option, which one would you recommend? I've already thought about using memcached, but since I don't have the expertise, I didn't proceed.
These questions are mostly to weigh multiple alternatives, because I've been dealing with these difficulties for a while.
This may or may not be relevant to your situation but might be helpful to know when it comes to processing larger numbers of events.
In Logstash v8.6.2+, one of the underlying libraries changed its default configuration. This may not have been a problem for everyone, but it did cause a lot of confusion when larger volumes were no longer processed as they had been previously.
The default value in v8.6.2+ is confirmed as -Dio.netty.allocator.maxOrder=6 (which is 512K per thread).
Earlier versions of Logstash used a different value, which meant higher numbers of events were processed more efficiently:
-Dio.netty.allocator.maxOrder=11 (basically 16MB per thread, the older library default).
... apply the setting in the jvm.options file if you want the higher-throughput behaviour again.
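Concretely, that means adding one line to Logstash's jvm.options (the surrounding comment is mine):

```
## jvm.options -- restore the pre-8.6.2 Netty allocator chunk size
-Dio.netty.allocator.maxOrder=11
```

A restart of Logstash is needed for JVM options to take effect.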