Logstash doesn't use all CPU available


(Oliver Hernández) #1

Hi guys,

Hope you can help me.
I'm running Logstash 5.0.1 on a blade server with 40 cores and 256 GB of RAM.

I'm getting a LOT of logs into a single file (around 50,000 lines per minute) from a service running on the same server, and I'm using the metrics filter to measure throughput:

input {
  file {
    path => "/var/log/myfile.log"
    start_position => "end"
    sincedb_path => "/dev/null"
  }
}

filter {
  if [message] {
    metrics {
      meter => [ "%{host}" ]
      add_tag => "metric"
    }
  }
}

output {
  if "metric" in [tags] {
    stdout {
      codec => rubydebug
    }
  }
}

The problem is that I'm only getting around 1500 events per minute, which I think is really slow.

{
    "@version" => "1",
    "@timestamp" => "2016-11-16T03:03:21.393Z",
    "message" => "bb-warrns-ats-3",
    "bb-warrns-ats-3" => {
        "count" => 23356,
        "rate_1m" => 1537.44797520174,
        "rate_5m" => 1514.5634994611603,
        "rate_15m" => 1510.598886839833
    },
    "tags" => [
        [0] "metric"
    ]
}

CPU usage while Logstash is running doesn't go above 5%.
Memory usage doesn't go above 15%.

Does anybody have an idea what could be wrong?

Thanks in advance


(Magnus Bäck) #2

How many pipeline workers?


(Christian Dahlqvist) #3

I believe the rate numbers are the number of events per second, averaged over the period specified, so it looks like you are processing about 1,500 events per second, which equals roughly 90,000 events per minute. Logstash 5.0 has a new monitoring API, which may give you more information about processing rates.
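
If you want to take a look, a quick way to query it (assuming the API is listening on its default port, 9600) is something like:

curl -XGET 'http://localhost:9600/_node/stats?pretty'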


(Oliver Hernández) #4

Hi, I'm using "-w 40", which is the total number of processors available on the server.
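
For reference, I think the equivalent in logstash.yml (instead of passing it on the command line) would be something like:

pipeline.workers: 40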


(Oliver Hernández) #5

Hi, yes, you're right; I realised that right after I posted the thread. Thank you. I'm looking into what I can get from that API.

90,000 records per minute is a very good number, but that makes the situation even weirder: what could be causing the slowness then?


(Christian Dahlqvist) #6

How are you observing the slowness?


(Oliver Hernández) #7

Well, what I see after I start Logstash is that the date/time field in the log starts to fall behind the @timestamp. After 10 minutes of running, I get a whole minute of delay.


(Christian Dahlqvist) #8

Are you observing this by looking at the stdout output? If so, try writing to a file instead. As far as I recall, the stdout output is not very performant compared to many other outputs.
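
For example, a minimal file output could look like the sketch below (the path is just a placeholder; any writable location works):

output {
  if "metric" in [tags] {
    file {
      path => "/tmp/logstash-metrics.log"
      codec => json_lines
    }
  }
}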


(Oliver Hernández) #9

Yes, correct


(Christian Dahlqvist) #10

I edited my post as you replied. As far as I recall, writing to stdout is not very fast, and it could very well be what is slowing down processing.


(Oliver Hernández) #11

Hi, I installed Elasticsearch and I'm sending everything to it.

output {
  elasticsearch {
    hosts => "localhost"
    index => "stats"
  }
}

I'm still seeing an increasing delay between the @timestamp and the log date.


(Oliver Hernández) #12

Something tells me the problem is that the file isn't being read fast enough.

Any advice on improving the input and output?

Thanks


(Shota Ito) #13

Hi Oliver_Hernadez.

You mention that overall CPU utilization is 5%, but is a single core maxed out at 100%?
Also, are you seeing a lot of software or hardware interrupts?
If you see either of those patterns, it may be worth checking RSS (Receive Side Scaling) and RFS (Receive Flow Steering).

Load concentrating too much on a single core is a known symptom of the TCP reordering problem. If the NIC does not support multiple queues (MSI-X or RSS), its hardware interrupts are pinned to a single CPU. Interrupt handling is pinned like this because if interrupts were spread randomly across CPUs, packets would be processed in parallel and would then have to be reordered for protocols like TCP that guarantee packet order, which can degrade performance.

On the kernel side, RPS/RFS provide an RSS-equivalent function in software:

# echo "f" > /sys/class/net/ethX/queues/rx-0/rps_cpus
# echo 4096 > /sys/class/net/ethX/queues/rx-0/rps_flow_cnt
# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

I hope this helps.


(Christian Dahlqvist) #14

It looks like you now have a lot of pipeline workers but only a single output worker for Elasticsearch. You need to tune the pipeline as a whole. With this setup it is possible that the Elasticsearch output is the bottleneck.
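
If the output turns out to be the bottleneck, one thing to experiment with (a sketch, assuming the per-output workers option that the Elasticsearch output still accepts in 5.x) is running more output workers:

output {
  elasticsearch {
    hosts => "localhost"
    index => "stats"
    # Example value only; spreads indexing across several output instances.
    workers => 4
  }
}

Increasing the pipeline batch size (the -b command line flag, or pipeline.batch.size in logstash.yml) is also worth trying.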


(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.