Logstash doesn't use all CPU available


(Oliver Hernández) #1

Hi guys,

Hope you can help me.
I'm running Logstash 5.0.1 on a blade server with 40 cores and 256 GB of RAM.

I'm getting a LOT of logs into a single file (around 50,000 lines per minute) from a service running on the same server, and I'm using the metrics filter to measure throughput:

input {
  file {
    path => "/var/log/myfile.log"
    start_position => "end"
    sincedb_path => "/dev/null"
  }
}

filter {
  if [message] {
    metrics {
      meter => [ "%{host}" ]
      add_tag => "metric"
    }
  }
}

output {
  if "metric" in [tags] {
    stdout {
      codec => rubydebug
    }
  }
}

The problem is that I'm only getting around 1500 events per minute, which I think is really slow.

{
    "@version" => "1",
    "@timestamp" => "2016-11-16T03:03:21.393Z",
    "message" => "bb-warrns-ats-3",
    "bb-warrns-ats-3" => {
        "count" => 23356,
        "rate_1m" => 1537.44797520174,
        "rate_5m" => 1514.5634994611603,
        "rate_15m" => 1510.598886839833
    },
    "tags" => [
        [0] "metric"
    ]
}

CPU usage while Logstash is running doesn't go above 5%.
Memory usage doesn't go above 15%.

Does anybody have an idea what could be wrong?

Thanks in advance


(Magnus Bäck) #2

How many pipeline workers?


(Christian Dahlqvist) #3

I believe the rate numbers are the number of events per second, averaged over the period specified, so it looks like you are processing about 1,500 events per second, which equals roughly 90,000 events per minute. Logstash 5.0 has a new monitoring API, which may give you more information about processing rates.
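
If you want to take a look, a quick way to query it (assuming the API is listening on its default port, 9600) is something like:

curl -XGET 'http://localhost:9600/_node/stats?pretty'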


(Oliver Hernández) #4

Hi, I'm using "-w 40", which is the total number of processors available on the server.
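
For reference, I think the equivalent in logstash.yml (instead of passing it on the command line) would be something like:

pipeline.workers: 40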


(Oliver Hernández) #5

Hi, yes, you're right; I realised that right after I posted the thread. Thank you. I'm looking into what I can get from that API.

90,000 records per minute is a very good number, but that makes the situation even weirder: what could be causing the slowness then?


(Christian Dahlqvist) #6

How are you observing the slowness?


(Oliver Hernández) #7

Well, what I see after I start Logstash is that the date/time field in the log starts to fall behind the @timestamp. After 10 minutes of running, I get a whole minute of delay.


(Christian Dahlqvist) #8

Are you observing this by looking at the stdout output? If so, try writing to a file instead. As far as I recall, the stdout output is not very performant compared to many other outputs.
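
For example, a minimal file output could look like the sketch below (the path is just a placeholder; any writable location works):

output {
  if "metric" in [tags] {
    file {
      path => "/tmp/logstash-metrics.log"
      codec => json_lines
    }
  }
}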


(Oliver Hernández) #9

Yes, correct


(Christian Dahlqvist) #10

I edited my post as you replied. As far as I recall, writing to stdout is not very fast, and it could very well be what is slowing down processing.


(Oliver Hernández) #11

Hi, I installed Elasticsearch and I'm sending everything to it.

output {
  elasticsearch {
    hosts => "localhost"
    index => "stats"
  }
}

I'm still seeing an increasing delay between the @timestamp and the log date.


(Oliver Hernández) #12

Something tells me the problem is that the file isn't being read fast enough.

Any advice on improving the input and output?

Thanks


(Shota Ito) #13

Hi Oliver_Hernadez.

You mention that overall CPU utilization is 5%, but is a single core maxed out at 100%?
Also, are you seeing a lot of software or hardware interrupts?
If you see either of those patterns, it may be worth checking RSS (Receive Side Scaling) and RFS (Receive Flow Steering).

Load concentrating too much on a single core is a known symptom of the TCP reordering problem. If the NIC does not support multiple queues (MSI-X or RSS), its hardware interrupts are pinned to a single CPU. Interrupt handling is pinned like this because if interrupts were spread randomly across CPUs, packets would be processed in parallel and would then have to be reordered for protocols like TCP that guarantee packet order, which can degrade performance.

On the kernel side, RPS/RFS provide an RSS-equivalent function in software:

# echo "f" > /sys/class/net/ethX/queues/rx-0/rps_cpus
# echo 4096 > /sys/class/net/ethX/queues/rx-0/rps_flow_cnt
# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

I hope this helps.


(Christian Dahlqvist) #14

It looks like you now have a lot of pipeline workers but only a single output worker for Elasticsearch. You need to tune the pipeline as a whole. With this setup it is possible that the Elasticsearch output is the bottleneck.
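
If the output turns out to be the bottleneck, one thing to experiment with (a sketch, assuming the per-output workers option that the Elasticsearch output still accepts in 5.x) is running more output workers:

output {
  elasticsearch {
    hosts => "localhost"
    index => "stats"
    # Example value only; spreads indexing across several output instances.
    workers => 4
  }
}

Increasing the pipeline batch size (the -b command line flag, or pipeline.batch.size in logstash.yml) is also worth trying.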


(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.