Very high CPU load (how to interpret output of "top"?)


(Peter) #1

I am using Logstash to send logfiles via the Gelf plugin to Graylog.

My config is fairly complex, with lots of grok filters.

When my system is under load the messages queue up for DAYS.

I have attached the output of "top".

  • How do I interpret "[main]>worker0" - and what does "[main]<file" mean?
  • I guess a constant 100% CPU load per worker is not good, no?

top - 16:34:11 up 45 days, 22:56, 3 users, load average: 5.19, 5.74, 10.87
Threads: 81 total, 4 running, 77 sleeping, 0 stopped, 0 zombie
%Cpu(s): 13.9 us, 0.4 sy, 84.9 ni, 0.3 id, 0.0 wa, 0.0 hi, 0.4 si, 0.0 st
KiB Mem : 15398116 total, 173912 free, 11405412 used, 3818792 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 3417404 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27886 logstash 39 19 4718860 492004 10596 R 86.7 3.2 12850:42 [main]>worker0
27888 logstash 39 19 4718860 492004 10596 R 86.7 3.2 12853:13 [main]>worker2
27887 logstash 39 19 4718860 492004 10596 R 81.1 3.2 12849:42 [main]>worker1
27889 logstash 39 19 4718860 492004 10596 R 79.7 3.2 12857:18 [main]>worker3
27894 logstash 39 19 4718860 492004 10596 S 2.3 3.2 68:36.41 [main]<file
27890 logstash 39 19 4718860 492004 10596 S 1.0 3.2 64:45.88 [main]<file
27742 logstash 39 19 4718860 492004 10596 S 0.3 3.2 19:17.18 java
27743 logstash 39 19 4718860 492004 10596 S 0.3 3.2 19:23.35 java
27744 logstash 39 19 4718860 492004 10596 S 0.3 3.2 19:23.70 java
27745 logstash 39 19 4718860 492004 10596 S 0.3 3.2 19:27.28 java
27747 logstash 39 19 4718860 492004 10596 S 0.3 3.2 9:10.55 java
27757 logstash 39 19 4718860 492004 10596 S 0.3 3.2 11:35.33 java
27892 logstash 39 19 4718860 492004 10596 S 0.3 3.2 18:20.70 [main]<file
27895 logstash 39 19 4718860 492004 10596 S 0.3 3.2 11:16.33 [main]-pipeline
27706 logstash 39 19 4718860 492004 10596 S 0.0 3.2 0:00.00 java
27738 logstash 39 19 4718860 492004 10596 S 0.0 3.2 2:35.57 LogStash::Runne
27746 logstash 39 19 4718860 492004 10596 S 0.0 3.2 80:24.30 java
27748 logstash 39 19 4718860 492004 10596 S 0.0 3.2 0:11.65 java
27749 logstash 39 19 4718860 492004 10596 S 0.0 3.2 0:05.61 java
27751 logstash 39 19 4718860 492004 10596 S 0.0 3.2 0:00.17 java
...


(Dheeraj Gupta) #2

My understanding is that any thread name containing "<" is an input, so it looks like you have three file input threads from your config. Logstash also spawns a number of worker threads to do the job (the filter and output stages of the pipeline); in your case there are 4 workers ([main]>worker0 through [main]>worker3). This is controlled by the pipeline.workers setting in your Logstash config and defaults to the number of CPU cores you have.

Judging from the high CPU load, it looks like your workers do a lot of processing (you said you have a lot of grok filters etc.). You can try increasing the number of workers (add cores, or just raise the setting) or tinker with the pipeline.batch.size setting to see whether decreasing or increasing the number of in-flight events gives you some speed-up.
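For reference, both settings live in logstash.yml. The values below are only illustrative, not recommendations for your box:

```yaml
# logstash.yml -- example values, tune for your own hardware/workload.

# Worker threads that run the filter + output stages of the pipeline
# (defaults to the number of CPU cores):
pipeline.workers: 8

# How many events each worker pulls from the inputs per batch:
pipeline.batch.size: 250
```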

In general grok filters are very expensive, and newer versions of Logstash have the dissect filter, which splits on fixed delimiters instead of evaluating regular expressions. If this fits your use case, I'd recommend changing grok to dissect. It is way faster.
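As a rough illustration (the log format and field names here are made up, not taken from your config), the same line can often be split either way:

```
# Hypothetical log line: "ERROR 2017-06-01T16:34:11 took 42ms"

# grok: regex-based, expensive under load
filter {
  grok {
    match => { "message" => "%{LOGLEVEL:level} %{TIMESTAMP_ISO8601:ts} took %{NUMBER:ms}ms" }
  }
}

# dissect: plain delimiter matching, much cheaper
filter {
  dissect {
    mapping => { "message" => "%{level} %{ts} took %{ms}ms" }
  }
}
```

Dissect only works when the fields always appear in the same order with fixed delimiters; for genuinely variable formats you still need grok.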

You can also use the Logstash node stats API to pinpoint which filters are causing a bottleneck. It also reports a count of queued events, so it is handy for watching queue size over time.
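For example, assuming the API is listening on its default port 9600 on the Logstash host (on 5.x the path is /_node/stats/pipeline, singular):

```shell
# Per-plugin timings and event counts for the pipeline(s):
curl -XGET 'http://localhost:9600/_node/stats/pipelines?pretty'
# In the response, each entry under "plugins.filters" includes
# "events.duration_in_millis", which shows where the time goes,
# and the "events" section shows how many events are in flight.
```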


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.