My Logstash server has 4 CPUs and 16 GB of memory, and runs with 4 pipeline workers (the number of pipeline workers is supposed to match the number of CPUs). However, after running for more than 20 hours, it reported:
Beats input: The circuit breaker has detected a slowdown or stall in the pipeline...
Then I changed 'congestion_threshold' and 'timeout' to much larger values, following solutions I found online, and restarted Logstash. Unfortunately, after another 20+ hours the same issue occurred.
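For reference, the relevant part of my input config looks roughly like this (a sketch only; the port and the raised value are placeholders, and I have left out the timeout change since where that setting lives depends on the plugin/shipper version):

input {
  beats {
    # port is a placeholder
    port => 5044
    # raised well above the default so the circuit breaker trips later (value is illustrative)
    congestion_threshold => 300
  }
}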
I checked CPU and memory usage with the top command: CPU was around 1% and memory around 3%. Very low usage.
I'm not sure why the pipeline stalls under such low CPU and memory usage. What else can I do to avoid this issue? Any help would be appreciated.
The recommended settings for these parameters have changed over the last few releases. Which version of Logstash are you using? What does your config look like?
filter {
  grok {
    # first pass: full pattern, including the SR number embedded in the message body
    match => {
      "message" => [
        "\[(?<LogTime>(%{MONTHNUM}/%{MONTHDAY}/%{YEAR})\s+%{TIME}\s+%{WORD})\]\s+%{BASE16NUM:ThreadID}\s+(?<LogSource>([\w|\S]+))\s+%{WORD:LogLevel}\s+(?<Information>[\w|\W]*(?<SRNumber>(SR[A-Za-z\d][\d]+))[\W]+[\w|\W]*)",
        "\[(?<LogTime>(%{MONTHNUM}/%{MONTHDAY}/%{YEAR})\s+%{TIME}\s+%{WORD})\]\s+%{BASE16NUM:ThreadID}\s+(?<LogSource>([\w|\S]+))\s+%{WORD:LogLevel}\s+(?<Information>[\w|\W]*(\n)+(?<SRNumber>(SR[A-Za-z\d][\d]+))(\n)+[\w|\W]*)"
      ]
    }
    remove_field => ["message"]
  }
  # fall back to progressively looser patterns when the previous grok failed
  if "_grokparsefailure" in [tags] {
    grok {
      match => ["message", "\[(?<LogTime>(%{MONTHNUM}/%{MONTHDAY}/%{YEAR})\s+%{TIME}\s+%{WORD})\]\s+%{BASE16NUM:ThreadID}\s+(?<LogSource>([\w|\S]+))\s+%{WORD:LogLevel}\s+(?<Information>[\w|\W]*)"]
      remove_field => ["message"]
      remove_tag => ["_grokparsefailure"]
      add_field => {
        "SRNumber" => "-"
      }
    }
  }
  if "_grokparsefailure" in [tags] {
    grok {
      match => ["message", "\[(?<LogTime>(%{MONTHNUM}/%{MONTHDAY}/%{YEAR})\s+%{TIME}\s+%{WORD})\]\s+%{BASE16NUM:ThreadID}\s+%{WORD:LogLevel}\s+(?<Information>[\w|\W]*)"]
      remove_field => ["message"]
      remove_tag => ["_grokparsefailure"]
      add_field => {
        "SRNumber" => "-"
        "LogSource" => "-"
      }
    }
  }
  # last resort: keep the whole message and tag the event so it can be ignored later
  if "_grokparsefailure" in [tags] {
    grok {
      match => ["message", "(?<Information>[\w|\W]+)"]
      remove_field => ["message"]
      remove_tag => ["_grokparsefailure"]
      add_tag => ["ignore"]
      add_field => {
        "LogSource" => "-"
        "LogLevel" => "-"
        "SRNumber" => "-"
        "LogTime" => "-"
        "ThreadID" => "-"
      }
    }
  }
  if "SWIS" in [fields][ServerType] {
    date {
      match => ["LogTime", "M/d/yy HH:mm:ss:SSS z"]
      timezone => "GMT"
    }
  } else {
    date {
      match => ["LogTime", "M/d/yy HH:mm:ss:SSS z"]
      timezone => "UTC"
    }
  }
}
output {
  elasticsearch {
    hosts => "IP"
    index => "logstash-site-%{+YYYY.MM.dd}"
    flush_size => 50
  }
}
After increasing the number of pipeline workers from 4 to 8, Logstash has been working well for about 10 days. That's good progress compared to before, when it ran well for less than a day.
Is anything wrong with my configuration? If so, do you have any suggestions? This issue has bothered me for a long time, and I'd appreciate your help.
Why have you specified such a small flush_size? I would stick with the default value, as a small batch size can hurt performance by requiring additional round trips to Elasticsearch.
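Something along these lines, i.e. your existing output without the flush_size override, would let the plugin use its default batch size:

output {
  elasticsearch {
    hosts => "IP"
    index => "logstash-site-%{+YYYY.MM.dd}"
    # no flush_size here: fall back to the plugin default instead of flushing every 50 events
  }
}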
Although I don't know exactly what your data looks like or what proportion of the data each pattern matches, it looks like all the patterns start with the same sequence: \[(?<LogTime>(%{MONTHNUM}/%{MONTHDAY}/%{YEAR})\s+%{TIME}\s+%{WORD})\]\s+%{BASE16NUM:ThreadID}\s+
It may be more efficient to capture this in one grok filter and use a GREEDYDATA to capture the rest of the message into a separate variable that can then be matched against the various scenarios.
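As a rough, untested sketch of what I mean (the rest field name is just a placeholder, and since your events appear to span multiple lines you may need [\w\W]* instead of GREEDYDATA, which does not match newlines):

filter {
  # match the common prefix once and keep the remainder in a temporary field
  grok {
    match => {
      "message" => "\[(?<LogTime>(%{MONTHNUM}/%{MONTHDAY}/%{YEAR})\s+%{TIME}\s+%{WORD})\]\s+%{BASE16NUM:ThreadID}\s+%{GREEDYDATA:rest}"
    }
  }
  # then match only the (much shorter) remainder against the scenario-specific patterns
  grok {
    match => {
      "rest" => [
        "(?<LogSource>[\w|\S]+)\s+%{WORD:LogLevel}\s+(?<Information>[\w\W]*(?<SRNumber>SR[A-Za-z\d]\d+)[\w\W]*)",
        "(?<LogSource>[\w|\S]+)\s+%{WORD:LogLevel}\s+(?<Information>[\w\W]*)"
      ]
    }
    remove_field => ["message", "rest"]
  }
}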
Yes, you're right. The flush_size is too small; I hadn't noticed that until you pointed it out. Thanks a lot.
Your suggestion on grok also makes sense, but after updating the grok filter I see no improvement in Logstash performance. It hit the pipeline slowdown issue again after several hours of running. I may need to revise my grok configuration further.