Hopefully someone can help with this issue. I have seen similar posts, but nothing has helped us so far.
Our Linux-based Logstash instance reads a series of Windows IIS log files from a CIFS share and sends the log entries to a Linux-based Elasticsearch cluster.
Our problem is that Logstash randomly stops processing even though more data and files exist. There are no Logstash errors (logstash.err is 0 bytes), and the service is still running after it stops processing the log files.
By "randomly stops" I mean Logstash can stop processing anywhere between an hour and 48 hours after the last Logstash restart. It can happen several times in a six-hour span, or not at all.
I have tried upgrading (we originally used Logstash v1.4.2), adding memory/heap, altering the logstash.conf file, etc.
Details:
Logstash: v1.5.0 (May 14, 2015), installed on one CentOS 7 VM, 6 CPUs, 16 GB RAM (2 GB for heap), running as a service
Elasticsearch: v1.5.2, installed on five CentOS 7 VMs, each with 4 CPUs and 16 GB RAM
Filer: CIFS share with 35 subfolders (one per IIS server) off the root; each subfolder contains roughly 480 IIS log files (i.e. one for each hour of each of 20 days).
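For scale, here is a rough way to count how many files the three file-input globs have to track (servernameXX/YY/ZZ and hourfilename are the same placeholders used in the conf file further down):

# count the log files the file input's globs will match (placeholder paths from the conf below)
ls /sharefolder/servernameXX/hourfilename.adv_H201506*.log \
   /sharefolder/servernameYY/hourfilename.adv_H201507*.log \
   /sharefolder/servernameZZ/hourfilename.adv_H201508*.log | wc -l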
-bash-4.2$ ps auxx | grep logstash:
logstash 16465 443 6.0 5795564 983204 pts/0 SNl 08:06 413:43 java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log -w 6
-bash-4.2$ top -b -n1 |head:
top - 09:41:06 up 11 days, 8:43, 1 user, load average: 2.36, 2.51, 3.41
Tasks: 158 total, 1 running, 157 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 1.7 sy, 49.2 ni, 48.6 id, 0.1 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem : 16268828 total, 8096524 free, 1205768 used, 6966536 buff/cache
KiB Swap: 1679356 total, 1679356 free, 0 used. 14650808 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16465 logstash 39 19 5795564 983220 15160 S 293.3 6.0 417:04.18 java
1 root 20 0 57308 4396 2464 S 0.0 0.0 0:07.49 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.35 kthreadd
CPU usage when Logstash stops processing:
logstash: was running at a steady ~80%, then drops to 0 over a ten-minute span
elasticsearch: was running between 15 and 20%
Note that while Logstash is running hot at ~80%, it often runs like that for a day or two with no issues.
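Next time it stalls I plan to grab a thread dump and check the CIFS mount directly, to see whether the file-input threads are blocked on I/O. A rough sketch, assuming a JDK with jstack is available and using the PID from the ps output above:

# dump the JVM thread stacks while logstash is stalled (16465 = PID from the ps output above)
sudo -u logstash jstack -l 16465 > /tmp/logstash_threads.txt
# check whether the CIFS mount itself is still answering quickly at that moment
time ls /sharefolder/servernameXX > /dev/null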
logstash conf file:
input {
  file {
    path => ["/sharefolder/servernameXX/hourfilename.adv_H201506*.log","/sharefolder/servernameYY/hourfilename.adv_H201507*.log","/sharefolder/servernameZZ/hourfilename.adv_H201508*.log"]
    start_position => "beginning"
  }
}
filter {
  # drop the IIS header/comment lines
  if [message] =~ "^#" {
    drop {}
  }
  grok {
    patterns_dir => "./patterns"
    match => ["message", "%{TIMESTAMP_ISO8601:http_timestamp} %{QUOTEDSTRING:http_servername} %{WORD:http_method} %{DATA:http_uristem} %{DATA:http_uriquery} %{IIS_OPTIONAL_FIELD:http_username} %{DATA:http_clientip} %{DATA:http_version} %{DATA:http_useragent} %{IIS_OPTIONAL_FIELD:http_cookie} %{IIS_OPTIONAL_FIELD:http_referer} %{DATA:http_host} %{DATA:http_status} %{DATA:http_timetaken}\r"]
  }
  # use http_timestamp from the log as the event @timestamp
  if [http_timestamp] {
    date {
      match => [ "http_timestamp", "YYYY-MM-dd HH:mm:ss.SSS" ]
      timezone => "UTC"
      target => "@timestamp"
    }
  }
  mutate {
    remove_field => [ "host", "http_timestamp", "@version", "http_cookie" ]
  }
if "_grokparsefailure" not in [tags] {
fingerprint {
key => "appnameiislogs"
target => [ "fingerprint"]
}
mutate {
remove_field => [ "message" ]
convert => { "http_timetaken" => "integer" }
gsub => [
"http_servername", "\"", "",
"http_clientip", "\"", "",
"http_version", "\"", "",
"http_host", "\"", "",
"http_status", "\"", "",
"http_useragent", "\"", "",
"http_useragent", "\"", "",
"http_useragent", "\"", "",
"http_referer", "\"", ""
]
}
if [http_uriquery] and "-" == [http_uriquery] {
mutate {
remove_field => [ "http_uriquery" ]
}
}
if [http_username] and "-" == [http_username] {
mutate {
remove_field => [ "http_username" ]
}
}
if [http_useragent] and "-" == [http_useragent] {
mutate {
remove_field => [ "http_useragent" ]
}
}
if [http_referer] and "-" == [http_referer] {
mutate {
remove_field => [ "http_referer" ]
}
}
if [http_uriquery] {
urldecode {
field => "http_uriquery"
}
mutate {
gsub => [
"http_uriquery", "amp;", ""
]
}
kv {
source => "http_uriquery"
field_split => "&"
target => "querystring"
}
mutate {
remove_field => [ "[querystring][callback]", "[querystring][_]", "[querystring][api_token]", "[querystring][*s]", "[querystring][*h]" ]
}
}
}
}
output {
  if "_grokparsefailure" in [tags] {
    elasticsearch {
      host => ["aa.bb.cc.dd"]
      protocol => "http"
      index => "appname-error-log"
    }
  } else {
    elasticsearch {
      host => ["aa.bb.cc.dd"]
      protocol => "http"
      index => "appname-query-log-%{+YYYY.MM.dd}"
      document_id => "%{fingerprint}"
      idle_flush_time => 5
      max_retries => 5
      workers => 15
    }
  }
}
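One more detail in case it matters: as far as I understand, the file input records its per-file read offsets in sincedb files, and since we don't set sincedb_path in the config above they should be in the default location (for Logstash 1.x, the home directory of the user running Logstash, if I've read the docs correctly). A rough way to check whether those offsets are still advancing when processing stops:

# default sincedb location for the logstash 1.x file input (no sincedb_path set in the conf above)
ls -la ~logstash/.sincedb_*
# each line should be inode, major/minor device number, and byte offset
cat ~logstash/.sincedb_*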