Logstash 1.4.2 and 1.5.0 stop processing randomly with no errors


(Gerald DeConto) #1

Hopefully someone can help with this issue. I have seen other similar posts but nothing that has helped us so far.

Our Linux-based Logstash instance reads a series of Windows IIS log files from a CIFS share and sends the log entries to a Linux-based Elasticsearch cluster.

Our problem is that Logstash randomly stops processing even though more data and files exist. There are no Logstash errors (logstash.err is 0 bytes), and the service is still running after it stops processing the log files.

By "randomly stops" I mean Logstash can stop processing anywhere from one hour to 48 hours after the last restart. It can happen several times in a six-hour span, or not at all.

I have tried upgrading (we originally ran Logstash v1.4.2), increasing memory/heap, altering the logstash.conf file, etc.

Details:

Logstash: v1.5.0 (May 14, 2015), installed as a service on one CentOS 7 VM with 6 CPUs and 16 GB RAM (2 GB for heap)
Elasticsearch: v1.5.2, installed on five CentOS 7 VMs, each with 4 CPUs and 16 GB RAM
Filer: CIFS share with 35 subfolders (one per IIS server) off the root; each subfolder contains roughly 480 IIS log files (i.e., one per hour for each of 20 days)

-bash-4.2$ ps auxx | grep logstash:
logstash 16465 443 6.0 5795564 983204 pts/0 SNl 08:06 413:43 java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log -w 6

-bash-4.2$ top -b -n1 |head:
top - 09:41:06 up 11 days, 8:43, 1 user, load average: 2.36, 2.51, 3.41
Tasks: 158 total, 1 running, 157 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 1.7 sy, 49.2 ni, 48.6 id, 0.1 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem : 16268828 total, 8096524 free, 1205768 used, 6966536 buff/cache
KiB Swap: 1679356 total, 1679356 free, 0 used. 14650808 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16465 logstash 39 19 5795564 983220 15160 S 293.3 6.0 417:04.18 java
1 root 20 0 57308 4396 2464 S 0.0 0.0 0:07.49 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.35 kthreadd

CPU usage when Logstash stops processing:
Logstash: holds steady around 80%, then drops to 0 over a 10-minute span
Elasticsearch: stays between 15 and 20%

Note that while Logstash runs hot at ~80%, it often runs like that for a day or two with no issues.

logstash conf file:

input {
  file {
    path => ["/sharefolder/servernameXX/hourfilename.adv_H201506*.log","/sharefolder/servernameYY/hourfilename.adv_H201507*.log","/sharefolder/servernameZZ/hourfilename.adv_H201508*.log"]
    start_position => "beginning"
  }
}
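One thing worth checking with a CIFS source is the file input's sincedb and stat behavior: on network shares, slow stat calls and inode quirks can make the input silently stop tailing with no error logged. A hedged variant of the input above (path list shortened; the sincedb location and interval values are illustrative assumptions, not recommendations):

```conf
input {
  file {
    path => ["/sharefolder/servernameXX/hourfilename.adv_H201506*.log"]
    start_position => "beginning"
    # Pin sincedb somewhere explicit so position tracking is easy to
    # inspect (and to delete when forcing a full re-read).
    sincedb_path => "/var/lib/logstash/sincedb_iis"
    # How often each watched file is stat'ed, and how often the glob is
    # re-expanded to discover new hourly files (both in seconds).
    stat_interval => 5
    discover_interval => 30
  }
}
```

If the sincedb file stops being updated at the same time processing stops, that points at the input stage rather than the filters or output.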

filter {
  if [message] =~ "^#" {
    drop {}
  }

  grok {
    patterns_dir => "./patterns"
    match => ["message", "%{TIMESTAMP_ISO8601:http_timestamp} %{QUOTEDSTRING:http_servername} %{WORD:http_method} %{DATA:http_uristem} %{DATA:http_uriquery} %{IIS_OPTIONAL_FIELD:http_username} %{DATA:http_clientip} %{DATA:http_version} %{DATA:http_useragent} %{IIS_OPTIONAL_FIELD:http_cookie} %{IIS_OPTIONAL_FIELD:http_referer} %{DATA:http_host} %{DATA:http_status} %{DATA:http_timetaken}\r"]
  }
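Since Logstash sits near 80% CPU, it may also be worth anchoring the grok pattern. A pattern built mostly from %{DATA} can backtrack heavily on lines that do not match cleanly; anchoring with ^ lets mismatches fail fast. A sketch with the same fields, just anchored:

```conf
grok {
  patterns_dir => "./patterns"
  # The leading ^ pins the match to the start of the line, so
  # non-matching lines are rejected quickly instead of triggering
  # backtracking through every %{DATA} capture.
  match => ["message", "^%{TIMESTAMP_ISO8601:http_timestamp} %{QUOTEDSTRING:http_servername} %{WORD:http_method} %{DATA:http_uristem} %{DATA:http_uriquery} %{IIS_OPTIONAL_FIELD:http_username} %{DATA:http_clientip} %{DATA:http_version} %{DATA:http_useragent} %{IIS_OPTIONAL_FIELD:http_cookie} %{IIS_OPTIONAL_FIELD:http_referer} %{DATA:http_host} %{DATA:http_status} %{DATA:http_timetaken}\r"]
}
```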

  # use http_timestamp from the log
  if [http_timestamp] {
    date {
      match => [ "http_timestamp", "YYYY-MM-dd HH:mm:ss.SSS" ]
      timezone => "UTC"
      target => "@timestamp"
    }
  }

  mutate {
    remove_field => [ "host", "http_timestamp", "@version", "http_cookie" ]
  }

  if "_grokparsefailure" not in [tags] {
    fingerprint {
      key => "appnameiislogs"
      target => [ "fingerprint"]
    }

    mutate {
      remove_field => [ "message" ]
      convert => { "http_timetaken" => "integer" }
      gsub => [
        "http_servername", "\"", "",
        "http_clientip", "\"", "",
        "http_version", "\"", "",
        "http_host", "\"", "",
        "http_status", "\"", "",
        "http_useragent", "\"", "",
        "http_referer", "\"", ""
      ]
    }

    # "-" == [field] is only true when the field exists, so the
    # separate existence check is redundant.
    if "-" == [http_uriquery] {
      mutate { remove_field => [ "http_uriquery" ] }
    }
    if "-" == [http_username] {
      mutate { remove_field => [ "http_username" ] }
    }
    if "-" == [http_useragent] {
      mutate { remove_field => [ "http_useragent" ] }
    }
    if "-" == [http_referer] {
      mutate { remove_field => [ "http_referer" ] }
    }

    if [http_uriquery] {
      urldecode {
        field => "http_uriquery"
      }
      mutate {
        gsub => [
          "http_uriquery", "amp;", ""
        ]
      }
      kv {
        source => "http_uriquery"
        field_split => "&"
        target => "querystring"
      }
      mutate {
        remove_field => [ "[querystring][callback]", "[querystring][_]", "[querystring][api_token]", "[querystring][*s]", "[querystring][*h]" ]
      }
    }
  }
}

output {
  if "_grokparsefailure" in [tags] {
    elasticsearch {
      host => ["aa.bb.cc.dd"]
      protocol => "http"
      index => "appname-error-log"
    }
  } else {
    elasticsearch {
      host => ["aa.bb.cc.dd"]
      protocol => "http"
      index => "appname-query-log-%{+YYYY.MM.dd}"
      document_id => "%{fingerprint}"
      idle_flush_time => 5
      max_retries => 5
      workers => 15
    }
  }
}
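To rule the elasticsearch output in or out as the stall point (in this Logstash version a blocked output back-pressures and stalls the whole pipeline), one option is to temporarily swap in a stdout output and watch whether events keep flowing past the usual stall window:

```conf
output {
  # Temporary diagnostic: if events still print after the point where
  # processing normally stops, the stall is on the elasticsearch side;
  # if they stop too, look at the input/filter stages instead.
  stdout { codec => rubydebug }
}
```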
