Logstash Hangs due to overload

I have a Filebeat -> Logstash -> Elasticsearch setup where Filebeat reads local logs from a VM and sends them to Logstash, which in turn indexes them into Elasticsearch.

The above setup worked well in a small environment, but our production now has 500 or more VMs writing logs frequently. Each VM runs a Filebeat instance that ships its logs to Logstash and on to Elasticsearch.

Currently we get about 1 GB of logs and 10 million or more log events a day. The throughput from Filebeat is not constant, since logs are only written while processes are running on the VMs.

The issue is that Logstash hangs after some time under heavy load, and I then have to restart the server to resume indexing. Because of the frequent hangs, a lot of log data is lost. How can we solve this?

Filebeat.yml:

filebeat:
  prospectors:
    - paths:
        - /home/*/*/exception.log
      tags: ["AssetException-Linux"]
      input_type: log

    - paths:
        - /home/asset/*/*/*/*/log/*.*
      tags: ["AssetLog-Linux"]
      input_type: log

output.logstash:
  hosts: ["x.x.x.x:5044"]

Logstash.config:

input {
  beats {
    port => 5044
  }
}
output {
  elasticsearch {
    hosts => "x.x.x.x:9200"
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}
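
The config above uses Logstash's default in-memory queue. One option sometimes used to absorb bursts from many Beats senders is the disk-backed persistent queue; a minimal logstash.yml sketch, assuming Logstash 5.6 or later (the path and size values are illustrative, not a recommendation):

logstash.yml:

# Buffer in-flight events on disk instead of in heap memory,
# so bursts are absorbed by the queue rather than by RAM.
queue.type: persisted
path.queue: /var/lib/logstash/queue   # illustrative location on a disk with free space
queue.max_bytes: 4gb                  # illustrative cap; size it to the available disk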

Logstash can only index as fast as Elasticsearch can receive data, so ruling that out as a bottleneck is a good start. What is the specification of your Elasticsearch cluster? What throughput are you seeing? Do you see anything in the Elasticsearch logs around long or frequent GC?

1) I have only one t2.medium instance as the Elasticsearch server and one t2.medium as the Logstash server. The specification is:
vCPU: 2
Memory: 4 GB
Processor: 2.5 GHz Intel Xeon
2) I haven't measured the hourly throughput; it varies over time depending on the processes running on the VMs, i.e. some assets write fewer logs, so throughput is smaller, and vice versa.
3) I frequently see the logs below on Elasticsearch.

[2018-08-16T23:56:16,584][WARN ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] high disk watermark [90%] exceeded on [HtdutIGZQz2uAWgvcqNBHQ][HtdutIG][/var/lib/elasticsear$
[2018-08-16T23:56:46,618][WARN ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] high disk watermark [90%] exceeded on [HtdutIGZQz2uAWgvcqNBHQ][HtdutIG][/var/lib/elasticsear$
[2018-08-16T23:56:46,618][INFO ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] rerouting shards: [high disk watermark exceeded on one or more nodes]
[2018-08-16T23:57:16,652][WARN ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] high disk watermark [90%] exceeded on [HtdutIGZQz2uAWgvcqNBHQ][HtdutIG][/var/lib/elasticsear$
[2018-08-16T23:57:46,686][WARN ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] high disk watermark [90%] exceeded on [HtdutIGZQz2uAWgvcqNBHQ][HtdutIG][/var/lib/elasticsear$
[2018-08-16T23:57:46,686][INFO ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] rerouting shards: [high disk watermark exceeded on one or more nodes]
[2018-08-16T23:58:16,719][WARN ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] high disk watermark [90%] exceeded on [HtdutIGZQz2uAWgvcqNBHQ][HtdutIG][/var/lib/elasticsear$
[2018-08-16T23:58:46,753][WARN ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] high disk watermark [90%] exceeded on [HtdutIGZQz2uAWgvcqNBHQ][HtdutIG][/var/lib/elasticsear$
[2018-08-16T23:58:46,753][INFO ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] rerouting shards: [high disk watermark exceeded on one or more nodes]
[2018-08-16T23:59:16,787][WARN ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] high disk watermark [90%] exceeded on [HtdutIGZQz2uAWgvcqNBHQ][HtdutIG][/var/lib/elasticsear$
[2018-08-16T23:59:46,821][WARN ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] high disk watermark [90%] exceeded on [HtdutIGZQz2uAWgvcqNBHQ][HtdutIG][/var/lib/elasticsear$
[2018-08-16T23:59:46,821][INFO ][o.e.c.r.a.DiskThresholdMonitor] [HtdutIG] rerouting shards: [high disk watermark exceeded on one or more nodes]
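
The DiskThresholdMonitor warnings above mean the Elasticsearch data volume is more than 90% full, which blocks new shard allocation and can stall indexing on its own; freeing disk space or deleting old indices is the real fix. For reference only, the thresholds live in elasticsearch.yml (the values shown are the defaults, assuming Elasticsearch 6.x; raising them is a band-aid, not a solution):

elasticsearch.yml:

# Disk-based shard allocation thresholds (defaults shown).
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%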

When I start Logstash, it begins at about 30% CPU utilization and 300 MB of RAM, but as time progresses the RAM usage keeps increasing and the process hangs. There are no error logs from Logstash.

t2 instances are generally not very suitable for Logstash and Elasticsearch as they have limited CPU allocation and can easily run out of credits, which will cause performance problems. You will see better performance if you upgrade to small m4/m5 or even r4 instances.

I will change the instance type as suggested and will let you know how it goes. Is there a way to reduce the number of messages per minute from Filebeat? That might also help solve this issue, right?
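
Filebeat does not have a hard per-minute rate limit; it relies on backpressure from Logstash to slow down. Two knobs that can smooth the send rate are the prospector's harvester_limit and the Logstash output's bulk_max_size; a filebeat.yml sketch with illustrative values (not a recommendation):

filebeat:
  prospectors:
    - paths:
        - /home/asset/*/*/*/*/log/*.*
      tags: ["AssetLog-Linux"]
      input_type: log
      harvester_limit: 100      # illustrative: cap concurrently open files per prospector

output.logstash:
  hosts: ["x.x.x.x:5044"]
  bulk_max_size: 1024           # illustrative: smaller batches per request (default 2048)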

Once you switch to an instance type with better CPU allocation I would expect the pipeline to easily keep up.

Hi Chris, this is to let you know that your recommendation worked. Thanks for the help.
