I have a basic ELK stack with Lumberjack shipping data into Logstash, and then to Elasticsearch.
Versions:
Logstash: 1.5.3
Elasticsearch: 1.7.0
I'm getting the errors below logged a lot in logstash.log.
{:timestamp=>"2015-10-05T16:36:52.736000+0100", :message=>"CircuitBreaker::Open", :name=>"Lumberjack input", :level=>:warn}
{:timestamp=>"2015-10-05T16:36:52.737000+0100", :message=>"Exception in lumberjack input thread", :exception=>#<LogStash::CircuitBreaker::OpenBreaker: for Lumberjack input>, :level=>:error}
{:timestamp=>"2015-10-05T16:36:53.194000+0100", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
Does this just mean that Elasticsearch is overloaded, and if so, is there any way to verify it? There are no errors in any of the Elasticsearch logs saying it is refusing connections or struggling to index incoming data.
Logstash's input rate is higher than its output rate. This doesn't have to be an overload condition per se; it just means that you're (possibly temporarily) receiving more data than can be processed. This is a concern if it's happening constantly, but not necessarily a problem if it just happens now and then.
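One quick way to verify whether Elasticsearch is the side pushing back is to look at the bulk and index thread pool rejections on the node - a minimal check, assuming Elasticsearch is reachable on localhost:9200 (adjust the host to your setup):

curl 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected,index.active,index.queue,index.rejected'

Non-zero, growing "rejected" counts mean Elasticsearch is rejecting indexing requests and Logstash has to retry them; if they stay at zero, the backlog is more likely inside the Logstash pipeline itself (filters or the output not keeping up).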
This is happening pretty regularly - my Kibana graphs show a lot of quiet periods when I'm guessing the pipeline is shut down, e.g. see http://snag.gy/3E0mU.jpg
I don't have the most optimal Elasticsearch setup - 1 VM with 8 GB RAM, 8 cores and indices on NFS - but CPU and RAM usage average 50%, and testing the NFS IO performance with:
time dd if=/dev/zero of=testfile bs=16k count=32000
returns an IO rate of 55 MB/s, so with an average daily index size of 16 GB it shouldn't be IO limited.
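One caveat on that dd test: without forcing a flush, dd can be measuring the page cache rather than the actual NFS write rate. A variant that flushes to the server before reporting (assuming GNU dd):

time dd if=/dev/zero of=testfile bs=16k count=32000 conv=fdatasync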
This is happening pretty regularly - my Kibana graphs show a lot of quiet periods when I'm guessing the pipeline is shut down, e.g. see http://snag.gy/3E0mU.jpg
Are you populating the @timestamp field properly so that this chart shows a steady state view of when events were emitted from the source, or does @timestamp contain the time Logstash processed the event?
and testing the NFS IO performance with: [...] returns an IO rate of 55 MB/s,
I'd expect the chief problem with NFS storage to be latency, not bandwidth. Bandwidth limitations will lead to slow segment merges, though.
The majority are unstructured logs that don't contain a source timestamp, so they get whatever Lumberjack/Logstash gives them. Giving the log entries timestamps is actually one of the advantages of using Logstash with them!
It seems to be hard to measure NFS latency - most of the tools I've used (e.g. collectd) don't seem to cover latency on NFS. If you have any suggestions on how to measure this, I'd be interested!
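nfsiostat (part of nfs-utils on most distributions) is one option - it reports the average RTT and execution time per NFS read and write operation. A rough sketch, assuming the indices sit on a mount like /var/lib/elasticsearch (substitute your actual mount point):

nfsiostat 5 3 /var/lib/elasticsearch

That samples every 5 seconds, 3 times, for the given mount. The raw per-mount counters behind it, including per-operation round-trip times, are also in /proc/self/mountstats.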
Garbage collection doesn't look too bad - it's not really doing any full GCs at the moment, which is actually quite surprising! e.g. http://snag.gy/DcMrZ.jpg
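For reference, the same GC figures can be pulled straight from the node stats API rather than a graphing tool (assuming Elasticsearch on localhost:9200):

curl 'localhost:9200/_nodes/stats/jvm?pretty'

The jvm.gc.collectors section shows the young and old generation collection counts and total collection times.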
The "Lumberjack input: the pipeline is blocked," errors reduce when our systems are quieter and there are less logs being shipped, so it's definitely load related. Just surprising as all the Elasticsearch metrics I can see look OK, and there is nothing in the Elasticsearch logs about being overloaded etc.
Was this issue ever resolved? I've been running into the same problem and cannot find a way to get around this. I hope scaling out is not the only solution.