LS intermittently shows zero EPS received/emitted

Hi All,

Looking for a little advice on narrowing down the source of a problem we have with our ELK instance.

Intermittently, we're seeing the received and emitted EPS on an LS node drop to zero, and we can't figure out why. The image below demonstrates what happens.

We've pushed the 6.7.1 code out for LS (along with ES and Kibana on other nodes), but this hasn't fixed the issue. Other things we've tried are:

  • Increased the heap.
  • Disabled some filter plugins in the conf files (jdbc_streaming, drop, geoip, cidr).
  • Increased/decreased pipeline.batch.size through 125, 250, 500, and 1000, which the system copes with fine (see the sketch after this list).
  • Performed bin/logstash-plugin update.
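
For reference, the batch-size experiments were just tweaks in logstash.yml along these lines (the values below are illustrative, not what we actually run in production):

# logstash.yml - pipeline tuning knobs we have been experimenting with
pipeline.workers: 4          # defaults to the number of CPU cores
pipeline.batch.size: 250     # events per worker batch; we cycled through 125/250/500/1000
pipeline.batch.delay: 50     # ms to wait for a batch to fill before flushing it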

None of the above has fixed the issue, and I'm unsure where to look next. The heap graph looks 'wrong' to me when the issue occurs, but I don't know where to start investigating that.

A tcpdump on the ingress interface shows that all input ports are receiving traffic (we use a mix of beats, syslog, and JSON).
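
For what it's worth, the check was nothing more sophisticated than something like the following (the interface name and ports are placeholders, substitute whatever your inputs actually listen on):

# confirm the input ports are still seeing packets while EPS is at zero
tcpdump -nn -i eth0 'tcp port 5044 or port 514'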

Any ideas, or suggestions?

Cheers
Andy

I do not think that is a problem. The long-term sawtooth pattern is caused by objects from Eden being promoted into the tenured generation. When the tenured generation is full, it runs a GC and frees all the garbage.

If LS stops processing events (which it does), then promotion slows down dramatically and you see a much shorter-term sawtooth pattern as the new generation fills and is GC'd.
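
If you want to watch that promotion happening directly, sampling the Logstash JVM with jstat shows Eden and old-gen occupancy over time (replace <logstash-pid> with the actual LS process id):

# sample generation occupancy every 5 seconds
# E = Eden %, O = old/tenured %, YGC/FGC = young/full GC counts
jstat -gcutil <logstash-pid> 5000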

I suspect that the output is not accepting events, so the queues fill and back-pressure prevents LS from reading more events.

What sort of output are you using? Anything of interest in its logs?
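
One way to confirm the back-pressure theory would be to compare what the Logstash node stats API says about the pipeline with bulk rejections on the ES side, for example (the hosts and ports below are the defaults, adjust to your setup):

# Logstash side: events in/out per plugin, filter/output timings
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'

# ES side: a climbing "rejected" count on the write/bulk thread pool means ES is pushing back
curl -s 'http://x.x.x.x:9200/_nodes/stats/thread_pool?pretty'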

Hi @Badger

Thanks a lot for the swift response.

There's nothing in the LS logs (logstash-plain.log) to show any issues when the problem occurs, nor in the clusterXX.log on the remote ES host.

We have an ongoing issue with a JunOS filter (kv plugin) which we're looking into, but it's been there since day one, so it's possibly unrelated -

0x73c952b9>], :response=>{"index"=>{"_index"=>"logstash-2019.04.07", 
"_type"=>"doc", "_id"=>"ca81-WkBixSH8MB7Axct", "status"=>400, "error"=> 
{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse", "caused_by"=>
{"type"=>"illegal_argument_exception", "reason"=>"object field starting or ending with a [.]
makes object resolution ambiguous: [0..3]"}}}}}

In ES logs, I see -

0x3354a357>], :response=>{"index"=>{"_index"=>"logstash-2019.04.07", "_type"=>"doc", 
"_id"=>"rbY6-WkBixSH8MB7nLDi", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", 
"reason"=>"failed to parse", "caused_by"=>{"type"=>"illegal_argument_exception", 
"reason"=>"object field starting or ending with a [.] makes object resolution ambiguous: [0..7]"}}}}}

I tweaked pipeline.batch.size to 1000 recently, and the graphs are much closer to what I'd expect historically (this ELK stack has been running for about 9 months) -

Output is all to ES. Nothing unusual -

output {
  elasticsearch {
    hosts => ["http://x.x.x.x:9200"]
    index => "logstash-%{+YYYY.MM.dd}"
  }
}

Cheers
Andy
