Packetbeat crashing frequently while sniffing MongoDB - syslog shows repeated errors

Potentially related to "slice bounds out of range"?

I have Packetbeat 1.1.1 running on Ubuntu 14.04, sniffing traffic on a MongoDB 3.2.1 node. I'm quite impressed with the results that I'm getting, but it's also crashing frequently (I can usually get at least 15 minutes out of it, but it rarely seems to run more than an hour at a time). I noticed a lot of log messages like these in my syslog:

Feb 25 16:40:00 HOST /usr/bin/packetbeat[31004]: mongodb_parser.go:115: An error occured while parsing OP_REPLY message: Document is corrupted
Feb 25 16:40:06 HOST /usr/bin/packetbeat[31004]: mongodb_parser.go:40: Unknown operation code:
Feb 25 16:40:06 HOST /usr/bin/packetbeat[31004]: mongodb_parser.go:236: An error occured while parsing OP_QUERY message: Document is corrupted
Feb 25 16:40:06 HOST /usr/bin/packetbeat[31004]: mongodb_parser.go:236: An error occured while parsing OP_QUERY message: Document is corrupted
Feb 25 16:40:07 HOST /usr/bin/packetbeat[31004]: mongodb_parser.go:115: An error occured while parsing OP_REPLY message: Document is corrupted
Feb 25 16:40:07 HOST /usr/bin/packetbeat[31004]: mongodb_parser.go:236: An error occured while parsing OP_QUERY message: Document is corrupted
Feb 25 16:40:07 HOST /usr/bin/packetbeat[31004]: log.go:113: ParseMongodb exception. Recovering, but please report this: runtime error: slice bounds out of range.
Feb 25 16:40:07 HOST /usr/bin/packetbeat[31004]: log.go:114: Stacktrace: /go/src/github.com/elastic/beats/libbeat/logp/log.go:114 (0x4aa716)#012/usr/local/go/src/runtime/asm_amd64.s:437 (0x4621ee)#012/usr/local/go/src/runtime/panic.go:423 (0x4307c9)#012/usr/local/go/src/runtime/panic.go:18 (0x42ed09)#012/go/src/github.com/elastic/beats/packetbeat/protos/mongodb/mongodb_parser.go:346 (0x4f6a22)#012/go/src/github.com/elastic/beats/packetbeat/protos/mongodb/mongodb_parser.go:206 (0x4f4e18)#012/go/src/github.com/elastic/beats/packetbeat/protos/mongodb/mongodb_parser.go:67 (0x4f364f)#012/go/src/github.com/elastic/beats/packetbeat/protos/mongodb/mongodb.go:161 (0x4efb43)#012/go/src/github.com/elastic/beats/packetbeat/protos/mongodb/mongodb.go:110 (0x4ef5a7)#012/go/src/github.com/elastic/beats/packetbeat/protos/tcp/tcp.go:101 (0x5114c1)#012/go/src/github.com/elastic/beats/packetbeat/protos/tcp/tcp.go:164 (0x512003)#012/go/src/github.com/elastic/beats/packetbeat/decoder/decoder.go:183 (0x6f80a7)#012/go/src/github.com/elastic/beats/packetbeat/decoder/decoder.go:101 (0x6f75e1)#012/go/src/github.com/elastic/beats/packetbeat/sniffer/sniffer.go:356 (0x526299)#012/go/src/github.com/elastic/beats/packetbeat/beat/packetbeat.go:232 (0x4676fb)#012/usr/local/go/src/runtime/asm_amd64.s:1721 (0x464561)
Feb 25 16:40:07 HOST /usr/bin/packetbeat[31004]: mongodb_parser.go:40: Unknown operation code:

Since the log message asked me so nicely to report it, I decided to open this up to see if anyone else is seeing this or has any suggestions.

Thanks!

Thanks for reporting. These stack traces are not crashing packetbeat, as packetbeat catches and recovers from the panic. We catch and log the error so it can be reported, but processing continues. As long as you still get some data after the error, it's not too critical.
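
To make the "Recovering" line a bit less mysterious: the parser call is wrapped in a defer/recover guard, so a panic inside the parser is logged and the sniffer keeps going. A minimal sketch of that pattern (simplified, with made-up names, not the actual packetbeat source):

package main

import (
    "fmt"
    "log"
)

// parsePayload stands in for the protocol parser; a short or garbled
// payload triggers a runtime panic (slice bounds out of range).
func parsePayload(data []byte) {
    _ = data[10:20] // panics if data holds fewer than 20 bytes
}

// safeParse wraps the parser with a recover guard so the panic is
// logged and processing of later packets continues.
func safeParse(data []byte) {
    defer func() {
        if r := recover(); r != nil {
            log.Printf("ParseMongodb exception. Recovering, but please report this: %v", r)
        }
    }()
    parsePayload(data)
}

func main() {
    safeParse([]byte{0x01, 0x02}) // too short: panic is caught and logged
    fmt.Println("still running; processing continues")
}

Only the state of the affected TCP stream is touched; the process itself keeps running, which is why you still see data before and after these messages.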

Checking the mongodb code, I see this panic potentially being generated if the parser did not capture the start of a mongodb message. This can happen either on packetbeat startup or due to packet loss, which leaves the parser out of sync with a newly discovered TCP stream. I see a number of "Document is corrupted" and "Unknown operation code" messages in your logs, which clearly points to packet loss after 15 minutes.

These panics show there is quite some opportunity to harden the mongodb parser. But hardening the parser doesn't change the fact that the parser is out of sync due to packet loss. If the parser is out of sync, it drops its internal state about the TCP stream and tries to resync.
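
To give an idea of what hardening means here: it is mostly about validating length fields before slicing into the buffer, so that a bogus length from an out-of-sync stream becomes a normal parse error instead of a panic. An illustrative sketch only, not the real mongodb_parser.go code:

package main

import (
    "encoding/binary"
    "errors"
    "fmt"
)

var errDocCorrupt = errors.New("Document is corrupted")

// readDocument reads one length-prefixed document from buf, but
// checks the declared length against the bytes actually available
// instead of slicing blindly and panicking.
func readDocument(buf []byte) (doc, rest []byte, err error) {
    if len(buf) < 4 {
        return nil, buf, errDocCorrupt
    }
    docLen := int(binary.LittleEndian.Uint32(buf[:4]))
    if docLen < 5 || docLen > len(buf) {
        // Out-of-range length: the stream is out of sync. Return an
        // error so the caller can drop its state and resync rather
        // than let buf[:docLen] blow up.
        return nil, buf, errDocCorrupt
    }
    return buf[:docLen], buf[docLen:], nil
}

func main() {
    // Claims a 32-byte document while only 4 bytes are available.
    doc, rest, err := readDocument([]byte{0x20, 0x00, 0x00, 0x00})
    fmt.Println(doc, rest, err)
}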

While there can be many reasons for packet loss (some process - kernel, packetbeat, elasticsearch - not being fast enough), one reason for seeing these after 15 minutes might be the output (elasticsearch/logstash?) slowly generating back-pressure by not accepting/indexing transaction events fast enough. This kind of back-pressure can lead to the publisher queue blocking the protocol parsing worker. We're working on protecting the protocol analyzer state machines by dropping (and reporting) transaction events when the publisher queue is stalling processing.
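
The change amounts to a non-blocking enqueue: if the publisher queue cannot take the transaction event right away, the event is dropped (and counted) instead of stalling the parsing worker. A rough sketch of the idea, not the actual publisher code:

package main

import "fmt"

// publishOrDrop tries to hand an event to the publisher queue (a
// buffered channel here). If the queue is full because the output is
// applying back-pressure, the event is dropped and counted rather
// than blocking the parser.
func publishOrDrop(queue chan<- map[string]interface{}, event map[string]interface{}, dropped *int) {
    select {
    case queue <- event:
        // queued for the output (elasticsearch/logstash)
    default:
        *dropped++ // back-pressure: count the drop, keep parsing
    }
}

func main() {
    queue := make(chan map[string]interface{}, 1)
    dropped := 0
    for i := 0; i < 3; i++ {
        publishOrDrop(queue, map[string]interface{}{"seq": i}, &dropped)
    }
    fmt.Println("queued:", len(queue), "dropped:", dropped)
}

With a blocking send the parser stalls behind a slow output; with the select/default it keeps up with the wire at the cost of losing some transaction events.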

Current nightlies might give you a little better experience, as transaction events are dropped (silently at the moment) instead of blocking.

Thanks for the quick response and the tips. I'm a novice at all this and running on somewhat low-powered hardware (a single VM node while I try to prove this out), so I can definitely believe that things are backing up.

I started tailing my syslog, and it looks like it's most prone to throwing errors during and right after the times when I'm loading dashboards in Kibana, which lends credence to the idea that it's simply overloaded. I still see errors at other times, so I can't say that's the only cause, but it's somewhere to start.

Thanks again.