Potential memory leak using tcp input?

Hello, we are using logstash to collect sflow. We use fluentd at the ingest point, which forwards flows to logstash to do some processing and push to the elastic cloud. These both reside on the same box and largely work great together (fluentd/logstash).

We noticed on one of our higher volume flow boxes that logstash appears to be leaking memory, with items piling up within logstash and slowly consuming the entire heap.

We were first able to see this within VisualVM, with the heap size slowly growing over a few days to the point of it causing OOM errors. We added more memory to jvm.options and to the host but saw the same behavior.

This led us to pull a heap dump and explore it within Eclipse MAT, as the Elastic documentation suggests. I'm by no means an expert in either Java or its heap dumps, but I believe we have found something interesting. I'm not positive how to interpret it or how to fix it, so I was hoping somebody on this forum could assist.

This is our input segment which we believe is causing this:

input {
  tcp {
    host => "127.0.0.1"
    port => "8888"
    ecs_compatibility => "disabled"
    codec => "fluent"
  }
}

We see the same result whether we use the fluent codec or not, fwiw.

We see a LARGE number of items being stored which reference "errorinfo: org.jruby StandardError" and a decode_buffer "Not enough bytes available" message:

The truncated items are:

usr.share.logstash.vendor.bundle.jruby.$3_dot_1_dot_0.gems.logstash_minus_input_minus_tcp_minus_6_dot_4_dot_1_minus_java.lib.logstash.inputs.tcp.decoder_impl.decode() (/usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/logstash-input-tcp-6.4.1-java/lib/logstash/inputs/tcp/decoder_impl.rb:23)
usr.share.logstash.vendor.bundle.jruby.$3_dot_1_dot_0.gems.logstash_minus_input_minus_tcp_minus_6_dot_4_dot_1_minus_java.lib.logstash.inputs.tcp.decode_buffer() (/usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/logstash-input-tcp-6.4.1-java/lib/logstash/inputs/tcp.rb:219)

/usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/logstash-input-tcp-6.4.1-java/lib/logstash/inputs/tcp/decoder_impl.rb:23
/usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/logstash-input-tcp-6.4.1-java/lib/logstash/inputs/tcp.rb:219

This is the decode_buffer code segment referenced in tcp.rb:

217   def decode_buffer(client_ip_address, client_address, client_port, codec, proxy_address,
218                     proxy_port, tbuf, ssl_subject)
219     codec.decode(tbuf) do |event|
220       if @proxy_protocol
221         event.set(@field_proxy_host, proxy_address) unless event.get(@field_proxy_host)
222         event.set(@field_proxy_port, proxy_port) unless event.get(@field_proxy_port)
223       end
224       enqueue_decorated(event, client_ip_address, client_address, client_port, ssl_subject)
225     end
226   end

I guess at this point it's not clear to me what specifically is causing these errors, or how to correct them.

I have many heap dumps and would be happy to provide more info. Like I said, I'm not an expert in Java or heap dumps, so if you would like more info please be explicit about how I can provide it.

It feels like these errors are being retained on the heap. In general the entire flow for our process works fine, and the data/numbers we're seeing in Elastic appear to be correct. We just have these heap issues and I don't know why.

And before the elastiflow bro chimes in here, we are not interested in using elastiflow.

Thank you for any assistance you can provide.


We see a lot of these "Not enough bytes available".

Again, any help is appreciated.


When you start up MAT, Leak Suspect #1 should be listed front and centre. What does it look like?

At the time of this heap dump we had 8 or 10 worker threads in the sflow pipeline in question; most of the items listed belong to that pipeline.

Let me know what else I can provide to assist.


It would be interesting to know how big tbuf is. I am wondering if the codec is failing to decode the buffer, so it is just accumulating more and more data (like 300 MB) in tbuf. I haven't used MAT for a few years, so I cannot remember how to find that.

I had a similar thought, although I'm unclear on how to check this. I'm open to using a different method for viewing the dump.

We are getting data all the way through logstash and out the other side to Elastic, so clearly not all flow records are accumulating.

Provided your theory is correct and items are accumulating in tbuf, I'd be curious to know what specifically is getting stuck in there and what those flows look like.

If I remember correctly you would drill into one of those org.jruby.RubyThread objects and expand the list of objects (i.e. all 41 instead of the first 10). I would expect one of those 41 objects will have either shallow heap or retained heap that explains the 300 MB. Drill into that object, rinse and repeat. Eventually you will likely find what the 300 MB is.

And honestly at this point I'm fairly confused as to what exactly is happening:
- Is this too much data coming in at once and overflowing the buffer?
- Is this possibly bad or corrupted data coming in that gets "stuck" and backs up tbuf?
- Is this an issue with the tcp input or the fluent codec used?
- Is this possibly a very small thing getting stuck, with the memory/heap usage amplified by the MASSIVE number of jruby error strings stored in the heap?

Along with my questions above, I'm unclear on how to find the root cause short of drilling down to the bottom of the heap, which done manually would likely take at least a full day, if not a week. I'm not trying to avoid that manual work, but it'd be awesome if we could pin down a culprit in a more efficient way. I'm also not even sure I'm doing what you've asked me to do correctly. Like I said, I'm pretty new to Java heaps as a whole.

We also have the option of not using the fluent codec in the input, or of not using the tcp input at all, but I'd like to know for sure what the culprit is before we explore those avenues, and I just don't have the java/ruby/whatever skills to track that down.

Are there any other recommendations for what I could do or provide that would help shed some light on where we should look to figure out the root cause?

A Google search for that gets hits for Hive. Any chance there are error messages coming from a Hive DB?

You mention sflow; are you trying to decode truncated packets? If so, how? Could that be the source of these errors?

We are not using Hive anywhere within the pipeline that processes sflow (I wish we were, if only to narrow down what's happening).

I guess it's possible we're attempting to decode truncated packets; however, fluentd sits in front of logstash, processing incoming sflow directly, and then hands it off to logstash for filtering. I would venture that fluentd would likely also be throwing up on truncated packets before they reached logstash, if that were the case. Additionally, the volume of flows is very high, so it's unlikely we could keep, say, a tcpdump running long enough to verify whether that is happening.

The "Not enough bytes available" errors shown above appear to specifically reference the tcp input code segment, by line. Unfortunately this doesn't really give me much info on if this is a fluent codec problem with retrieving items from tbuf, if tbuf is just full as data comes in faster than the pipeline can process, if this is malformed data, or something else entirely.

I do feel that the heap usage would grow much more slowly if the jruby errors weren't also being kept on the heap. I've zero clue whether that is even possible, and if so how; just thinking out loud.

Additionally, as stated above, we're not specifically constrained to using the tcp input OR the fluent codec; we can move off either or both, but I'd like to nail down the culprit before we start making such changes. I just don't know how to do that, as I don't really understand the underlying cause.

Well, well, well. Three holes in the ground!

The last two StackTraceElement entries are Unpacker.java:209 and Unpacker.java:222, which point at the code that calls the decoder to get the next entry in the msgpack buffer. In Decoder.java there are three occurrences of the code raising an underflowErrorClass with the text "Not enough bytes available". The first is here. The Unpacker catches this and explicitly does not rethrow it. feed_each is called directly by the codec. I would expect the exception to go out of scope and get GC'd.

That is interesting, but I am not sure if it is useful.

The way forward may involve attaching a Java debugger to logstash and setting a breakpoint in the constructor for java.nio.BufferUnderflowException. I have read about that approach, but never had to do it. When the breakpoint triggers you can look at the stack to find which of the three occurrences you are hitting. Then look at the parameters passed to various methods to see what it is trying to decode. This may involve building logstash from source to make sure the debugging symbols are present.
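In practical terms that would mean enabling JDWP and attaching jdb. A rough recipe (I have not tried this against logstash specifically, and the port number is arbitrary) would be to add this line to jvm.options:

-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

Then, from a shell on the same box:

jdb -attach localhost:5005
stop in java.nio.BufferUnderflowException.<init>

When the breakpoint fires, "where" will print the stack for that thread, which should tell you which of the three occurrences you are hitting.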

Another possible approach would be to remove the fluent codec and try writing the raw input stream to a file, then consume the file with a fluent codec to isolate which event is causing the problem, then pass just that event to logstash and see what happens.

And yes, that is going to be a lot of work.

Badger, I don't think I've said this yet but thank you for the time on this.

We have talked internally about removing the fluent codec to see if we can use just the tcp input, which may require us to map a few things differently in the filter section of this pipeline but it's doable. We've also talked about putting redis in between fluentd and logstash, then using the redis input in our pipeline, which might solve our issue.
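For the first option, the input would presumably end up looking something like this (an untested sketch; it assumes we also reconfigure fluentd to forward newline-delimited JSON instead of its native forward/msgpack protocol):

input {
  tcp {
    host => "127.0.0.1"
    port => "8888"
    ecs_compatibility => "disabled"
    codec => "json_lines"
  }
}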

As to your mention of Unpacker.java, I'm entirely unfamiliar with it. Can you tell me what that is/does and what it stores?

Unpacker.java is a class that is part of the msgpack-ruby implementation. The codec is calling the feed_each method, which is an iterator that walks through a buffer, returning each object as it unpacks it.

Going back to the heap dump, if you take a look at one of those Ruby strings that contain "Not enough bytes available" you could look at the things that reference it, and then look at the things that reference the things that reference it and so on and see if anything jumps out.

Badger,

I wanted to update you on what we've done. Instead of outputting from fluentd --> logstash via the tcp input and fluent codec, we've essentially shimmed redis in between the two, using logstash-input-redis.

This removes both the tcp input and the fluent codec from the equation, and things have been working VERY well. I don't know specifically which piece was to blame, but things are working great for our needs now.
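For reference, the logstash side of the new setup looks roughly like this (the host, port, key, and codec shown here are illustrative rather than our exact values; the codec in particular depends on how fluentd serializes events into redis):

input {
  redis {
    host => "127.0.0.1"
    port => 6379
    data_type => "list"
    key => "sflow"
    codec => "json"
  }
}

fluentd pushes each event onto that redis list and logstash pops them off, so neither the tcp input nor the fluent codec is involved anymore.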

Thanks for taking the time, Badger.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.