Logstash throws java.lang.OutOfMemoryError: Java heap space no matter the heap size

I'm attempting to parse a huge CSV file (a few million lines) with Logstash and output it to Elasticsearch.

[FATAL] 2023-04-16 19:00:19.011 [LogStash::Runner] Logstash -
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapCharBuffer.<init>(java/nio/HeapCharBuffer.java:61) ~[?:?]
        at java.nio.CharBuffer.allocate(java/nio/CharBuffer.java:348) ~[?:?]
        at java.nio.charset.CharsetDecoder.decode(java/nio/charset/CharsetDecoder.java:807) ~[?:?]
        at java.nio.charset.Charset.decode(java/nio/charset/Charset.java:814) ~[?:?]
        at org.jruby.RubyEncoding.decodeUTF8(org/jruby/RubyEncoding.java:297) ~[jruby-complete-9.2.20.1.jar:?]
        at org.jruby.RubyString.decodeString(org/jruby/RubyString.java:802) ~[jruby-complete-9.2.20.1.jar:?]
        at org.jruby.RubyString.toString(org/jruby/RubyString.java:793) ~[jruby-complete-9.2.20.1.jar:?]
        at org.logstash.Javafier.lambda$initConverters$1(org/logstash/Javafier.java:88) ~[logstash-core.jar:?]
        at org.logstash.Javafier$$Lambda$586/0x00000001012b4c40.convert(org/logstash/Javafier$$Lambda$586/0x00000001012b4c40) ~[?:?]
        at org.logstash.Javafier.deep(org/logstash/Javafier.java:57) ~[logstash-core.jar:?]
        at org.logstash.Event.getField(org/logstash/Event.java:177) ~[logstash-core.jar:?]
        at org.logstash.StringInterpolation.evaluate(org/logstash/StringInterpolation.java:86) ~[logstash-core.jar:?]
        at org.logstash.Event.sprintf(org/logstash/Event.java:363) ~[logstash-core.jar:?]
        at org.logstash.ext.JrubyEventExtLibrary$RubyEvent.sprintf(org/logstash/ext/JrubyEventExtLibrary.java:202) ~[logstash-core.jar:?]
        at java.lang.invoke.DirectMethodHandle$Holder.invokeSpecial(java/lang/invoke/DirectMethodHandle$Holder) ~[?:?]
        at java.lang.invoke.LambdaForm$MH/0x0000000100780840.invoke(java/lang/invoke/LambdaForm$MH) ~[?:?]
        at java.lang.invoke.DelegatingMethodHandle$Holder.delegate(java/lang/invoke/DelegatingMethodHandle$Holder) ~[?:?]
        at java.lang.invoke.LambdaForm$MH/0x0000000100737c40.guard(java/lang/invoke/LambdaForm$MH) ~[?:?]
        at java.lang.invoke.DelegatingMethodHandle$Holder.delegate(java/lang/invoke/DelegatingMethodHandle$Holder) ~[?:?]
        at java.lang.invoke.LambdaForm$MH/0x0000000100737c40.guard(java/lang/invoke/LambdaForm$MH) ~[?:?]
        at java.lang.invoke.Invokers$Holder.linkToCallSite(java/lang/invoke/Invokers$Holder) ~[?:?]
        at usr.share.logstash.logstash_minus_core.lib.logstash.util.decorators.add_fields(/usr/share/logstash/logstash-core/lib/logstash/util/decorators.rb:34) ~[?:?]
        at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java/lang/invoke/DirectMethodHandle$Holder) ~[?:?]
        at java.lang.invoke.LambdaForm$MH/0x00000001012b7840.invoke(java/lang/invoke/LambdaForm$MH) ~[?:?]
        at java.lang.invoke.Invokers$Holder.invokeExact_MT(java/lang/invoke/Invokers$Holder) ~[?:?]
        at org.jruby.RubyArray.each(org/jruby/RubyArray.java:1821) ~[jruby-complete-9.2.20.1.jar:?]
        at java.lang.invoke.LambdaForm$DMH/0x0000000100763040.invokeVirtual(java/lang/invoke/LambdaForm$DMH) ~[?:?]
        at java.lang.invoke.LambdaForm$MH/0x0000000100780840.invoke(java/lang/invoke/LambdaForm$MH) ~[?:?]

This error is thrown after a few minutes of running Logstash.
There are similar threads asking about this very question, but none of them are properly answered.

Inside my /etc/logstash/jvm.options:
-Xms2g
-Xmx2g

My machine has 8 GB of RAM. I have tried lowering the heap and I have tried raising it, but nothing helps. If I set it too high, after a while Logstash just crashes with "Killed".

/etc/elasticsearch/jvm.options.d/jvmheap.options has:
-Xms2g
-Xmx2g

I don't know what to do, as I really cannot afford to buy a server with more RAM, but I need to parse this file no matter what.

What does your configuration look like? Inputs, filters...

input {
    file {
        path => "path/to/file.csv"
        start_position => "beginning"
    }
}
filter {
  csv {
    autodetect_column_names => false
    columns => ["username", "uid"]
    target => "_tmp"
  }
  mutate {
    add_field => {
      "[data][username]" => "%{[_tmp][username]}"
      "[data][uid]" => "%{[_tmp][uid]}"
    }
  }
  mutate {
    remove_field => ["_tmp"]
  }
  prune {
    whitelist_names => [ "data" ]
  }
}
output {
    elasticsearch {
        hosts => ["http://localhost:9200/"]
        index => "uids"
    }
    stdout {}
}

I would not expect that configuration to need more than 300 MB to run in! You can simplify the filters a little:

csv {
  columns => ["username", "uid"]
  target => "data"
  autogenerate_column_names => false
}

This will only parse the first two columns of the CSV file, so you can remove the mutate filters and even the prune. The OOM is happening in the add_field (note the add_fields call from decorators.rb in your stack trace), although I doubt removing it will change much.

Reducing pipeline.batch.size from the default of 125 might help.
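Logstash keeps roughly pipeline.batch.size × pipeline.workers events in flight at a time, so a smaller batch caps how many decoded events sit in the heap at once. A minimal sketch of what that could look like (25 is just an illustrative value):

# /etc/logstash/logstash.yml
pipeline.batch.size: 25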

There are multiple columns in the CSV file and I'm extracting only the ones I need; unfortunately, getting just the first two columns would not work.

Edit: Decreasing pipeline.batch.size did not help :^(

But that is what your filter configuration would do if it didn't run out of memory! The first two columns will be named username and uid; the rest will be column3, column4, etc., and will get removed when you delete [_tmp].
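For illustration, a hypothetical input line alice,1001,extra parsed with columns => ["username", "uid"] and target => "_tmp" would produce:

[_tmp][username] => "alice"
[_tmp][uid]      => "1001"
[_tmp][column3]  => "extra"   (auto-generated name, dropped along with [_tmp])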

How about setting pipeline.workers to 1? It will be slower, but it will use fewer resources.
Also, maybe lower file_chunk_count on the file input (the default value is 4611686018427387903, i.e. effectively unlimited) so the file isn't read all at once.
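A sketch of both suggestions, using the same input as above (32 is just an illustrative chunk count; each chunk is file_chunk_size bytes, 32 KB by default):

# in /etc/logstash/logstash.yml
pipeline.workers: 1

# in the pipeline configuration
input {
    file {
        path => "path/to/file.csv"
        start_position => "beginning"
        file_chunk_count => 32    # read 32 chunks (of file_chunk_size bytes) per pass instead of the whole file
    }
}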

