Logstash memory consumption and swap memory issues

Hi all,

I've been using Logstash 7.17.8 in Docker, and for a while now I've been trying to minimize its memory and swap usage, without much success.

I have 2 file inputs, one gelf input, and 2 small filters.
I tried to minimize the config (I set max_open_files=1 because I expect only 1 file in each input):

input {
    gelf {
        host => "0.0.0.0"
        port_udp => 12201
    }
    file {
        path => "path1"
        add_field => { "component" => "component1" }
        add_field => { "priority" => "HIGH" }
        max_open_files => 1

        # Any line not starting with a timestamp should be merged with the previous line
        codec => multiline {
          # Regex for YYYY-MM-dd HH-mm-ss,SSS
          pattern => "^([0-9]{4})-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]) (2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9]),([0-9]{3})"
          negate => true
          what => "previous"
        }
    }
    file {
        path => "path2"
        add_field => { "component" => "component2" }
        add_field => { "priority" => "HIGH" }
        max_open_files => 1

        # Any line not starting with a timestamp should be merged with the previous line
        codec => multiline {
          # Regex for YYYY-MM-dd HH-mm-ss,SSS
          pattern => "^([0-9]{4})-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]) (2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9]),([0-9]{3})"
          negate => true
          what => "previous"
        }
    }
}

filter {
    if [component] == "component1" or [component] == "component2" {
        dissect {
          mapping => {
            "message" => "%{timestamp} %{+timestamp} [%{pid}] [%{module}] %{level} - %{message}"
          }
        }
    } else {
        if [level] == "ERROR" and "java.sql.SQLException" in [StackTrace] {
             mutate {
                gsub => ["field1", "pattern1", "pattern2"]
             }
        }
    }
}

logstash.yml:

api.enabled: true
http.host: 0.0.0.0
pipeline.batch.delay: 1000
pipeline.batch.size: 30
pipeline.workers: 1
xpack.monitoring.enabled: false

The GC used is CMS.

After a lot of trial and error, I have limited the container to 512mb memory, and the heap to 256mb (Xmx and Xms).
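For reference, the relevant docker-compose fragment looks roughly like this (an illustrative sketch, not my exact file - the service name and values are simplified):

# compose v2 style: mem_limit caps the container, LS_JAVA_OPTS sets the heap
version: "2.4"
services:
  logstash:
    image: logstash:7.17.8
    mem_limit: 512m
    environment:
      LS_JAVA_OPTS: "-Xms256m -Xmx256m"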

The container uses up the entire 512mb right away (in a Java profiler I saw the heap taking 250mb and non-heap around 200mb), but the main problem is that it's using swap memory.

I ran it for a few hours with constant input (10 messages every 3 sec); swap usage was 500mb and went up to 1gb+.

The messages were sent through the gelf input. They don't go through the dissect or gsub filters, so I don't think that's the cause.

I have a custom java output which forwards the messages to a 3rd party. I thought that was the problem so I replaced it with stdout. Same issue.

Another strange thing I saw: when checking the node stats API, there were 101 open file descriptors, even though I only have 2 file inputs with max_open_files=1.
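(For reference, I'm reading that number from the node stats API, roughly like this - 9600 is the default API port:)

curl -s 'http://localhost:9600/_node/stats/process?pretty'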

  1. How do I decrease the memory usage? I read a lot of posts about it and tried many things but problem persists.
  2. What's the reason for the 100 file descriptors?

Are you using the file inputs? It is not clear. If, besides the gelf input, you are also reading files using the file input, then you need to evaluate everything together.

Can you provide more context?

Also, any reason to set the pipeline.batch.delay to 1 second? This setting rarely needs to be changed.

I do not use docker, but can you not disable the swap for the logstash container? Will it get OOM errors if you disable it?
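In a compose v2 file that would be something along these lines (a sketch only; the service name is an example):

# setting memswap_limit equal to mem_limit disables swap for the container
services:
  logstash:
    mem_limit: 512m
    memswap_limit: 512m   # memory + swap combined; equal to mem_limit => no swap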


Yes, I'm using the file inputs in my production scenario. I just meant that the 10 messages / 3 sec I sent for the memory test went through the gelf input, not the files.

Regarding the 1 sec - no special reason; I can use the default value if you think it could help.

And yes, if I disable swap for the container, Logstash crashes on OOM repeatedly.

I suggest you get a heap dump when the OOM occurs. See this thread for what to do with the dump. This thread shows what the report will look like.

Hi @Badger, thanks for your answer.

It seems that the container doesn't write a heap dump when it's killed on OOM, even though I have -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=/usr/share/logstash/logs/heapdump.hprof configured.
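For context, the flags are passed roughly like this (an illustrative compose fragment, not my exact file):

services:
  logstash:
    environment:
      LS_JAVA_OPTS: "-Xms256m -Xmx256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/share/logstash/logs/heapdump.hprof"
    volumes:
      - ./logstash-logs:/usr/share/logstash/logs   # so a dump would survive a container restart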

The OOM comes from the docker-compose memory limit, mem_limit: 512mb.

Here is the OOM in syslog:

Apr  7 13:15:52 xsense kernel: [350834.594560] Memory cgroup stats for /docker/4f3433be1f77df617e57322adfa07bbf65f32f4f8218d6e4c482beb1e20af99b: cache:12KB rss:519176KB rss_huge:223232KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:519088KB inactive_file:8KB active_file:4KB unevictable:0KB
Apr  7 13:15:52 xsense kernel: [350834.594571] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Apr  7 13:15:52 xsense kernel: [350834.594691] [26011]  1000 26011   990379   134189  1490944        0             0 java
Apr  7 13:15:52 xsense kernel: [350834.594702] Memory cgroup out of memory: Kill process 26011 (java) score 1026 or sacrifice child
Apr  7 13:15:52 xsense kernel: [350834.744256] Killed process 26011 (java) total-vm:3961516kB, anon-rss:515168kB, file-rss:21588kB, shmem-rss:0kB
Apr  7 13:15:52 xsense kernel: [350834.906108] oom_reaper: reaped process 26011 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

and in docker events:

2023-04-07T13:38:15.522116400Z container oom 4f3433be1f77df617e57322adfa07bbf65f32f4f8218d6e4c482beb1e20af99b (com.docker.compose.config-hash=de2fa4e54c7dc3c9cd7109e27f622a3ef157053998420cdfa78a6e89a9a77858, com.docker.compose.container-number=1, com.docker.compose.oneoff=False, com.docker.compose.project=compose, com.docker.compose.project.config_files=docker-compose.yaml, com.docker.compose.project.working_dir=/opt/sensor/active/compose, com.docker.compose.service=logstash, com.docker.compose.version=1.28.5, image=logstash:7.17.8, name=logstash, org.label-schema.build-date=2022-11-30T16:12:02+00:00, org.label-schema.license=Elastic License, org.label-schema.name=logstash, org.label-schema.schema-version=1.0, org.label-schema.url=https://www.elastic.co/products/logstash, org.label-schema.vcs-url=https://github.com/elastic/logstash, org.label-schema.vendor=Elastic, org.label-schema.version=7.17.8, org.opencontainers.image.created=2022-11-30T16:12:02+00:00, org.opencontainers.image.description=Logstash is a free and open server-side data processing pipeline that ingests data from a multitude of sources, transforms it, and then sends it to your favorite 'stash.', org.opencontainers.image.licenses=Elastic License, org.opencontainers.image.title=logstash, org.opencontainers.image.vendor=Elastic, org.opencontainers.image.version=7.17.8)

In the container logs I don't see any errors; it just restarts, and no heap dump is created.

Maybe that means the OOM isn't caused by the heap, but by non-heap memory growing?


Another thing that doesn't add up.

I tried running it with mem_limit: 640mb and -Xms256m -Xmx384m.
Using the internal Logstash API, I get:

"jvm" : {
    "threads" : {
      "count" : 37,
      "peak_count" : 39
    },
    "mem" : {
      "heap_used_percent" : 52,
      "heap_committed_in_bytes" : 330166272,
      "heap_max_in_bytes" : 389283840,
      "heap_used_in_bytes" : 204917496,
      "non_heap_used_in_bytes" : 191164768,
      "non_heap_committed_in_bytes" : 215511040,
      "pools" : {
        "survivor" : {
          "peak_max_in_bytes" : 13369344,
          "peak_used_in_bytes" : 13369344,
          "used_in_bytes" : 2311608,
          "max_in_bytes" : 13369344,
          "committed_in_bytes" : 13369344
        },
        "old" : {
          "peak_max_in_bytes" : 268435456,
          "peak_used_in_bytes" : 156364728,
          "used_in_bytes" : 156364728,
          "max_in_bytes" : 268435456,
          "committed_in_bytes" : 209317888
        },
        "young" : {
          "peak_max_in_bytes" : 107479040,
          "peak_used_in_bytes" : 107479040,
          "used_in_bytes" : 46241160,
          "max_in_bytes" : 107479040,
          "committed_in_bytes" : 107479040
        }
      }
    },
    "gc" : {
      "collectors" : {
        "old" : {
          "collection_count" : 5,
          "collection_time_in_millis" : 2201
        },
        "young" : {
          "collection_count" : 31,
          "collection_time_in_millis" : 1475
        }
      }
    }

heap_used_in_bytes + non_heap_used_in_bytes is 204917496 + 191164768 = 377mb.
In docker stats I see

CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
d6457a6d5478   logstash   1.20%    628.1MiB / 640MiB   98.15%    1.55MB / 337kB   27.3MB / 32.3MB   51

If heap + non-heap is 377mb, how come the container uses 628mb?

It doesn't seem like the heap is the problem, but some other memory issue.

I think the more appropriate calculation is heap_committed_in_bytes + non_heap_committed_in_bytes: 330166272 + 215511040 = 520 MB. If the OS and other processes are using 100 MB (which seems reasonable) then that would run you out of memory.

The JVM is not getting an OOM exception, the oom_killer is harvesting it. That appears to use SIGKILL (kill -9) so the JVM cannot catch it.

I would suggest removing all the file inputs and filters and see if the gelf input by itself results in this issue. If not, add bits of the configuration back in until that non-heap memory number significantly increases.
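Something like this stripped-down pipeline would do (the gelf settings are copied from your config, the output is just a plain stdout):

input {
    gelf {
        host => "0.0.0.0"
        port_udp => 12201
    }
}
output {
    stdout {}
}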

Well, I tried again with only the GELF input and STDOUT output. As you suggested I looked at the committed value rather than the used value.

The thing is, both non_heap_committed_in_bytes and heap_committed_in_bytes just keep growing. non_heap_committed_in_bytes started at 170mb (with just GELF and STDOUT) and grew to 190mb.
And after about 30 min I got the OOM with the 512mb limit.

I really don't understand how to prevent it.
It seems that with any memory limit, it will eventually just keep growing until it hits it.

The gelf input requires gelfd2 0.4.1, which seems to be the latest version. Almost the only change gelfd2 made relative to dqminh/gelfd is a rewrite of the chunking code to avoid a memory leak :)

Does your JVM have something similar to Hotspot's Native Memory Tracking?
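If it is HotSpot, the rough workflow would be (run from /usr/share/logstash inside the container, where the Logstash JVM is PID 1):

# start Logstash with -XX:NativeMemoryTracking=summary (e.g. via LS_JAVA_OPTS), then:
jdk/bin/jcmd 1 VM.native_memory baseline        # record a baseline
# ...wait while memory grows...
jdk/bin/jcmd 1 VM.native_memory summary.diff    # show what changed since the baseline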

I ran Logstash with -XX:NativeMemoryTracking=summary. With the current config (GELF input, STDOUT output) it always takes about 2 hours for it to crash.
params: mem_limit: 512mb, -Xms128m, -Xmx128m.

I ran jcmd <pid> VM.native_memory baseline when the usage was around 390mb.
At some point it jumped to 490mb, so I ran jdk/bin/jcmd 1 VM.native_memory summary.diff | grep +, and this is what I got:

Total: reserved=1618216KB +25758KB, committed=346116KB +30166KB
-                     Class (reserved=1052553KB +22KB, committed=19337KB +470KB)
                            (classes #19463 +638)
                            (  instance classes #18527 +612, array classes #936 +26)
                            (malloc=3977KB +22KB #67872 -3362)
                            (mmap: reserved=1048576KB, committed=15360KB +448KB)
                            (    reserved=81920KB +8192KB, committed=74432KB +3456KB)
                            (    used=72542KB +3437KB)
                            (    waste=1890KB =2.54% +19KB)
                            (    reserved=1048576KB, committed=15360KB +448KB)
                            (    used=13645KB +370KB)
                            (    waste=1715KB =11.16% +78KB)
-                    Thread (reserved=41227KB +6186KB, committed=4219KB +694KB)
                            (stack: reserved=41112KB +6168KB, committed=4104KB +676KB)
                            (malloc=70KB +11KB #244 +36)
                            (arena=45KB +7 #78 +12)
-                      Code (reserved=249201KB -1792KB, committed=54985KB +12396KB)
                            (mmap: reserved=247688KB, committed=53472KB +14188KB)
-                  Compiler (reserved=560KB +205KB, committed=560KB +205KB)
                            (malloc=396KB +205KB #599 -92)
-                  Internal (reserved=940KB +348KB, committed=940KB +348KB)
                            (malloc=904KB +348KB #19811 +2590)
-                     Other (reserved=120KB +56KB, committed=120KB +56KB)
                            (malloc=120KB +56KB #20 +7)
-                    Symbol (reserved=14045KB +304KB, committed=14045KB +304KB)
                            (malloc=12886KB +176KB #402553 +6993)
                            (arena=1159KB +128 #1)
-    Native Memory Tracking (reserved=7950KB +7KB, committed=7950KB +7KB)
                            (malloc=11KB +1KB #166 +19)
                            (tracking overhead=7939KB +5KB)
-                   Tracing (reserved=14507KB +14474KB, committed=14507KB +14474KB)
                            (malloc=14475KB +14474KB #200 +194)
-                    Module (reserved=1235KB +141KB, committed=1235KB +141KB)
                            (malloc=1235KB +141KB #5401 +431)
-           Synchronization (reserved=269KB +44KB, committed=269KB +44KB)
                            (malloc=269KB +44KB #2754 +424)
-            Serviceability (reserved=2KB +1KB, committed=2KB +1KB)
                            (malloc=2KB +1KB #14 +8)
-                 Metaspace (reserved=82853KB +8311KB, committed=75365KB +3575KB)
                            (malloc=933KB +119KB #1179 +224)
                            (mmap: reserved=81920KB +8192KB, committed=74432KB +3456KB)
-           Object Monitors (reserved=75KB +11KB, committed=75KB +11KB)
                            (malloc=75KB +11KB #370 +52)

So the diff seems to be about 55mb.

Are there any known memory leaks in the off-heap? Anything I can do to prevent it from growing?

I also tried removing unused plugins in the Dockerfile build, hoping to reduce the off-heap usage, but the plugin removal doesn't change the memory used.
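(The removal itself was along these lines - the plugin names here are just examples, not the exact list:)

FROM logstash:7.17.8
# example only: drop plugins the pipeline does not use
RUN bin/logstash-plugin remove logstash-filter-geoip && \
    bin/logstash-plugin remove logstash-filter-translate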

I don't think so. Committed memory (memory in use) is a subset of reserved memory (memory the JVM has told the OS it may need), so it only grew by 25 MB. This article will tell you some more. Either way, it does not account for an increase of 100 MB.

I am at a loss as to what to look at next (I have never used docker).

Yep, you're right.
I checked again (also in top) and this is what I see - total used now is 344mb.

Total: reserved=1605696KB, committed=344004KB
       malloc: 47704KB #518240
       mmap:   reserved=1557992KB, committed=296300KB

-                 Java Heap (reserved=131072KB, committed=131072KB)
                            (mmap: reserved=131072KB, committed=131072KB)

-                     Class (reserved=1052832KB, committed=19744KB)
                            (classes #19509)
                            (  instance classes #18573, array classes #936)
                            (malloc=4256KB #75604)
                            (mmap: reserved=1048576KB, committed=15488KB)
                            (  Metadata:   )
                            (    reserved=81920KB, committed=74688KB)
                            (    used=72616KB)
                            (    waste=2072KB =2.77%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=15488KB)
                            (    used=13657KB)
                            (    waste=1831KB =11.82%)

-                    Thread (reserved=36072KB, committed=3604KB)
                            (thread #35)
                            (stack: reserved=35972KB, committed=3504KB)
                            (malloc=61KB #214)
                            (arena=39KB #68)

-                      Code (reserved=250201KB, committed=61469KB)
                            (malloc=2513KB #8816)
                            (mmap: reserved=247688KB, committed=58956KB)

-                        GC (reserved=455KB, committed=455KB)
                            (malloc=23KB #78)
                            (mmap: reserved=432KB, committed=432KB)

-                  Compiler (reserved=506KB, committed=506KB)
                            (malloc=341KB #658)
                            (arena=165KB #5)

-                  Internal (reserved=941KB, committed=941KB)
                            (malloc=905KB #19950)
                            (mmap: reserved=36KB, committed=36KB)

-                     Other (reserved=82KB, committed=82KB)
                            (malloc=82KB #13)

-                    Symbol (reserved=14049KB, committed=14049KB)
                            (malloc=12890KB #402912)
                            (arena=1159KB #1)

-    Native Memory Tracking (reserved=8108KB, committed=8108KB)
                            (malloc=10KB #151)
                            (tracking overhead=8098KB)

-        Shared class space (reserved=12288KB, committed=12116KB)
                            (mmap: reserved=12288KB, committed=12116KB)

-               Arena Chunk (reserved=187KB, committed=187KB)
                            (malloc=187KB)

-                   Tracing (reserved=14507KB, committed=14507KB)
                            (malloc=14475KB #200)
                            (arena=32KB #1)

-                    Module (reserved=1212KB, committed=1212KB)
                            (malloc=1212KB #5348)

-                 Safepoint (reserved=8KB, committed=8KB)
                            (mmap: reserved=8KB, committed=8KB)

-           Synchronization (reserved=272KB, committed=272KB)
                            (malloc=272KB #2792)

-            Serviceability (reserved=1KB, committed=1KB)
                            (malloc=1KB #12)

-                 Metaspace (reserved=82858KB, committed=75626KB)
                            (malloc=938KB #1136)
                            (mmap: reserved=81920KB, committed=74688KB)

-      String Deduplication (reserved=1KB, committed=1KB)
                            (malloc=1KB #8)

-           Object Monitors (reserved=45KB, committed=45KB)
                            (malloc=45KB #221)

And top shows almost 500mb:
[screenshot of top output]
