Logstash service stopped without any specific error

Hello, I've been handling Logstash this week, and the service keeps shutting down at night (around 8pm-10pm). I enabled debug logging, but I can't find any error entry that would explain the stopped service.
Here is the last log before the service stopped; I'll include only the important parts, since my logs are in JSON.

"message": "[local: ip1:33000, remote: ip2:34212] Sending a new message for the listener, sequence: 88",
    "input": {
      "type": "log"
    },
    "logstash": {
      "log": {
        "module": "org.logstash.beats.BeatsHandler",
        "log_event": {},
        "thread": "defaultEventExecutorGroup-230-3"
      }
    },
    "message": "filters/LogStash::Filters::Mutate: removing field",
    "input": {
      "type": "log"
    },
    "logstash": {
      "log": {
        "module": "logstash.filters.mutate",
        "log_event": {
          "field": "@timestamp"
        },
        "thread": "[app]>worker0"
      }
    }
"message": "Running grok filter",
    "input": {	
      "type": "log"
    },
    "logstash": {
      "log": {
        "module": "logstash.filters.grok",
        "log_event": {
          "event": {
            "metaClass": {
              "metaClass": {
                "metaClass": {
                  "event": "2021-11-29T15:26:16.021Z 10.58.16.129 1141 <14>1 2021-11-29T15:25:50.953847+00:00 app.backend 95625168-98d9-4f90-888e-0f93de836a84 [APP/PROC/WEB/0] - [tags@47450 app_id=\"95625168-98d9-4f90-888e-0f93de836a84\" app_name=\"app.backend\" deployment=\"cf-fb405d0829d833e3cbb8\" index=\"f334fa17-8f51-48c3-894d-ee584a7372e1\" instance_id=\"0\" ip=\"10.58.16.129\" job=\"job_job\" organization_id=\"ef732656-7fc7-4a65-a06f-8c6627b31b63\" organization_name=\"app.backend\" origin=\"rep\" process_id=\"95625168-98d9-4f90-888e-0f93de836a84\" process_instance_id=\"5161608f-1f93-4745-7697-6cb6\" process_type=\"web\" product=\"product_product\" source_id=\"95625168-98d9-4f90-888e-0f93de836a84\" source_type=\"APP/PROC/WEB\" space_id=\"30e72328-6611-4d91-bee1-a20cfee17faa\" space_name=\"backend\" system_domain=\"domain.co.id\"] [DEBUG] 2021-11-29 22:25:50.953 org.apache.kafka.clients.FetchSessionHandler$Builder.build(FetchSessionHandler.java:252) - [Consumer clientId=consumer-10, groupId=app.group.general] Built incremental fetch (sessionId=527411778, epoch=29986) for node 3. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 3 partition(s)"
                }
              }
            }
          }
        },
        "thread": "input|syslog|tcp|ip:39698}"
      }
    }

That's the last log I can read before the service stopped, and I don't think it's an error log. Can someone give me a clue about what to look into for this case? I'm stuck here, and I have to restart the Logstash service every day because of this.

Thanks in advance

Does anyone have any insight into my problem?

How did you install it: rpm, deb, or tar.gz?

How did you start it?

I see those are Cloud Foundry logs. Are you running Logstash inside Cloud Foundry or on a separate VM?

What version of Logstash?

I understand this is the last log, but did you look back to see whether you received any SIGHUP signals earlier?

What, how many, and how complex are the Logstash pipelines?

Are you monitoring Logstash with Metricbeat? You could see what is happening.

Are there any security agents or scans against the VM?

There could be many reasons...

The last case like this I saw was on Filebeat: a Qualys scan ran nightly, did not recognize the process, and killed it. I'm not saying it is that, but it could be many things.

Perhaps @Badger might have some thoughts...

I installed with --no-index because my server is isolated (RHEL 7).

I run it with root permission, using a .sh script. I have usually run it this way without any problem; the reason I run it as root is that the listener won't start without root.

I run my Logstash on an isolated VM.

It's on version 7.11.2.

I read the logs for roughly the last 30 minutes before the stop, and they only show the usual Logstash input and output information.

I have around 100 pipelines in my Logstash, but I had no problems for years; this problem just popped up out of nowhere.

I only use self-monitoring on my cluster, so it only covers the Elasticsearch, Kibana, Logstash, and Filebeat processes.

No, there aren't any security agents.

I don't really know about a Filebeat Qualys scan; is this a module in Filebeat?

Qualys is a commercial security scanner, nothing to do with Filebeat; it was just an example. It was a new security scanning tool implemented by the customer's company that looked for unknown processes running as root... and killed them.

In that customer's case it was actually Metricbeat, and the scanning software was killing it.

I was just giving an example: look for new/unexpected changes.

I don't know what --no-index is.

So you don't have any metrics about the VM? Can you see whether there is a memory or CPU spike? Perhaps that might help.

Well, something changed :slight_smile: ... Is there a new backup policy? Did something change in the environment: underlying virtualization, volumes of data, etc.?

It has often been my experience that when "nothing changed"... something did change... somewhere...

I don't think there were any recent changes related to that.

--no-index is a yum install parameter on RHEL to install a package without using the internet (it uses a downloaded package instead).

I have Metricbeat, but it doesn't include CPU monitoring for size reasons. I'll keep this in mind and try to include the CPU monitor ASAP.
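If I understand it right, that should just mean turning on Metricbeat's system module on this VM, roughly like this (a sketch only, assuming a default Metricbeat install; the system module's cpu and memory metricsets are enabled by default):

# Sketch: turn on host CPU/memory metrics via Metricbeat's system module
sudo metricbeat modules enable system
sudo systemctl restart metricbeat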

For this one, I should ask everyone involved in handling the ELK stack on my side, since I myself haven't made any recent changes to it.


It might be helpful to look through the process accounting logs. If I start a logstash service and then kill the process, the footprint in the logs is

java             S   X logstash __        28.85 secs Tue Nov 30 12:25
sudo             S     root     pts/0      0.00 secs Tue Nov 30 12:26
kill             S     root     pts/0      0.00 secs Tue Nov 30 12:26

It's unlikely you will see anything as obvious as that, but having an idea of when logstash exits, and what else is happening on the system around that time, might help.
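If process accounting is not already enabled, on RHEL 7 it comes from the psacct package; a minimal sketch (package/service names are my assumption for that distro):

# Sketch: enable BSD process accounting, then query it with lastcomm
sudo yum install psacct
sudo systemctl enable psacct
sudo systemctl start psacct
# later, after logstash has died:
sudo lastcomm java
sudo lastcomm --user root | head -50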

Another option would be something like

strace -f <logstash command> 2>&1 > /dev/null | tail -10000 > /tmp/strace.log &

That discards stdout but sends stderr (the strace output) through the pipe and retains the last 10000 lines of it. Somewhere in that strace output you would see something like

[pid  4328] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=3783, si_uid=1000} ---

for a kill or

[pid  4384] +++ killed by SIGKILL +++

for a kill -9. Maybe add a grep SIG | before the tail.
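Put together, with the <logstash command> placeholder kept as-is, that pipeline would look something like:

strace -f <logstash command> 2>&1 > /dev/null | grep SIG | tail -10000 > /tmp/strace.log &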


I will look into this and report back after I try this method.

Yesterday I set up Metricbeat, and today Logstash died again. I checked the CPU and memory usage at the time it died.
Here's the CPU usage; it seems normal, with no peak (keep in mind that the CPU usage drops off because Logstash died).

The weird one is the memory usage. In this case Logstash died at around 22.00, and the memory usage dropped to around 10 GB; but after it was started again, the memory usage gradually increased until it reached the maximum memory capacity (around 31 GB), and then it died again. So the possible reason it died is memory overload.

Meanwhile, I have set up jvm.options on Logstash like this:

-Xms8g
-Xmx8g

-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djruby.compile.invokedynamic=true
-Djruby.jit.threshold=0
-Djruby.regexp.interruptible=true

-XX:+HeapDumpOnOutOfMemoryError
-Djava.security.egd=file:/dev/urandom
-Dlog4j2.isThreadContextMapInheritable=true

Do you have any suggestions about my jvm.options? (I've set the heap to 8 GB, but Logstash has used up to about 20 GB of memory in total.)

Thanks in advance as always

I found that the bundled JDK sometimes does not dump the heap on OOM. I suspect that this is because it does not pre-allocate all the memory it needs to do so. Now admittedly I run a t2.micro with 1 GB of memory, so it may not have space to allocate native memory that it needs once the heap is maxed.

That said, even if you do not get a heap dump you should get an hs_err_<pid>.log in the working directory of the JVM.
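If the JVM's working directory is not obvious, the fatal-error log and any heap dump can also be searched for; a minimal sketch:

# Sketch: look for JVM fatal-error logs and heap dumps written in the last 2 days
sudo find / -xdev \( -name 'hs_err_pid*.log' -o -name 'java_pid*.hprof' \) -mtime -2 2>/dev/null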

Since you asked for comments....

-XX:CMSInitiatingOccupancyFraction=75

When Sun first created this option I think the default was 65. Then folks realized that if your app needed 4 GB of heap, you had to allocate 6 GB or it would spend 100% of its time running concurrent GC sweeps. Sun/Oracle gradually increased the default to 95%. I have seen corner cases where a value below 95% is warranted, but they are very rare.

That said, I cannot see a reason to run CMS in this day and age. G1 is better and much simpler to tune.
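For example, switching from CMS to G1 can be as simple as commenting out the three CMS flags in jvm.options and adding -XX:+UseG1GC; a sketch only (the path assumes an rpm install, and I would back the file up first):

# Sketch: comment out the CMS flags and enable G1 instead
sudo cp /etc/logstash/jvm.options /etc/logstash/jvm.options.bak
sudo sed -i \
  -e 's/^-XX:+UseConcMarkSweepGC/#&/' \
  -e 's/^-XX:CMSInitiatingOccupancyFraction/#&/' \
  -e 's/^-XX:+UseCMSInitiatingOccupancyOnly/#&/' \
  /etc/logstash/jvm.options
echo '-XX:+UseG1GC' | sudo tee -a /etc/logstash/jvm.options
sudo systemctl restart logstash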


Hello, sorry for the late reply; I was trying every method I could think of.
After checking further, I found that my Logstash was running with the VM's built-in Java (not the Java bundled with Logstash). I tried running Logstash with the bundled Java, but the service still died.

I guess that since the Logstash service was killed by the Linux kernel because of OOM, the hs_err log wasn't created; I checked the JVM's working directory and couldn't find any log there.

I also tried this method, but the service still died.

I asked my coworkers about recent changes. There was a newly created pipeline, and I have already disabled it, yet the service still stopped.

Since the issue persists, I think the Java process is experiencing a memory leak. I don't know where it comes from, but I will try a fresh Logstash installation or try changing the JDK; I still don't know whether that will work or not.
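To double-check the OOM-killer theory, the kernel log records these kills explicitly; a minimal sketch, assuming dmesg and journald are available on RHEL 7:

# Sketch: look for kernel OOM-killer entries around the time Logstash died
dmesg -T | grep -i -E 'out of memory|killed process'
sudo journalctl -k --since "yesterday" | grep -i oom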

Can you update Logstash and try 7.15.2?

Can I use Logstash 7.15.2 alongside Elasticsearch and Kibana 7.11.2?

You should be able to, assuming you're not doing anything super unusual. If you're just pulling in data, transforming it, and pushing it to your own index, it should probably be okay.

Certainly worth a test, I would think.

I checked my Logstash heap.

this is my "htop" command on linux:
image

And this is my full cluster memory usage (the chart is different from the last screenshot because I just increased my memory by 32 GB).

I want to ask about the heap on Logstash: what is the difference between the Logstash heap and the memory usage on Linux? What I can see is that the htop stats match the memory usage stats (which are not normal), even though I set -Xmx8g in jvm.options, so I guess the heap itself is normal(?).
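For reference, the JVM heap as Logstash sees it can be read from the monitoring API and compared with the resident memory that htop reports; a sketch, assuming the API is on its default port 9600 and that the process can be matched by its main class:

# JVM heap as Logstash reports it (jvm.mem.heap_used_in_bytes vs heap_max_in_bytes)
curl -s 'localhost:9600/_node/stats/jvm?pretty'
# Resident memory of the Logstash JVM as the OS sees it (what htop shows)
ps -o pid,rss,vsz,cmd -p "$(pgrep -d, -f org.logstash.Logstash)"

(-Xmx caps only the Java heap; the process RSS that htop reports also includes JVM overhead and off-heap buffers, which is why the total can be well above 8 GB.)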

Hello, sorry for the very late reply. After learning more about the JVM, I have come to the conclusion that my Logstash was receiving too much input while the JVM couldn't clear garbage fast enough, resulting in the OS running out of memory. The kernel then kills whichever process is taking too much memory, and that process was Logstash. I solved the issue by upgrading the memory and CPU, and by tuning the JVM as well: I switched to G1GC, increased the JVM heap, and increased the young generation size so that the old generation doesn't grow too large while waiting for the GC to recover.

Big thanks to @stephenb and @Badger for helping me realize this issue.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.