Journalbeat is getting stuck

Journalbeat gets stuck for an unknown reason. This only happens on the nodes that produce more logs than the other ones; it is around 5 log lines per second, which I don't think is a big load, but still.
Here is the actual behavior:

{"level":"error","timestamp":"2022-01-25T07:06:57.785Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:02.786Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:07.885Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:12.888Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:17.985Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:22.985Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:27.986Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:32.987Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}

Journalbeat logs the "bad message" error every 5 seconds. If we take a closer look at the registry file:

[root@journalbeat-rm25g journalbeat]# cat /data/registry
update_time: 2022-01-24T00:33:46.589642248Z
journal_entries:
- path: /var/log/journal
  cursor: s=1205fae92bbb428eb0880e512d90bb5c;i=371e8e7b;b=c7ed156bb4e341dc9abed854f9e3ebe4;m=31426c080a3;t=5d648f8105d7d;x=5b1a9413ae69d0a4
  realtime_timestamp: 1642983771364733
  monotonic_timestamp: 3385084379299
[root@journalbeat-rm25g journalbeat]# date
Tue Jan 25 07:06:44 UTC 2022

As you can see, Journalbeat is not re-reading the journal at all: the update_time is not changing, it is just stuck. Also, if I SSH into the node and verify the journal:

$ sudo journalctl --verify
4fd3d98: Unused data (entry_offset==0)
4fd3e08: Unused data (entry_offset==0)
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e745502-0005d662b34e63f6.journal
4fd3eb0: Unused data (entry_offset==0)
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e702d10-0005d662a4fa1fd9.journal
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e75b763-0005d662b7d960cd.journal
4fd3c68: Unused data (entry_offset==0)
4fd3cc0: Unused data (entry_offset==0)
4fd3e30: Unused data (entry_offset==0)
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e7190da-0005d662a9c1d3db.journal
245078: Invalid data object at hash entry 77505 of 145322
File corruption detected at /var/log/journal/ec228772414251599b62b1f5e35bc22f/user-1003.journal:244d10 (of 8388608 bytes, 28%).
FAIL: /var/log/journal/ec228772414251599b62b1f5e35bc22f/user-1003.journal (Bad message)
4fd3e40: Unused data (entry_offset==0)
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e6ecabe-0005d662a0b251e8.journal
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e6d665a-0005d6629b970f32.journal
12018b0: Invalid array offset 3d98958
File corruption detected at /var/log/journal/ec228772414251599b62b1f5e35bc22f/system.journal:3d80028 (of 67108864 bytes, 96%).
FAIL: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system.journal (Bad message)
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e72f468-0005d662aec3794b.journal

And here is what we get when trying to read from the cursor stored in the registry file:

$ sudo journalctl --cursor "s=1205fae92bbb428eb0880e512d90bb5c;i=371e8e7b;b=c7ed156bb4e341dc9abed854f9e3ebe4;m=31426c080a3;t=5d648f8105d7d;x=5b1a9413ae69d0a4"
-- Logs begin at Tue 2022-01-25 06:59:34 UTC, end at Tue 2022-01-25 07:09:51 UTC. --
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit: PROCTITLE proctitle=2F7573722F62696E2F646F636B657264002D480066643A2F2F002D2D636F6E7461696E6572643D2F72756E2F636F6E7461696E6572642F636F6E7461696E6572642E736F636B
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit[1322]: SYSCALL arch=c000003e syscall=257 success=yes exit=240 a0=ffffffffffffff9c a1=c002701000 a2=80000 a3=0 items=1 ppid=1 pid=1322 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fs>
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit: CWD cwd="/"
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit: PATH item=0 name="/var/lib/docker/image/overlay2/imagedb/content/sha256/106a8e54d5eb3f70fcd1ed46255bdf232b3f169e89e68e13e4e67b25f59c1315" inode=393240 dev=ca:12 mode=0100600 ouid=0 ogid=0 rdev=00:00 >
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit: PROCTITLE proctitle=2F7573722F62696E2F646F636B657264002D480066643A2F2F002D2D636F6E7461696E6572643D2F72756E2F636F6E7461696E6572642F636F6E7461696E6572642E736F636B
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit[1322]: SYSCALL arch=c000003e syscall=257 success=yes exit=240 a0=ffffffffffffff9c a1=c002701300 a2=80000 a3=0 items=1 ppid=1 pid=1322 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fs>
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit: CWD cwd="/"

We are running Journalbeat version 7.15.0. We haven't tried Filebeat with the journald input yet, because it would require some changes in our workflow (Logstash). Based on this PR:
Remove Journalbeat by kvch · Pull Request #29131 · elastic/beats · GitHub, nothing has changed in the logic of how Journalbeat (and now Filebeat) works with journald, so I don't think switching would change anything.
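
For completeness, here is a rough sketch of what the config could look like if we did switch to the Filebeat journald input. We have not tested this; the option names are assumed to mirror our Journalbeat settings rather than verified against the Filebeat docs:

filebeat.yaml (hypothetical, untested)
filebeat.inputs:
  - type: journald
    id: system-journal            # illustrative id, not part of our current setup
    paths: ["/var/log/journal"]
    seek: cursor                  # assumed to behave like the Journalbeat seek option
    cursor_seek_fallback: tail

# processors, output.kafka and logging would stay the same as in our journalbeat.yaml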

We can also see quite a large number of errors (mostly "bad message", among others).

As you can see from the chart, these "gaps" are the periods when Journalbeat is logging these errors every 5 seconds. After a restart it works fine for a while and then breaks again. As an ad-hoc workaround we now monitor the update_time; if we see that it is not updating, we kill the journalbeat process, Kubernetes recreates the container, and everything works again for some time.

Any chance of it being resolved?

We are using EKS, and we had to restart the Pod manually from time to time. Here is the config that we are currently using to work around the issue:

journalbeat-configmap.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: journalbeat-configmap
data:
  kill.sh: |
    # Watchdog: if the registry's update_time stops advancing, send SIGTERM to the
    # journalbeat process so that Kubernetes restarts the container.
    REGISTRY_PATH=/data/registry
    MAX_DIFF_SEC=60
    JOURNALBEAT_CMD=journalbeat
    while sleep 30s; do
      # Find the journalbeat PID; skip this iteration if the process is not running yet.
      journal_pid=$(ps -x | grep "$JOURNALBEAT_CMD" | grep -v "grep" | xargs)
      if [[ -z "${journal_pid}" ]]; then
        continue
      fi
      journal_pid=$(echo "$journal_pid" | cut -d ' ' -f 1)
      if test -f "$REGISTRY_PATH"; then
        # Compare the registry's update_time against the current time.
        registry=$(date -d "$(grep "update_time" "$REGISTRY_PATH" | cut -d ' ' -f 2)" +%s)
        now=$(date +%s)
        diff=$(( now - registry ))
        if (( diff > MAX_DIFF_SEC )); then
          echo "killing journalbeat process"
          kill -15 "$journal_pid"
          sleep 60s
        fi
      fi
    done
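
For context, one way to run this script is as a sidecar that shares the Pod's process namespace with the journalbeat container. This is a simplified, illustrative sketch, not our exact DaemonSet; the watchdog image, container and volume names are placeholders:

journalbeat-daemonset.yaml (Pod template part only, illustrative)
  # ... under spec.template of the DaemonSet ...
  spec:
    # Required so the sidecar can see and signal the journalbeat process.
    shareProcessNamespace: true
    containers:
      - name: journalbeat
        image: docker.elastic.co/beats/journalbeat:7.15.0
        # ... usual journalbeat args plus volumeMounts for /data and /var/log/journal ...
      - name: registry-watchdog
        # Placeholder image: anything that provides bash, ps and GNU date will do.
        # Both containers should run as the same user (root by default) so kill -15 is permitted.
        image: your-registry/shell-toolbox:latest
        command: ["bash", "/scripts/kill.sh"]
        volumeMounts:
          - name: data
            mountPath: /data
          - name: scripts
            mountPath: /scripts
    volumes:
      - name: data
        hostPath:
          path: /var/lib/journalbeat-data    # wherever the registry volume lives on the node
      - name: scripts
        configMap:
          name: journalbeat-configmap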

Our journalbeat config:

journalbeat.yaml
name: ${JOURNALBEAT_NAME}

tags: ${JOURNALBEAT_TAGS}

journalbeat:
  registry_file: /data/registry
  inputs:
    - paths: ["/var/log/journal"]
      seek: cursor
      cursor_seek_fallback: tail

processors:
- add_docker_metadata: ~
- add_cloud_metadata: ~
- add_kubernetes_metadata:
    in_cluster: true
    default_indexers.enabled: false
    default_matchers.enabled: false
    indexers:
      - container:
    matchers:
      - fields:
          lookup_fields: ["container.id"]

output.kafka:
  enabled: true
  # other kafka-related config

logging:
  level: warning
  to_files: false
  to_syslog: false
  json: true

Hi,

I think the journald input might have changed since 7.15, so I would recommend looking at the latest public version, and if it doesn't work, please open an issue for Beats.
