Journalbeat gets stuck for an unknown reason. This only happens on the nodes that produce more logs than the others; it's only around 5 log lines per second, which I don't think is a big load, but still.
Here is the actual behavior:
{"level":"error","timestamp":"2022-01-25T07:06:57.785Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:02.786Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:07.885Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:12.888Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:17.985Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:22.985Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:27.986Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
{"level":"error","timestamp":"2022-01-25T07:07:32.987Z","logger":"input","caller":"input/input.go:185","message":"Error while reading event: failed to iterate journal: bad message"}
Journalbeat logs the "bad message" error every 5s. If we take a closer look at the registry file:
[root@journalbeat-rm25g journalbeat]# cat /data/registry
update_time: 2022-01-24T00:33:46.589642248Z
journal_entries:
- path: /var/log/journal
cursor: s=1205fae92bbb428eb0880e512d90bb5c;i=371e8e7b;b=c7ed156bb4e341dc9abed854f9e3ebe4;m=31426c080a3;t=5d648f8105d7d;x=5b1a9413ae69d0a4
realtime_timestamp: 1642983771364733
monotonic_timestamp: 3385084379299
[root@journalbeat-rm25g journalbeat]# date
Tue Jan 25 07:06:44 UTC 2022
As you can see, Journalbeat is not reading the journal at all anymore. The update_time is not changing; it is just stuck. Also, if I log in to the node via SSH:
$ sudo journalctl --verify
4fd3d98: Unused data (entry_offset==0)
4fd3e08: Unused data (entry_offset==0)
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e745502-0005d662b34e63f6.journal
4fd3eb0: Unused data (entry_offset==0)
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e702d10-0005d662a4fa1fd9.journal
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e75b763-0005d662b7d960cd.journal
4fd3c68: Unused data (entry_offset==0)
4fd3cc0: Unused data (entry_offset==0)
4fd3e30: Unused data (entry_offset==0)
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e7190da-0005d662a9c1d3db.journal
245078: Invalid data object at hash entry 77505 of 145322
File corruption detected at /var/log/journal/ec228772414251599b62b1f5e35bc22f/user-1003.journal:244d10 (of 8388608 bytes, 28%).
FAIL: /var/log/journal/ec228772414251599b62b1f5e35bc22f/user-1003.journal (Bad message)
4fd3e40: Unused data (entry_offset==0)
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e6ecabe-0005d662a0b251e8.journal
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e6d665a-0005d6629b970f32.journal
12018b0: Invalid array offset 3d98958
File corruption detected at /var/log/journal/ec228772414251599b62b1f5e35bc22f/system.journal:3d80028 (of 67108864 bytes, 96%).
FAIL: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system.journal (Bad message)
PASS: /var/log/journal/ec228772414251599b62b1f5e35bc22f/system@1205fae92bbb428eb0880e512d90bb5c-000000003e72f468-0005d662aec3794b.journal
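(For reference, the flagged file can also be re-checked on its own; --file is a standard journalctl option, and the path is taken from the --verify output above.)
$ sudo journalctl --file=/var/log/journal/ec228772414251599b62b1f5e35bc22f/system.journal --verify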
And here is what we get when trying to read from the cursor stored in the registry file:
$ sudo journalctl --cursor "s=1205fae92bbb428eb0880e512d90bb5c;i=371e8e7b;b=c7ed156bb4e341dc9abed854f9e3ebe4;m=31426c080a3;t=5d648f8105d7d;x=5b1a9413ae69d0a4"
-- Logs begin at Tue 2022-01-25 06:59:34 UTC, end at Tue 2022-01-25 07:09:51 UTC. --
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit: PROCTITLE proctitle=2F7573722F62696E2F646F636B657264002D480066643A2F2F002D2D636F6E7461696E6572643D2F72756E2F636F6E7461696E6572642F636F6E7461696E6572642E736F636B
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit[1322]: SYSCALL arch=c000003e syscall=257 success=yes exit=240 a0=ffffffffffffff9c a1=c002701000 a2=80000 a3=0 items=1 ppid=1 pid=1322 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fs>
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit: CWD cwd="/"
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit: PATH item=0 name="/var/lib/docker/image/overlay2/imagedb/content/sha256/106a8e54d5eb3f70fcd1ed46255bdf232b3f169e89e68e13e4e67b25f59c1315" inode=393240 dev=ca:12 mode=0100600 ouid=0 ogid=0 rdev=00:00 >
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit: PROCTITLE proctitle=2F7573722F62696E2F646F636B657264002D480066643A2F2F002D2D636F6E7461696E6572643D2F72756E2F636F6E7461696E6572642F636F6E7461696E6572642E736F636B
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit[1322]: SYSCALL arch=c000003e syscall=257 success=yes exit=240 a0=ffffffffffffff9c a1=c002701300 a2=80000 a3=0 items=1 ppid=1 pid=1322 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fs>
Jan 25 06:59:34 ip-xxx-xxx-xxx-xxx audit: CWD cwd="/"
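For completeness, the same cursor can also be given to journalctl's standard --after-cursor option, to check whether journalctl itself can iterate past it:
$ sudo journalctl --after-cursor "s=1205fae92bbb428eb0880e512d90bb5c;i=371e8e7b;b=c7ed156bb4e341dc9abed854f9e3ebe4;m=31426c080a3;t=5d648f8105d7d;x=5b1a9413ae69d0a4" --no-pager | head -n 20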
We are running Journalbeat version 7.15.0. We haven't tried Filebeat with the journald input yet, because it requires some changes in our workflow (Logstash). Based on this PR:
Remove Journalbeat by kvch · Pull Request #29131 · elastic/beats · GitHub, the logic of how Journalbeat (now Filebeat) reads from journald has not changed, so I don't think switching would change anything.
Also, we can see a fairly large number of errors ("bad message" and similar).
As you can see from the chart, these "gaps" are the periods when Journalbeat is logging these errors every 5s. After a restart it works fine for some time and then breaks again. As an ad-hoc solution we are now monitoring the update_time
in the registry file: if we see that it is not updating, we kill the journalbeat process, Kubernetes recreates the container, and everything works again for a while.
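Roughly, the watchdog looks like the sketch below. The script, the 10-minute threshold, and the use of pkill are our own illustration rather than anything built into Journalbeat; the registry path is the one shown above, and date -d assumes GNU coreutils:

#!/bin/sh
# Illustrative watchdog: restart Journalbeat when the registry stops updating.
REGISTRY=/data/registry
MAX_AGE=600   # seconds without an update_time change before we consider Journalbeat stuck

update_time=$(awk '/^update_time:/ {print $2}' "$REGISTRY")
update_epoch=$(date -d "$update_time" +%s)   # GNU date parses the ISO 8601 timestamp
now_epoch=$(date +%s)

if [ $((now_epoch - update_epoch)) -gt "$MAX_AGE" ]; then
    echo "registry not updated for more than ${MAX_AGE}s, restarting journalbeat" >&2
    # Journalbeat handles SIGTERM and exits; Kubernetes then recreates the container.
    pkill -TERM journalbeat
fi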