Filebeat 6.4.2: Read line error: invalid CRI log format


(Justin Walz) #1

Hi,

I've seen this: https://github.com/elastic/beats/issues/8175, and it looks like the fix was merged into the 6.4.2 release. However, I'm still seeing this error frequently.

2018-10-29T18:58:06.818Z INFO log/harvester.go:251 Harvester started for file: /var/lib/docker/containers/50373e492339ea2c4011da0f76d7993fa73814be9c97db0f79f7b73c8c934820/50373e492339ea2c4011da0f76d7993fa73814be9c97db0f79f7b73c8c934820-json.log
2018-10-29T18:58:06.818Z ERROR log/harvester.go:278 Read line error: invalid CRI log format; File: /var/lib/docker/containers/50373e492339ea2c4011da0f76d7993fa73814be9c97db0f79f7b73c8c934820/50373e492339ea2c4011da0f76d7993fa73814be9c97db0f79f7b73c8c934820-json.log

I have no logs being processed from this file. I've toggled combine_partial and a few multiline settings to try to work around the bug, but no luck. My config is pasted below.

Thank you in advance for the help! Let me know if there is more information I can provide.

Best, Justin

Filebeat config:

logging.level: error
output.logstash.hosts: ["localhost:5044"]

filebeat:
  # autodiscover:
  #   providers:
  #     - type: kubernetes
  #       hints.enabled: true

  config:
    inputs:
      path: ${path.config}/inputs.d/*.yml
      reload.enabled: false
    modules:
      path: ${path.config}/modules.d/*.yml
      reload.enabled: false

processors:
  - add_locale: ~
  - add_cloud_metadata: ~
  - add_docker_metadata:
      host: "unix:///var/run/docker.sock"
  - add_kubernetes_metadata:
      in_cluster: true

inputs.d/*.yml

- type: docker
  combine_partial: true
  containers:
    path: "/var/lib/docker/containers"
    stream: "all"
    ids: "*"

(Jaime Soriano) #2

Hi @justinw,

Indeed it looks like #8175, but that should already be fixed. Have you tried enabling cri.parse_flags? With this option, Filebeat uses CRI flags to handle multiline logs.

Is it happening only with some containers, or with all of them?


(Justin Walz) #3

Hi @jsoriano,

I just tried enabling that, with no luck. It's happening for only two containers, out of the many that we're running. I looked at the log files, and there's nothing (immediately) obvious: they're newline-separated JSON lines, and both the first and last lines are JSON objects. One thing to note is that they're both on the same machine.

The config I'm using is now:

- type: docker
  containers:
    path: "/var/lib/docker/containers"
    stream: "all"
    ids: "*"
  cri.parse_flags: true
  combine_partial: true

Thank you,
Justin


(Jaime Soriano) #4

@justinw do you have any different configuration in the docker daemon of the machine where this issue happens?


(Justin Walz) #5

Hi @jsoriano, No, we use the same configuration for all instance types in this cluster.

Best, Justin


(Jaime Soriano) #6

Could you share some of the log lines of the file that fails to be parsed?


(Justin Walz) #7

Hi @jsoriano,

Here's an example:

{"log":"[ 'local node not active but active cluster node not found' ]\n","stream":"stderr","time":"2018-11-01T14:44:16.250639843Z"}
{"log":"Error: Could not fetch token\n","stream":"stderr","time":"2018-11-01T14:44:16.250697129Z"}
{"log":"    at main (/snapshot/app/lib/get-vault-token-k8.js:0:0)\n","stream":"stderr","time":"2018-11-01T14:44:16.250703425Z"}
{"log":"    at \u003canonymous\u003e\n","stream":"stderr","time":"2018-11-01T14:44:16.250708193Z"}
{"log":"    at process._tickCallback (internal/process/next_tick.js:188:7)\n","stream":"stderr","time":"2018-11-01T14:44:16.250713041Z"}
{"log":"!m g entry.exit 1\n","stream":"stdout","time":"2018-11-01T14:44:16.282976884Z"}
{"log":"!m c entry.error\n","stream":"stdout","time":"2018-11-01T14:44:16.283008729Z"}
2018-11-02T00:02:24.394Z	ERROR	log/harvester.go:278	Read line error: invalid CRI log format; File: /var/lib/docker/containers/c2815f6e09aa106c168d6129fc1034540afafa7cd59590394f9bf7f715524814/c2815f6e09aa106c168d6129fc1034540afafa7cd59590394f9bf7f715524814-json.log

Thanks, Justin


(Justin Walz) #8

I'm also seeing this.

2018-11-02T00:02:24.394Z	ERROR	log/harvester.go:278	Read line error: parsing CRI timestamp: parsing time "" as "2006-01-02T15:04:05Z07:00": cannot parse "" as "2006"; File: /var/lib/docker/containers/cd43bd46f7696f55b32327823b827452d2114756454b8454741589afde0f9552/cd43bd46f7696f55b32327823b827452d2114756454b8454741589afde0f9552-json.log

This container ID belongs to a metricbeat instance. I grepped the file for an empty time string and for "2006" and didn't find anything. The post limit is 7000 characters, so here are some truncated logs from that file:

{"log":"2018-11-01T17:32:16.885Z\u0009INFO\u0009instance/beat.go:544\u0009Home path: [/usr/share/metricbeat] Config path: [/usr/share/metricbeat] Data path: [/usr/share/metricbeat/data] Logs path: [/usr/share/metricbeat/logs]\n","stream":"stderr","time":"2018-11-01T17:32:16.886287439Z"}
{"log":"2018-11-01T17:32:16.889Z\u0009INFO\u0009instance/beat.go:551\u0009Beat UUID: a18f6166-dced-41c7-9ece-072d972c2fc2\n","stream":"stderr","time":"2018-11-01T17:32:16.889452422Z"}
{"log":"2018-11-01T17:32:16.889Z\u0009INFO\u0009[seccomp]\u0009seccomp/seccomp.go:116\u0009Syscall filter successfully installed\n","stream":"stderr","time":"2018-11-01T17:32:16.889893774Z"}
{"log":"2018-11-01T17:32:16.889Z\u0009INFO\u0009[beat]\u0009instance/beat.go:768\u0009Beat info\u0009{\"system_info\": {\"beat\": {\"path\": {\"config\": \"/usr/share/metricbeat\", \"data\": \"/usr/share/metricbeat/data\", \"home\": \"/usr/share/metricbeat\", \"logs\": \"/usr/share/metricbeat/logs\"}, \"type\": \"metricbeat\", \"uuid\": \"a18f6166-dced-41c7-9ece-072d972c2fc2\"}}}\n","stream":"stderr","time":"2018-11-01T17:32:16.889906033Z"}
{"log":"2018-11-01T17:32:16.889Z\u0009INFO\u0009[beat]\u0009instance/beat.go:777\u0009Build info\u0009{\"system_info\": {\"build\": {\"commit\": \"e193f6d68b25b7ddbe3a3ed8d60bc07fea1ef800\", \"libbeat\": \"6.4.2\", \"time\": \"2018-09-26T12:46:47.000Z\", \"version\": \"6.4.2\"}}}\n","stream":"stderr","time":"2018-11-01T17:32:16.889914335Z"}
{"log":"2018-11-01T17:32:16.889Z\u0009INFO\u0009[beat]\u0009instance/beat.go:780\u0009Go runtime info\u0009{\"system_info\": {\"go\": {\"os\":\"linux\",\"arch\":\"amd64\",\"max_procs\":4,\"version\":\"go1.10.3\"}}}\n","stream":"stderr","time":"2018-11-01T17:32:16.889920853Z"}
{"log":"2018-11-01T17:32:16.893Z\u0009INFO\u0009[beat]\u0009instance/beat.go:784\u0009Host info\u0009{\"system_info\": {\"host\": {\"architecture\":\"x86_64\",\"boot_time\":\"2018-09-27T18:37:40Z\",\"containerized\":true,\"hostname\":\"ip-10-88-156-161\",\"ips\":[\"127.0.0.1/8\",\"::1/128\",\"10.88.156.161/19\",\"fe80::4ae:e7ff:fe01:c884/64\",\"172.17.0.1/16\",\"fe80::942e:99ff:fe73:2e82/64\",\"fe80::c8d5:acff:fe50:e20e/64\",\"fe80::e8a7:f3ff:fe79:868a/64\",\"fe80::1c1c:66ff:fe1e:2d36/64\",\"fe80::6459:25ff:feb0:470f/64\",\"fe80::88f5:8aff:feb1:a59/64\",\"fe80::8cf9:11ff:fea8:605f/64\",\"fe80::2456:18ff:fe1d:b5a5/64\",\"fe80::d81d:b4ff:fe8a:f8d9/64\",\"fe80::3ca9:c8ff:fe35:931e/64\",\"10.88.154.253/19\",\"fe80::418:87ff:fe23:d114/64\",\"fe80::b4a4:2aff:fe79:173a/64\",\"fe80::70a2:77ff:fef6:884d/64\"],\"kernel_version\":\"4.4.133-k8s\",\"mac_addresses\":[\"06:ae:e7:01:c8:84\",\"02:42:a4:a1:0a:66\",\"96:2e:99:73:2e:82\",\"ca:d5:ac:50:e2:0e\",\"ea:a7:f3:79:86:8a\",\"1e:1c:66:1e:2d:36\",\"66:59:25:b0:47:0f\",\"8a:f5:8a:b1:0a:59\",\"8e:f9:11:a8:60:5f\",\"26:56:18:1d:b5:a5\",\"da:1d:b4:8a:f8:d9\",\"3e:a9:c8:35:93:1e\",\"06:18:87:23:d1:14\",\"b6:a4:2a:79:17:3a\",\"72:a2:77:f6:88:4d\"],\"os\":{\"family\":\"redhat\",\"platform\":\"centos\",\"name\":\"CentOS Linux\",\"version\":\"7 (Core)\",\"major\":7,\"minor\":5,\"patch\":1804,\"codename\":\"Core\"},\"timezone\":\"UTC\",\"timezone_offset_sec\":0,\"id\":\"7bf1a9675da246fc9e8cfed31ee8e3cc\"}}}\n","stream":"stderr","time":"2018-11-01T17:32:16.893343644Z"}
{"log":"2018-11-01T17:32:16.893Z\u0009INFO\u0009[beat]\u0009instance/beat.go:813\u0009Process info\u0009{\"system_info\": {\"process\": {\"capabilities\": {\"inheritable\":[\"chown\",\"dac_override\",\"fowner\",\"fsetid\",\"kill\",\"setgid\",\"setuid\",\"setpcap\",\"net_bind_service\",\"net_raw\",\"sys_chroot\",\"mknod\",\"audit_write\",\"setfcap\"],\"permitted\":[\"chown\",\"dac_override\",\"fowner\",\"fsetid\",\"kill\",\"setgid\",\"setuid\",\"setpcap\",\"net_bind_service\",\"net_raw\",\"sys_chroot\",\"mknod\",\"audit_write\",\"setfcap\"],\"effective\":[\"chown\",\"dac_override\",\"fowner\",\"fsetid\",\"kill\",\"setgid\",\"setuid\",\"setpcap\",\"net_bind_service\",\"net_raw\",\"sys_chroot\",\"mknod\",\"audit_write\",\"setfcap\"],\"bounding\":[\"chown\",\"dac_override\",\"fowner\",\"fsetid\",\"kill\",\"setgid\",\"setuid\",\"setpcap\",\"net_bind_service\",\"net_raw\",\"sys_chroot\",\"mknod\",\"audit_write\",\"setfcap\"],\"ambient\":null}, \"cwd\": \"/usr/share/metricbeat\", \"exe\": \"/usr/share/metricbeat/metricbeat\", \"name\": \"metricbeat\", \"pid\": 1, \"ppid\": 0, \"seccomp\": {\"mode\":\"filter\"}, \"start_time\": \"2018-11-01T17:32:16.130Z\"}}}\n","stream":"stderr","time":"2018-11-01T17:32:16.893792727Z"}
{"log":"2018-11-01T17:32:16.893Z\u0009INFO\u0009instance/beat.go:273\u0009Setup Beat: metricbeat; Version: 6.4.2\n","stream":"stderr","time":"2018-11-01T17:32:16.893820777Z"}

... truncated ...

{"log":"2018-11-02T01:05:16.904Z\u0009INFO\u0009[monitoring]\u0009log/log.go:141\u0009Non-zero metrics in the last 30s\u0009{\"monitoring\": {\"metrics\": {\"beat\":{\"cpu\":{\"system\":{\"ticks\":239280,\"time\":{\"ms\":336}},\"total\":{\"ticks\":816290,\"time\":{\"ms\":1324},\"value\":816290},\"user\":{\"ticks\":577010,\"time\":{\"ms\":988}}},\"info\":{\"ephemeral_id\":\"a09b5a5f-410d-4389-af9b-332055f784eb\",\"uptime\":{\"ms\":27180168}},\"memstats\":{\"gc_next\":27698288,\"memory_alloc\":15923896,\"memory_total\":94640244584}},\"libbeat\":{\"config\":{\"module\":{\"running\":0}},\"output\":{\"events\":{\"acked\":2998,\"batches\":61,\"total\":2998},\"read\":{\"bytes\":55578},\"write\":{\"bytes\":2783377}},\"pipeline\":{\"clients\":8,\"events\":{\"active\":0,\"published\":2998,\"total\":2998},\"queue\":{\"acked\":2998}}},\"metricbeat\":{\"kubernetes\":{\"container\":{\"events\":66,\"success\":66},\"node\":{\"events\":3,\"success\":3},\"pod\":{\"events\":48,\"success\":48},\"system\":{\"events\":9,\"success\":9},\"volume\":{\"events\":63,\"success\":63}},\"rabbitmq\":{\"connection\":{\"events\":1053,\"success\":1053},\"exchange\":{\"events\":267,\"success\":267},\"node\":{\"events\":3,\"success\":3},\"queue\":{\"events\":1401,\"success\":1401}},\"system\":{\"cpu\":{\"events\":3,\"success\":3},\"load\":{\"events\":3,\"success\":3},\"memory\":{\"events\":3,\"success\":3},\"network\":{\"events\":51,\"success\":51},\"process\":{\"events\":22,\"success\":22},\"process_summary\":{\"events\":3,\"success\":3}}},\"system\":{\"load\":{\"1\":2.2,\"15\":1.8,\"5\":2.24,\"norm\":{\"1\":0.55,\"15\":0.45,\"5\":0.56}}}}}}\n","stream":"stderr","time":"2018-11-02T01:05:16.905054047Z"}




(Jaime Soriano) #9

All your logs actually look correct :thinking: but it seems that filebeat treats them as CRI logs when they are JSON logs, which looks similar to the previous issue.

How frequently does it happen? Could you check whether the workaround mentioned in #8175 works for you? It consists of adding a large close_inactive, like close_inactive: 48h, to your configuration.


(Justin Walz) #10

I'll try that, thanks. Will follow up after letting it run for a while.

Justin


(Justin Walz) #11

Hi @jsoriano,

Sadly, no luck. I saw similar errors on a new file, less than 48h after I applied the change. I've pasted the log contents below for reference (I found an example with only one line in the log file).

In an attempt to be helpful, here are some additional details:

The container that generated the logs is the Kubernetes pod sandbox container (image k8s.gcr.io/pause-amd64:3.0).

On this same node, other sandbox containers with the same log contents successfully made it through filebeat/logstash, and I can see them in Kibana with the expected beat.hostname. This specific log did not (i.e. filtering on docker.container.image and beat.hostname alone shows plenty of entries; adding this docker.container.id shows nothing).

It seems like the filebeat instance will keep trying to parse this file: I see the filebeat stderr message every 10 seconds, which aligns with the scan frequency.

Filebeat Error

2018-11-04T01:57:54.196Z	ERROR	log/harvester.go:278	Read line error: invalid CRI log format; File: /var/lib/docker/containers/96b0b20480eca645825772bf24b721395109f482ba54b97c953bfcd5c870f564/96b0b20480eca645825772bf24b721395109f482ba54b97c953bfcd5c870f564-json.log

Log Contents

user@ip-XX-XX-XX-XX# cat /var/lib/docker/containers/96b0b20480eca645825772bf24b721395109f482ba54b97c953bfcd5c870f564/96b0b20480eca645825772bf24b721395109f482ba54b97c953bfcd5c870f564-json.log
{"log":"shutting down, got signal: Terminated\n","stream":"stderr","time":"2018-11-01T14:36:26.561189222Z"}

user@ip-XX-XX-XX-XX# docker logs 96b0b20480eca645825772bf24b721395109f482ba54b97c953bfcd5c870f564
shutting down, got signal: Terminated

Current config

- type: docker
  containers:
    path: "/var/lib/docker/containers"
    stream: "all"
    ids: "*"
  cri.parse_flags: true
  combine_partial: true
  close_inactive: 48h

Best, Justin


(Jaime Soriano) #12

Hi @justinw,

You mentioned at some point that all failures happened on the same machine; is this still the case?

If it is I think we should focus on this host, there must be something different on it.

These errors seem to point to some kind of corruption of the filebeat registry, which would make it continue reading files from incorrect positions. This could happen for existing files if you upgraded from a version with a bug (like the one that should be fixed in 6.4.2), but it shouldn't happen with new files. It could also happen if you have multiple filebeat instances on the same host sharing the same data directory; I wonder if that could be the case here.

If it is only happening on one host, you could also try to completely reinstall filebeat with a clean registry (the data directory defaults to /var/lib/filebeat).


(Justin Walz) #13

Hi @jsoriano,

That could be the case. I cleared the directory this morning and will let it continue to run, but so far, no errors. I'll circle back after some more time.

Thanks,
Justin


(Raman Gupta) #14

Just chiming in to say I had the same issue on two of my Kubernetes nodes, also using Filebeat 6.4.2 (and the problem continued after upgrading to 6.5.1). I noticed it after some expected logs were not being found.

For some reason, filebeat was parsing particular JSON logs from a docker container as CRI logs, failing with the same error as reported by @justinw. Clearing filebeat's registry data fixed the problem. I will monitor to see if this happens again.


(Justin Walz) #15

Clearing the directory fixed the problem on my end too.

Thanks for the help @jsoriano


(Justin Walz) #16

@jsoriano Is there a way to reset the registry for a single watched file? I don't want to wipe the whole registry, since, as I understand it, that would force all logs to be reprocessed.


(Jaime Soriano) #17

@justinw there is no supported way of resetting a single file :frowning: and yes, if the whole registry is removed, all logs are reprocessed.

But the registry is a JSON file, so you can try to edit it with filebeat stopped and remove the entry for the file that you want to reset. An entry for a file looks like this:

{"source":"/var/log/syslog","offset":0,"timestamp":"2018-12-05T09:59:38.247899031+01:00","ttl":-1,"type":"log","meta":null,"FileStateOS":{"inode":2490412,"device":64770}}

If you manually remove an entry, I'd recommend making a backup of the registry file first, and validating that the file is still valid JSON before restarting filebeat.
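As a rough sketch of that edit, here is a small Go program that drops the entry for one source path. It assumes the 6.x registry layout (a single JSON array of file state objects, as in the example above); the sample data and source path are illustrative, and the real registry should only be touched with filebeat stopped and a backup in place:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// pruneRegistry removes the entry for one source path from registry JSON.
// Assumes the 6.x registry format: a single JSON array of file states.
func pruneRegistry(data []byte, source string) ([]byte, error) {
	var entries []map[string]interface{}
	if err := json.Unmarshal(data, &entries); err != nil {
		return nil, err // invalid JSON: refuse to produce any output
	}
	kept := entries[:0]
	for _, e := range entries {
		if e["source"] != source {
			kept = append(kept, e)
		}
	}
	return json.Marshal(kept)
}

func main() {
	// In practice you would read /var/lib/filebeat/registry here (with
	// filebeat stopped and a backup made) instead of this inline sample.
	sample := []byte(`[{"source":"/var/log/syslog","offset":0},` +
		`{"source":"/var/log/auth.log","offset":10}]`)
	out, err := pruneRegistry(sample, "/var/log/syslog")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out)) // only the /var/log/auth.log entry remains
}
```

Because the output only gets written after a successful round-trip through the JSON encoder, an editing mistake surfaces as an error instead of a corrupted registry.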


(Justin Walz) #18

@jsoriano Makes sense, thanks.