Read line error: invalid CRI log; Filebeat halts processing logs

Some log files trigger the following error message, which stops the log file from being processed:

filebeat-vng6q filebeat 2018-08-28T07:32:54.968Z ERROR log/harvester.go:275 Read line error: invalid CRI log; File: /var/lib/docker/containers/fa4b520c2374b1830eb46d3455f0a660acea4f735a68238408142709d4b830f2/fa4b520c2374b1830eb46d3455f0a660acea4f735a68238408142709d4b830f2-json.log; Message length: 14; Content: .743936838Z"}

I've modified Filebeat to add extra logging to find which line in the log is causing the issue; here's the line:

{"log":"2018-08-24 09:12:48.743 WARN 1 --- [ handlers-19] .s.o.c.t.m.ManifestForceCacheRefreshTask : [] All refreshes succeeded, but not all have been processed yet...\n","stream":"stdout","time":"2018-08-24T09:12:48.743936838Z"}

It looks like the line isn't appended to the previous message and it's treated as a new message.

My current configuration reads Docker streams with multiline configured to capture Java stack traces.
I've tried to troubleshoot this issue, but getting familiar with the way Docker logs are processed is taking more time than I have.

Which filebeat version?

Can you share your configuration?

Is there a chance of multiple prospectors trying to read the same file?

@steffens I've just upgraded to version 6.4.0; previously I was using 6.3.2 and didn't have this issue. On the other hand, in 6.3.2 I had an issue where log lines longer than 16 kB were split up (due to Docker's json-file limits). 6.4.0 fixed that, and it looks like the CRI issue originates from the same code block.
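
For context, Docker's json-file driver writes a line longer than roughly 16 kB as several consecutive JSON entries, and only the last chunk's log value ends with a newline. My understanding of the new partial-combining behaviour is roughly the sketch below (illustrative only, not the actual Filebeat code; the names are mine):

package main

// Rough sketch of combining partial docker json-file entries. Docker splits a
// long line at roughly 16 kB and only the final chunk's "log" value ends with
// "\n", so consecutive chunks are concatenated until that newline appears.
// This is not the actual Filebeat implementation.

import (
	"encoding/json"
	"fmt"
	"strings"
)

type entry struct {
	Log    string `json:"log"`
	Stream string `json:"stream"`
	Time   string `json:"time"`
}

func combinePartials(lines []string) ([]string, error) {
	var out []string
	var buf strings.Builder
	for _, raw := range lines {
		var e entry
		if err := json.Unmarshal([]byte(raw), &e); err != nil {
			return nil, err
		}
		buf.WriteString(e.Log)
		if strings.HasSuffix(e.Log, "\n") { // final chunk of the original line
			out = append(out, buf.String())
			buf.Reset()
		}
	}
	return out, nil
}

func main() {
	chunks := []string{
		`{"log":"first ~16 kB of a very long line ...","stream":"stdout","time":"2018-08-24T09:12:48.743936838Z"}`,
		`{"log":"... rest of the same line\n","stream":"stdout","time":"2018-08-24T09:12:48.743936839Z"}`,
	}
	msgs, err := combinePartials(chunks)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d combined message(s)\n", len(msgs))
}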

I don't think it's multiple prospectors reading the same file.

Here's the config:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      in_cluster: true
      hints.enabled: true
      include_annotations: '*'

processors:
  - add_cloud_metadata: ~
  - add_kubernetes_metadata:
      in_cluster: true
  - add_docker_metadata:

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]

http.enabled: true
http.port: 5066

Your log is clearly JSON; the CRI parser should not have dealt with this line.

You wrote:

It looks like the line isn't appended to the previous message and it's treated as a new message.

Can you share some more logs before/after the failing line?

@steffens Yes, they are JSON. Here are 7 lines before and after the failing line:

{"log":"2018-08-24 09:12:48.550  INFO 1 --- [    scheduler-2] com.netflix.spinnaker.q.QueueProcessor   : [] Received message RunTask(executionType=pipeline, executionId=01CNNK9V7CEVJQYHANR5HHWXSW, application=testk8sv1, stageId=01CNNK9V7GG7DEE2K5GY3YAN1A, taskId=4, taskType=class com.netflix.spinnaker.orca.clouddriver.tasks.manifest.ManifestForceCacheRefreshTask)\n","stream":"stdout","time":"2018-08-24T09:12:48.550658419Z"}
{"log":"2018-08-24 09:12:48.566  WARN 1 --- [    handlers-20] .s.o.c.t.m.ManifestForceCacheRefreshTask : [] All refreshes succeeded, but not all have been processed yet...\n","stream":"stdout","time":"2018-08-24T09:12:48.567073931Z"}
{"log":"2018-08-24 09:12:48.567  INFO 1 --- [    handlers-20] c.n.s.o.c.CloudDriverCacheStatusService  : [] ---\u003e HTTP GET http://spin-clouddriver.spinnaker:7002/cache/kubernetes/manifest\n","stream":"stdout","time":"2018-08-24T09:12:48.567171264Z"}
{"log":"2018-08-24 09:12:48.612  INFO 1 --- [    scheduler-1] com.netflix.spinnaker.q.QueueProcessor   : [] Received message RunTask(executionType=pipeline, executionId=01CNNK9SVNWPTJ8R2HHHRR3J7P, application=testk8sv1, stageId=01CNNK9SVVTQESZ7N6PRVCEGAG, taskId=4, taskType=class com.netflix.spinnaker.orca.clouddriver.tasks.manifest.ManifestForceCacheRefreshTask)\n","stream":"stdout","time":"2018-08-24T09:12:48.61261451Z"}
{"log":"2018-08-24 09:12:48.627  WARN 1 --- [    handlers-13] .s.o.c.t.m.ManifestForceCacheRefreshTask : [] All refreshes succeeded, but not all have been processed yet...\n","stream":"stdout","time":"2018-08-24T09:12:48.628005538Z"}
{"log":"2018-08-24 09:12:48.628  INFO 1 --- [    handlers-13] c.n.s.o.c.CloudDriverCacheStatusService  : [] ---\u003e HTTP GET http://spin-clouddriver.spinnaker:7002/cache/kubernetes/manifest\n","stream":"stdout","time":"2018-08-24T09:12:48.628132832Z"}
{"log":"2018-08-24 09:12:48.730  INFO 1 --- [    scheduler-7] com.netflix.spinnaker.q.QueueProcessor   : [] Received message RunTask(executionType=pipeline, executionId=01CNNK9RKRDZW46AB6761NDKJM, application=testk8sv1, stageId=01CNNK9RKW5SBG3GBQK98HMY0H, taskId=4, taskType=class com.netflix.spinnaker.orca.clouddriver.tasks.manifest.ManifestForceCacheRefreshTask)\n","stream":"stdout","time":"2018-08-24T09:12:48.730455885Z"}
{"log":"2018-08-24 09:12:48.743  WARN 1 --- [    handlers-19] .s.o.c.t.m.ManifestForceCacheRefreshTask : [] All refreshes succeeded, but not all have been processed yet...\n","stream":"stdout","time":"2018-08-24T09:12:48.743936838Z"}
{"log":"2018-08-24 09:12:48.743  INFO 1 --- [    handlers-19] c.n.s.o.c.CloudDriverCacheStatusService  : [] ---\u003e HTTP GET http://spin-clouddriver.spinnaker:7002/cache/kubernetes/manifest\n","stream":"stdout","time":"2018-08-24T09:12:48.743995522Z"}
{"log":"2018-08-24 09:12:48.761  INFO 1 --- [      events-17] c.n.spinnaker.orca.echo.EchoService      : [anonymous] \u003c--- HTTP 200 http://spin-echo.spinnaker:8089/ (2681ms)\n","stream":"stdout","time":"2018-08-24T09:12:48.762000295Z"}
{"log":"2018-08-24 09:12:48.788  INFO 1 --- [      events-17] c.n.spinnaker.orca.echo.EchoService      : [anonymous] ---\u003e HTTP POST http://spin-echo.spinnaker:8089/\n","stream":"stdout","time":"2018-08-24T09:12:48.788628255Z"}
{"log":"2018-08-24 09:12:48.968  INFO 1 --- [       events-2] c.n.spinnaker.orca.echo.EchoService      : [anonymous] \u003c--- HTTP 200 http://spin-echo.spinnaker:8089/ (2180ms)\n","stream":"stdout","time":"2018-08-24T09:12:48.968569285Z"}
{"log":"2018-08-24 09:12:48.968  INFO 1 --- [      events-20] c.n.spinnaker.orca.echo.EchoService      : [anonymous] \u003c--- HTTP 200 http://spin-echo.spinnaker:8089/ (2191ms)\n","stream":"stdout","time":"2018-08-24T09:12:48.968634627Z"}
{"log":"2018-08-24 09:12:48.968  INFO 1 --- [      events-16] c.n.spinnaker.orca.echo.EchoService      : [anonymous] \u003c--- HTTP 200 http://spin-echo.spinnaker:8089/ (7690ms)\n","stream":"stdout","time":"2018-08-24T09:12:48.968718646Z"}
{"log":"2018-08-24 09:12:48.988  INFO 1 --- [      events-20] c.n.spinnaker.orca.echo.EchoService      : [anonymous] ---\u003e HTTP POST http://spin-echo.spinnaker:8089/\n","stream":"stdout","time":"2018-08-24T09:12:48.988594207Z"}

The way I read the code, for some reason the line hasn't been read to the end in the previous loop iteration, so there's a leftover in the current one. Because the if statement only checks whether the string starts with '{', the partial leftover isn't treated as JSON and is instead handed to the CRI parser. The question is: why isn't the line read to the end in the previous loop iteration?
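
Roughly, the detection step as I read it behaves like the sketch below (illustrative only, not the actual Filebeat source; the function and type names are mine): a line starting with '{' is decoded as Docker JSON, anything else falls through to the CRI path, so a leftover fragment like .743936838Z"} ends up there and fails.

package main

// Illustrative sketch only -- not the actual Filebeat source. It mimics the
// detection step as I read it: a buffered line starting with '{' is decoded
// as Docker JSON; anything else is handed to the CRI parser, which rejects
// a partial JSON tail.

import (
	"encoding/json"
	"fmt"
)

type dockerJSONLine struct {
	Log    string `json:"log"`
	Stream string `json:"stream"`
	Time   string `json:"time"`
}

func parseLine(line []byte) (string, error) {
	if len(line) > 0 && line[0] == '{' {
		var entry dockerJSONLine
		if err := json.Unmarshal(line, &entry); err != nil {
			return "", fmt.Errorf("decoding docker JSON: %v", err)
		}
		return entry.Log, nil
	}
	// CRI branch: a real CRI line starts with "<timestamp> <stream> ...";
	// a partial JSON tail does not, which is where "invalid CRI log" comes from.
	return "", fmt.Errorf("invalid CRI log; Content: %s", line)
}

func main() {
	full := []byte(`{"log":"All refreshes succeeded...\n","stream":"stdout","time":"2018-08-24T09:12:48.743936838Z"}`)
	leftover := []byte(`.743936838Z"}`) // partial tail left over from the previous read

	if msg, err := parseLine(full); err == nil {
		fmt.Printf("full line parsed as docker JSON: %q\n", msg)
	}
	if _, err := parseLine(leftover); err != nil {
		fmt.Println("leftover misclassified:", err)
	}
}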

Yeah, I read the code the same.

Can we have a look at the registry file? It stores the offset of the next line to be read. I wonder if the offset in the registry file is off, and where exactly in the line it points.

I am testing filebeat 6.4.0 with the type: docker option:

- type: docker
  combine_partial: true
  containers:
    ids: 'a33*'
    path: "/var/lib/docker/containers"
  fields_under_root: true
  processors:
    - add_docker_metadata: ~

I only read one container (a33*). If I delete the registry file and start Filebeat, I do not get any errors.
The registry looks like this while Filebeat is running:

[{"source":"/var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log","offset":9154026,"timestamp":"2018-08-30T17:41:17.506939688Z","ttl":-1,"type":"docker","meta":null,"FileStateOS":{"inode":102280769,"device":2049}}]

After stopping filebeat it looks like this:

[{"source":"/var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log","offset":9297058,"timestamp":"2018-08-30T17:41:21.471968364Z","ttl":-1,"type":"docker","meta":null,"FileStateOS":{"inode":102280769,"device":2049}}]

Apart from the offset having advanced, I only see a difference in the timestamp. Now, if I start Filebeat again with the registry file in place, I get a "Read line error: invalid CRI log" error:

2018-08-30T17:42:01.215Z DEBUG [input] input/input.go:152 Run input
2018-08-30T17:42:01.215Z DEBUG [input] log/input.go:174 Start next scan
2018-08-30T17:42:01.215Z DEBUG [input] log/input.go:404 Check file for harvesting: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log
2018-08-30T17:42:01.215Z DEBUG [input] log/input.go:494 Update existing file for harvesting: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log, offset: 9297058
2018-08-30T17:42:01.215Z DEBUG [input] log/input.go:503 Resuming harvesting of file: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log, offset: 9297058, new size: 9869381
2018-08-30T17:42:01.215Z DEBUG [harvester] log/harvester.go:481 Set previous offset for file: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log. Offset: 9297058
2018-08-30T17:42:01.215Z DEBUG [harvester] log/harvester.go:472 Setting offset for file: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log. Offset: 9297058
2018-08-30T17:42:01.215Z DEBUG [harvester] log/harvester.go:387 Update state: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log, offset: 9297058
2018-08-30T17:42:01.215Z DEBUG [input] log/input.go:195 input states cleaned up. Before: 1, After: 1, Pending: 0
2018-08-30T17:42:01.215Z INFO log/harvester.go:251 Harvester started for file: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log
2018-08-30T17:42:01.216Z ERROR log/harvester.go:278 Read line error: invalid CRI log; File: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log
2018-08-30T17:42:01.216Z DEBUG [harvester] log/harvester.go:507 Stopping harvester for file: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log
2018-08-30T17:42:01.216Z DEBUG [harvester] log/harvester.go:517 Closing file: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log
2018-08-30T17:42:01.216Z DEBUG [harvester] log/harvester.go:387 Update state: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log, offset: 9297058
2018-08-30T17:42:01.216Z DEBUG [harvester] log/harvester.go:528 harvester cleanup finished for file: /var/lib/docker/containers/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e/a33ceec43dfd951e87fec2b7c3d9381cc8678a167df644f6add4b88944af905e-json.log

If I stop Filebeat, delete the registry file, and then start Filebeat again, I don't see any CRI errors for hours.
I repeated this several times, and my conclusion is that the CRI error only appears when there is a registry file present on startup.

@steffens here's the registry entry for this particular file:

{
  "source": "/var/lib/docker/containers/fa4b520c2374b1830eb46d3455f0a660acea4f735a68238408142709d4b830f2/fa4b520c2374b1830eb46d3455f0a660acea4f735a68238408142709d4b830f2-json.log",
  "offset": 18648000,
  "timestamp": "2018-08-31T03:00:41.543776889Z",
  "ttl": -1,
  "type": "docker",
  "meta": null,
  "FileStateOS": {
    "inode": 5233642,
    "device": 66311
  }
},

Hi everyone,

Thank you for all the feedback; this should help us fix the issue. By the way, there is a public issue here: https://github.com/elastic/beats/issues/8175, as the bug seems confirmed.

@Jarek_Miszkinis, I'm wondering: do you see any other occurrence of the /var/lib/docker/containers/fa4b520c2374b1830eb46d3455f0a660acea4f735a68238408142709d4b830f2/fa4b520c2374b1830eb46d3455f0a660acea4f735a68238408142709d4b830f2-json.log file in your registry?

Best regards

I have another question for everyone: do you see this behavior if you disable combine_partials? This is a new feature introduced in 6.4, so it could potentially be one of the causes of this issue.

An update here: it seems the root cause has been found, and we are working on the fix: https://github.com/elastic/beats/issues/8175
