Filebeat / Logstash truncating JSON log lines

I'm using Logstash 5.2.2, Filebeat 5.2.2, Logstash-logback-encoder 4.7

I'm using net.logstash.logback.encoder.LogstashEncoder to write our application logs in JSON format. The logs themselves look fine.
Then I'm using ch.qos.logback.core.rolling.RollingFileAppender to roll logs with TimeBasedRollingPolicy writing to a new log each hour. There is no moving/renaming of logs, or zipping of old logs - each hour gets a completely fresh log file.

Because I am writing the logs in pure JSON, I would expect parsing of those logs to be 100% reliable - there is no need to worry about grok patterns or multi-line handling. Filebeat just needs to take each single line of JSON from the log and forward it through Logstash into Elasticsearch.
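
For context, the Logstash side of this setup is essentially just a json filter over the Filebeat message field. A minimal sketch (not my exact pipeline - the port, hosts and options are illustrative):

input {
  beats {
    port => 5044                  # illustrative port
  }
}

filter {
  json {
    # each log line is already a complete JSON document
    source => "message"
    # on a parse failure the json filter keeps the raw message and tags the
    # event with "_jsonparsefailure" (its default), which is exactly the tag
    # seen in the document below
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]   # illustrative
  }
}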

But occasionally (maybe 100 times out of 5,000,000 lines of logs) I am seeing things like this:

    {
        "_index": "unmatched_raw_logs-2017.03.27",
        "_type": "FOO_Bear_Processor",
        "_id": "AVsNtOkNKJa3B0FS-Cpg",
        "_score": null,
        "_source": {
            "@timestamp": "2017-03-27T03:00:08.665Z",
            "module": "FOO_Bear_Processor",
            "@version": "1",
            "beat": {
                "hostname": "xxxxxxxxxxxxxx",
                "name": "internal_tomcat_beat",
                "version": "5.2.2"
            },
            "source": "/FOO/logs/stash/FOO-bear-processor/FOO-bear-processor.log.2017-03-27_03",
            "message": "el.breadcrumbId\":\"ID-xxxxxxxxxxxxxx-xxxxxxxx-xxxxx-com-43890-1490176523958-0-1360\",\"camel.routeId\":\"load-sheep-from-remote-system-route\"}",
            "type": "FOO_Bear_Processor",
            "tags": [
                "_jsonparsefailure"
            ]
        },
        "fields": {
            "@timestamp": [
                1490583608665
            ]
        },
        "sort": [
            1490583608665
        ]
    }

Here, "el.breadcrumbId" is part of "camel.breadcrumbId" - the first characters of the line have been chopped off halfway through the word "camel", making the remaining characters become invalid JSON.

In this instance, the true original log line was approximately 6100 characters long, and only the last 100 or so characters made it into Elasticsearch. (I don't know whether the first 6000 characters were indexed as a separate message or simply dropped.)

The log in question (FOO-bear-processor.log.2017-03-27_03) was only 10 lines long - only ten lines of log were written during that hour.

Could you share your logging config? I think I have seen similar problems in the past and it was related to log truncation.

Can you also share your Filebeat config and, if possible, some Filebeat log lines from around the time this happened?

<appender name="STASH" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
    </encoder>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <fileNamePattern>/efm/logs/stash/foo-bear-processor/foo-bear-processor.log.%d{yyyy-MM-dd_HH}</fileNamePattern>
        <maxHistory>50</maxHistory>
    </rollingPolicy>
</appender>

--

  - input_type: log
    document_type: "FOO_Bear_Processor"
    ignore_older: 48h
    close_inactive: 24h
    paths:
      - /efm/logs/stash/foo-bear-processor/*
    tags: ["json_encoded_message"]

--

2017-03-27T02:59:52Z INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=30 libbeat.logstash.call_count.PublishEvents=5 libbeat.logstash.publish.write_bytes=18532 libbeat.publisher.published_events=162 publish.events=162 libbeat.logstash.published_and_acked_events=162 registrar.states.update=162 registrar.writes=5
2017-03-27T03:00:05Z INFO File is inactive: /efm/logs/stash/z/z.log.2017-03-26_02. Closing because close_inactive of 24h0m0s reached.
2017-03-27T03:00:08Z INFO Harvester started for file: /efm/logs/stash/z/z.log.2017-03-27_03
2017-03-27T03:00:08Z INFO Harvester started for file: /efm/logs/stash/foo-bear-processor/foo-bear-processor.log.2017-03-27_03
2017-03-27T03:00:10Z INFO File is inactive: /efm/logs/stash/foo-bear-processor/foo-bear-processor.log.2017-03-26_03. Closing because close_inactive of 24h0m0s reached.
2017-03-27T03:00:22Z INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=669 libbeat.publisher.published_events=669 libbeat.logstash.call_count.PublishEvents=6 filebeat.harvester.started=2 libbeat.logstash.publish.read_bytes=36 registrar.writes=6 publish.events=673 registrar.states.update=673 filebeat.harvester.closed=2 filebeat.prospector.log.files.truncated=1 libbeat.logstash.publish.write_bytes=49755
2017-03-27T03:00:51Z INFO Harvester started for file: /efm/logs/stash/a/a.log.2017-03-27_03
2017-03-27T03:00:52Z INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=36 libbeat.logstash.publish.write_bytes=19864 filebeat.harvester.running=1 registrar.states.update=156 libbeat.publisher.published_events=156 filebeat.harvester.open_files=1 filebeat.harvester.started=1 publish.events=156 filebeat.prospector.log.files.truncated=1 libbeat.logstash.call_count.PublishEvents=6 libbeat.logstash.published_and_acked_events=156 registrar.writes=6
2017-03-27T03:01:01Z INFO Harvester started for file: /efm/logs/stash/y/y.log.2017-03-27_03
2017-03-27T03:01:22Z INFO Non-zero metrics in the last 30s: filebeat.prospector.log.files.truncated=1 libbeat.logstash.publish.write_bytes=48088 libbeat.logstash.publish.read_bytes=30 filebeat.harvester.open_files=1 publish.events=548 libbeat.logstash.call_count.PublishEvents=5 libbeat.publisher.published_events=546 filebeat.harvester.running=1 registrar.writes=5 libbeat.logstash.published_and_acked_events=546 filebeat.harvester.started=1 registrar.states.update=548
2017-03-27T03:01:52Z INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=52334 libbeat.publisher.published_events=699 registrar.states.update=699 registrar.writes=6 libbeat.logstash.call_count.PublishEvents=6 libbeat.logstash.published_and_acked_events=699 publish.events=699 libbeat.logstash.publish.read_bytes=36
2017-03-27T03:02:06Z INFO Harvester started for file: /efm/logs/stash/x/x.log.2017-03-27_03
2017-03-27T03:02:22Z INFO Non-zero metrics in the last 30s: registrar.states.update=109 filebeat.harvester.open_files=1 libbeat.logstash.call_count.PublishEvents=4 filebeat.harvester.running=1 filebeat.prospector.log.files.truncated=1 publish.events=109 libbeat.logstash.published_and_acked_events=108 libbeat.logstash.publish.read_bytes=24 registrar.writes=4 libbeat.publisher.published_events=108 filebeat.harvester.started=1 libbeat.logstash.publish.write_bytes=14277
2017-03-27T03:02:52Z INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=89854 libbeat.publisher.published_events=1286 registrar.states.update=1286 libbeat.logstash.call_count.PublishEvents=6 publish.events=1286 libbeat.logstash.published_and_acked_events=1286 libbeat.logstash.publish.read_bytes=36 registrar.writes=6
2017-03-27T03:03:22Z INFO Non-zero metrics in the last 30s: registrar.states.update=141 libbeat.publisher.published_events=141 libbeat.logstash.publish.write_bytes=18741 libbeat.logstash.published_and_acked_events=141 libbeat.logstash.call_count.PublishEvents=6 publish.events=141 libbeat.logstash.publish.read_bytes=36 registrar.writes=6
2017-03-27T03:03:52Z INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=36 libbeat.publisher.published_events=662 libbeat.logstash.call_count.PublishEvents=6 libbeat.logstash.publish.write_bytes=47352 publish.events=662 registrar.states.update=662 registrar.writes=6 libbeat.logstash.published_and_acked_events=662
2017-03-27T03:04:22Z INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=133 libbeat.logstash.publish.read_bytes=30 registrar.writes=5 libbeat.logstash.published_and_acked_events=133 publish.events=133 registrar.states.update=133 libbeat.logstash.call_count.PublishEvents=5 libbeat.logstash.publish.write_bytes=14602
2017-03-27T03:04:52Z INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=911 libbeat.logstash.publish.write_bytes=67150 libbeat.publisher.published_events=911 registrar.writes=6 publish.events=911 libbeat.logstash.call_count.PublishEvents=6 libbeat.logstash.publish.read_bytes=36 registrar.states.update=911
2017-03-27T03:05:22Z INFO Non-zero metrics in the last 30s: registrar.states.update=652 libbeat.logstash.publish.write_bytes=47227 libbeat.publisher.published_events=652 registrar.writes=4 libbeat.logstash.call_count.PublishEvents=4 libbeat.logstash.publish.read_bytes=24 libbeat.logstash.published_and_acked_events=652 publish.events=652
2017-03-27T03:05:52Z INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=6 libbeat.logstash.published_and_acked_events=196 registrar.states.update=196 publish.events=196 libbeat.logstash.publish.read_bytes=36 libbeat.publisher.published_events=196 libbeat.logstash.publish.write_bytes=22767 registrar.writes=6
2017-03-27T03:06:22Z INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=601 registrar.states.update=601 libbeat.logstash.call_count.PublishEvents=5 publish.events=601 libbeat.logstash.publish.read_bytes=30 registrar.writes=5 libbeat.logstash.published_and_acked_events=601 libbeat.logstash.publish.write_bytes=43246

Nothing suspicious in the logs. Is the disk you are using a local disk or a shared drive?

Local disk (ext3)

Ok, that also seems to be a dead end :frowning:

Can you share your full filebeat config?

Yesterday I ran into this problem - this exact same problem - and I probably have nearly the same Logback and rolling-policy config as the original poster. I've seen this problem reported in the forum several times, so I hope this helps.

I have Filebeat 5.2.2, but this problem will appear with any version of Filebeat, because it isn't really a Filebeat problem. The problem is that Filebeat and your log rolling policy are not synchronized and don't communicate. It's a concurrency problem.

It's a nearly impossible situation to reproduce, but I analyzed it, I now understand it, and I have a solution for you.

This is the scenario:

  • For this example, /tracesPath/PAYMENTS.log starts out as an empty file.
  1. At 15:50, say, your application writes a log line to /tracesPath/PAYMENTS.log. The line is 650 bytes long.
  2. Filebeat scans the file, ships the line to Logstash (for example), and saves an offset of 650 bytes for that file so it knows where to resume reading on the next scan.
  3. It's now 16:30:00 and Filebeat scans the file again. Filebeat runs whatever checks it has to convince itself this is the same file it remembers: the path, the descriptor, and so on.
  4. Your application's logger now gets the CPU to write another line to /tracesPath/PAYMENTS.log. It's 16:30, so the logger decides it must roll the file (in my case a TimeBasedRollingPolicy configuration rolls files every hour). It takes the file, compresses it to .gz for example, stores the .gz under /whateverBackupTracesPath/, then recreates /tracesPath/PAYMENTS.log and writes the new line, which is 800 bytes long.
  5. Now Filebeat gets the CPU and checks whether anything new was written. It seeks to offset 650 and checks for EOF. It isn't EOF, so it reads up to EOF.
  6. Here is exactly the problem: it sends Logstash the bytes from offset 650 to 800 of the new line, which is malformed JSON. In my case you can see this in the filebeat.log traces in debug mode:

2017-04-17T16:15:40+02:00 INFO Harvester started for file: /tracesPath/PAYMENTS.log
2017-04-17T16:15:44+02:00 DBG Publish: {
"@timestamp": "2017-04-17T14:15:40.890Z",
"beat": {
"hostname": "XXX",
"name": "XXX",
"version": "5.2.2"
},
"environment": "production",
"input_type": "log",
"message": "ket":{"code":"ES","country":"ES","currency":"EUR","iata":"1234","status":"ACTIVE"},"locale":"es_ES","version":"v3_0"},"seg":1,"ber":1,"nr":24,"cn":"E"}},"exception":""}",
"offset": 800,
"source": "/tracesPath/PAYMENTS.log",
"type": "business"
}
2017-04-17T16:15:44+02:00 DBG output worker: publish 1 events
2017-04-17T16:15:44+02:00 ERR Failed to publish events caused by: EOF
2017-04-17T16:15:44+02:00 INFO Error publishing events (retrying): EOF
2017-04-17T16:15:45+02:00 DBG Events sent: 2

You can see here that "ket" by itself means nothing; it is actually the tail of "market", which is written correctly in /tracesPath/PAYMENTS.log. Filebeat cuts the message at exactly byte 650 and reads to the end of the line (byte 800). That is of course malformed JSON, and Logstash fails to parse it.
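
To make the race concrete, here is a tiny stand-alone simulation in Python (made-up paths and payloads - this is not Filebeat code, it just reproduces the same resume-from-a-saved-offset behaviour):

import os
import tempfile

# stand-in for /tracesPath/PAYMENTS.log
path = os.path.join(tempfile.gettempdir(), "PAYMENTS.log")

# Steps 1-2: the logger writes a ~650-byte JSON line; the reader consumes it
# and remembers the offset where it stopped.
with open(path, "w") as f:
    f.write('{"message":"' + "a" * 630 + '"}\n')
offset = os.path.getsize(path)      # the reader's saved state (~650 bytes)

# Step 4: the rollover recreates the file and a longer ~800-byte line is written.
with open(path, "w") as f:          # truncates / recreates the file
    f.write('{"market":"' + "b" * 780 + '"}\n')

# Steps 5-6: the reader resumes at the old offset and reads to EOF, getting
# only the tail of the new line - no longer valid JSON.
with open(path) as f:
    f.seek(offset)
    print(f.read())                 # something like 'bbb..."}' - malformed JSON

Run it and the printed fragment is the same kind of chopped-off message that ends up tagged _jsonparsefailure.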

And now the solution:

  • Instead of recreating a log file with exactly the same name as the previous one (that's the root cause of the problem), simply write the logs to /tracesPath/PAYMENTS_TIMESTAMP.log. If you add a timestamp to the file name, you never resume reading a new file from an old file's offset, because the file name is different, and Filebeat works perfectly (see the Logback sketch after this list).

  • Maybe changing the file name stops your rolling policy from working properly. In that case I would handle log rolling at the operating-system level rather than in the application, and only roll files that are no longer being written to.

  • The checks Filebeat makes before reading a line of a file, and the read itself, are not a single atomic CPU operation. In between, a Logback process (or whatever logger you use) can create a new file with the same name and write to it. The solution is to write into a different file, so Filebeat doesn't get confused.
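
As a sketch only (assuming a Logback TimeBasedRollingPolicy; the appender name, path and pattern are illustrative - adapt them to your setup), leaving out the fixed <file> element makes the timestamped file itself the active log, so a rollover never reuses the previous file name:

<appender name="PAYMENTS" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <!-- No <file> element: the active file is the one named by
             fileNamePattern, so each hour Filebeat sees a brand-new file
             instead of a recreated /tracesPath/PAYMENTS.log -->
        <fileNamePattern>/tracesPath/PAYMENTS.%d{yyyy-MM-dd_HH}.log</fileNamePattern>
        <maxHistory>50</maxHistory>
    </rollingPolicy>
</appender>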

P.S.: It would be nice if a moderator could edit my message, because my English is terrible.


Thanks a lot for providing all the details. Really appreciated.

What you describe above can happen if log truncation is used as the rotation algorithm and the new event is bigger than the previous file. The file name does not matter here, as Filebeat only tracks the inode / device id. So if the new file has the same inode, it is probably not a new file but one that was truncated.

I strongly recommend NOT using log truncation in any case, as it can always lead to data loss, independent of Filebeat.
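
For illustration only, an abridged entry from Filebeat's registry file on Linux looks roughly like this (the path, offset and numbers are placeholders): the path is recorded, but the identity used for resuming is the inode/device pair plus the byte offset.

    [
      {
        "source": "/tracesPath/PAYMENTS.log",
        "offset": 650,
        "FileStateOS": {
          "inode": 123456,
          "device": 2049
        }
      }
    ]

If rotation reuses that inode (truncate and rewrite), Filebeat treats it as the same file and resumes at the stored offset.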
