Description
We use Filebeat to collect data from about 300 hosts running several operating systems (Windows, Ubuntu, macOS).
We collect around 125 million events per week from .ndjson files; Filebeat ships them to Elasticsearch through Logstash.
The problem is that we get sporadic JSON decoding errors from the Ubuntu hosts, such as:
Error decoding JSON: invalid character '_' in literal true (expecting 'r')
Error decoding JSON: invalid character '_' in literal true (expecting 'r')
Error decoding JSON: invalid character 'T' looking for beginning of value
Error decoding JSON: invalid character 'l' looking for beginning of value
Error decoding JSON: invalid character 'a' looking for beginning of value
Error decoding JSON: invalid character 's' in literal null (expecting 'u')
Error decoding JSON: json: cannot unmarshal number into Go value of type map[string]interface {}
Out of the 125M events per week, about 1K are Filebeat decoding errors. We have not seen this on any other OS, only Ubuntu.
OS breakdown for the 125M events per week:
- Windows: 66%
- Ubuntu: 33%
- Others (such as macOS): 1%
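Because every decoding error comes from an Ubuntu host, the error rate is better judged against Ubuntu traffic than against the weekly total. The sketch below only reworks the rounded figures quoted in this report; the constant names are ours.

```python
# Rounded weekly figures taken from the report above.
TOTAL_EVENTS = 125_000_000
DECODE_ERRORS = 1_000
UBUNTU_SHARE_PCT = 33

# Integer arithmetic keeps the Ubuntu event count exact.
ubuntu_events = TOTAL_EVENTS * UBUNTU_SHARE_PCT // 100

overall_rate = DECODE_ERRORS / TOTAL_EVENTS
ubuntu_rate = DECODE_ERRORS / ubuntu_events

print(f"Ubuntu events per week: {ubuntu_events:,}")
print(f"Error rate over all events: {overall_rate:.4%}")
print(f"Error rate over Ubuntu events: {ubuntu_rate:.4%}")
```

That is roughly 0.0008% of all events, or about 0.0024% of the Ubuntu events.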
 
The decoding error is not reproducible. We have analyzed several of the affected .ndjson files and they are valid NDJSON. We have even tried copying the same file back into the path watched by the Filebeat input so that it is re-harvested, and the problem does not reproduce.
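For anyone who wants to repeat the validity check, here is a minimal sketch of the kind of line-by-line validation we mean. The function name `find_invalid_lines` is hypothetical, and Python's parser is slightly more lenient than the Go decoder Filebeat uses (it accepts `NaN`, for example), so a clean result here is a strong hint rather than proof.

```python
import json
from pathlib import Path


def find_invalid_lines(path):
    """Return (byte_offset, error_text) for every line that is not a JSON object."""
    problems = []
    offset = 0  # byte offset of the start of the current line
    with Path(path).open("rb") as handle:
        for raw in handle:
            line = raw.rstrip(b"\r\n")
            if line:  # ignore empty lines
                try:
                    value = json.loads(line)
                except ValueError as exc:  # covers JSON and UTF-8 decoding errors
                    problems.append((offset, str(exc)))
                else:
                    # A bare number or string is valid JSON but not an event object,
                    # which matches the "cannot unmarshal number" error above.
                    if not isinstance(value, dict):
                        problems.append((offset, "not a JSON object"))
            offset += len(raw)
    return problems
```

An empty list means every non-empty line parsed as a JSON object.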
The log.offset at which the decoding error happens is not consistent either: we see offsets at the beginning of the files, in the middle, and almost at the end.
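One way to inspect a reported offset is to check whether it falls on a line boundary in the original file. The sketch below assumes that log.offset is the byte position at which the reported line begins; both helper names are hypothetical.

```python
from pathlib import Path


def offset_is_line_start(path, offset):
    """True if `offset` is 0 or the byte just before it is a newline."""
    if offset == 0:
        return True
    with Path(path).open("rb") as handle:
        handle.seek(offset - 1)
        return handle.read(1) == b"\n"


def line_containing(path, offset):
    """Return (line_start_offset, line_bytes) for the line that covers `offset`."""
    data = Path(path).read_bytes()
    # rfind returns -1 when no newline precedes the offset, so start becomes 0.
    start = data.rfind(b"\n", 0, offset) + 1
    end = data.find(b"\n", offset)
    if end == -1:
        end = len(data)
    return start, data[start:end]
```

If `offset_is_line_start` returns False for an error event, `line_containing` shows the full record that the offset lands inside.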
Example
Here is an excerpt from the Filebeat log for one occurrence of this issue (private details have been replaced by <>):
2020-12-15T13:00:02.885-0600    INFO    log/harvester.go:302    Harvester started for file: <path>/harvest/duration-ckiqcovu500009povay7jlkbg.ndjson
2020-12-15T13:00:02.886-0600    INFO    log/harvester.go:302    Harvester started for file: <path>/harvest/usage-ckiqcoumv000086ov90tbaujp.ndjson
2020-12-15T13:00:02.886-0600    INFO    log/harvester.go:302    Harvester started for file: <path>/harvest/usage-ckiqcothy00006uovasrxxjis.ndjson
2020-12-15T13:00:02.886-0600    INFO    log/harvester.go:302    Harvester started for file: <path>/harvest/usage-ckiqcow0u00039pov70wjba7t.ndjson
2020-12-15T13:00:02.887-0600    INFO    log/harvester.go:302    Harvester started for file: <path>/harvest/usage-ckiqcosls000066ovi7m1iic6.ndjson
2020-12-15T13:00:02.887-0600    INFO    log/harvester.go:302    Harvester started for file: <path>/harvest/usage-ckiqcoqac00002xovjtuxf8gc.ndjson
2020-12-15T13:00:02.887-0600    INFO    log/harvester.go:302    Harvester started for file: <path>/harvest/usage-ckiqcorjo00003yov0af7gp2s.ndjson
2020-12-15T13:00:02.887-0600    ERROR   [reader_json]   readjson/json.go:57 Error decoding JSON: invalid character 'p' looking for beginning of value
2020-12-15T13:00:02.888-0600    INFO    log/harvester.go:302    Harvester started for file: <path>/harvest/usage-ckiqcowhu0000c2ovubpqjr9h.ndjson
2020-12-15T13:00:02.888-0600    ERROR   [reader_json]   readjson/json.go:57 Error decoding JSON: invalid character 'a' in literal true (expecting 'r')
2020-12-15T13:00:02.890-0600    ERROR   [reader_json]   readjson/json.go:57 Error decoding JSON: json: cannot unmarshal number into Go value of type map[string]interface {}
2020-12-15T13:00:02.892-0600    INFO    [publisher_pipeline_output] pipeline/output.go:143  Connecting to backoff(async(tcp://<logstash_host>:5062))
Here is one of the events that carries the error (private details have been replaced by <>):
timestamp               December 15th 2020, 13:00:02.888
@version                1
agent.type              filebeat
agent.version           7.10.0
data.error.message      Error decoding JSON: invalid character 'a' in literal true (expecting 'r')
data.error.type         json
host.architecture       x86_64
host.containerized      false
host.os.codename        focal
host.os.family          debian
host.os.kernel          5.4.0-56-generic
host.os.name            Ubuntu
host.os.platform        ubuntu
host.os.version         20.04.1 LTS (Focal Fossa)
log.file.path           <path>/harvest/usage-ckiqcowhu0000c2ovubpqjr9h.ndjson
log.offset              206
message                 tallers/linuxdesktop/<value>","<value>"]]},"subroutine":"<value>"}
tags                    beats_input_codec_plain_applied
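The message field in this event begins in the middle of a token and ends with the closing brace of a record, so it looks like the tail of a line rather than a whole line. A minimal sketch for checking that against the source file follows; `locate_fragment` is a hypothetical helper, and it needs the unredacted message text.

```python
from pathlib import Path


def locate_fragment(path, fragment):
    """Find `fragment` in the file and describe the line that contains it.

    Returns None if the fragment is absent, otherwise a dict with the byte
    offset where the enclosing line starts, the byte offset of the fragment,
    and how many bytes of that line come before the fragment.
    """
    data = Path(path).read_bytes()
    pos = data.find(fragment.encode("utf-8"))
    if pos == -1:
        return None
    # rfind returns -1 when the fragment is on the first line, giving start 0.
    line_start = data.rfind(b"\n", 0, pos) + 1
    return {
        "line_start": line_start,
        "fragment_offset": pos,
        "skipped_bytes": pos - line_start,
    }
```

A non-zero `skipped_bytes` would mean the decoder was handed a line whose beginning is missing, and `fragment_offset` can be compared with the event's log.offset.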
Filebeat Details
- Version: 7.10.0 (we saw this in 7.7.0 too)
 
YAML config file (private details have been replaced by <>):
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - <path>/harvest/*-*.ndjson
    backoff: 10s
    json.keys_under_root: false
    json.add_error_key: true
processors:
  - drop_event:
      when:
        contains:
          json.error.message: "Error decoding JSON: EOF"
  - rename:
      fields:
        - from: "json"
          to: "data"
  - add_fields:
      target: host
      fields:
        metrics:
          running_version: 2.0.2
  - add_host_metadata: ~
  - add_locale: ~
  - rename:
      fields:
        - from: "event.timezone"
          to: "host.timezone"
  - drop_fields:
      fields: [
          "agent.id",
          "agent.ephemeral_id",
          "agent.name",
          "agent.hostname",
          "ecs",
          "host.id",
          "host.mac",
          "host.hostname",
          "host.os.build",
          "input"
      ]
      ignore_missing: true
queue.disk:
  max_size: 256MiB
  retry_interval: 5s
  max_retry_interval: 60s
output.logstash:
  hosts:
    - <list_of_hosts>
  loadbalance: true
  compression_level: 4