Filebeat inserting extra Unicode characters

This is the original entry in the log file:

2024-11-25 23:14:27,671 INFO o.s.w.s.c.WebSocketMessageBrokerStats [MessageBroker-1] WebSocketSession[0 current WS(0)-HttpStream(0)-HttpPoll(0), 0 total, 0 closed abnormally (0 connect failure, 0 send limit, 0 transport error)], stompSubProtocol[processed CONNECT(0)-CONNECTED(0)-DISCONNECT(0)], stompBrokerRelay[null], inboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], outboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], sockJsScheduler[pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 197]

I copied this file to my Windows machine and opened it in Notepad. The bottom of the Notepad window says the file is Unix (LF) UTF-8.

This is the same log record as shown in Kibana/Elasticsearch:

\u001b[30m2024-11-25 22:44:27,670\u001b[0;39m \u001b[34mINFO \u001b[0;39m [\u001b[34mMessageBroker-1\u001b[0;39m] \u001b[33mo.s.w.s.c.WebSocketMessageBrokerStats\u001b[0;39m: WebSocketSession[0 current WS(0)-HttpStream(0)-HttpPoll(0), 0 total, 0 closed abnormally (0 connect failure, 0 send limit, 0 transport error)], stompSubProtocol[processed CONNECT(0)-CONNECTED(0)-DISCONNECT(0)], stompBrokerRelay[null], inboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], outboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], sockJsScheduler[pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 196]\n

So somewhere between Filebeat parsing the log file, shipping its content to Elasticsearch, and Kibana displaying it, something is inserting extra characters like \u001b[30m and \u001b[34m.

These are the applicable settings in our filebeat-kubernetes.yaml file:

    - type: filestream
      id: ceo-api-dev1-container-logs
      paths:
        - /var/log/containers/ceo-api-*.log
      encoding: utf-8
      fields_under_root: true
      fields:
        data_stream.type: logs
        data_stream.dataset: ceo
        data_stream.namespace: api
        app_id: ceo-api-dev1
      parsers:
        - container: ~
      prospector:
        scanner:
          fingerprint.enabled: true
          symlinks: true
      file_identity.fingerprint: ~
      processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            namespace: ceo-dev1
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

According to the Filebeat docs, the encoding: utf-8 setting should have told Filebeat to parse the log file as utf-8 characters.

Is there a way to prevent these extra characters from being added, or are we stuck hacking around them with a Grok processor or adding Logstash to our setup to strip them out?

Are you sure this isn't from the source?

What you shared are ANSI escape codes for colors. Does your application generate logs in color? Which language is it written in? Some logging frameworks add colors by default.

Testing with echo -e, this is the result:

I don't think that there is anything in the stack that would add those characters in that way.
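To make the diagnosis concrete: the sequences in the Kibana record can be reproduced with echo -e, and, if needed, stripped downstream. This is a minimal sketch; the sed one-liner is my own addition (it assumes GNU sed's \x1b support) and is not from the thread:

```shell
# \u001b is the ESC character; ESC followed by "[<codes>m" is an
# ANSI SGR (color) sequence. ESC[34m switches the foreground to
# blue, ESC[0;39m resets attributes and restores the default color.
echo -e '\033[34mINFO\033[0;39m plain'   # renders "INFO" in blue

# The sequences can also be stripped after the fact (GNU sed):
echo -e '\033[34mINFO\033[0;39m plain' | sed 's/\x1b\[[0-9;]*m//g'
```

If the sequences render as colors in your terminal, they were already present in the bytes Filebeat read; nothing in the Filebeat-to-Kibana path added them.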

Thanks!

It turns out the application had a logback-spring.xml file that was configured to use PatternLayout. The pattern contained color conversion words, which logback encodes as ANSI escape sequences when writing each log line.

Deleting the color conversion words solved the issue.
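For anyone hitting the same thing: a PatternLayout using logback's composite color converters looks roughly like this. This is a sketch only; the exact pattern from the application isn't shown in the thread, but the colors in the captured record (\u001b[30m black, \u001b[34m blue, \u001b[33m yellow) are consistent with wrappers like these:

```xml
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
  <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
    <!-- %black/%blue/%yellow are logback color converters; each one
         wraps its argument in ANSI escape sequences such as
         \u001b[30m ... \u001b[0;39m -->
    <pattern>%black(%d{yyyy-MM-dd HH:mm:ss,SSS}) %blue(%-5level) [%blue(%thread)] %yellow(%logger{40}): %msg%n</pattern>
  </encoder>
</appender>
```

Removing the %color(...) wrappers while keeping the inner conversion words (%d, %-5level, %thread, %logger, %msg) produces plain, uncolored output that ships cleanly through Filebeat.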