Encoding utf-8 doesn't honor BOM

prehor · April 7, 2016, 12:13pm

Hi,

Windows tools often prepend a UTF-8 BOM to text files, which is legal; see "Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?". In my case it happens with Exchange 2010 Message Tracking logs.

Filebeat should strip the UTF-8 BOM from files read with UTF-8 encoding, but it does not, so the BOM appears in the message field for the first line of the file:

{"@timestamp":"2016-04-07T11:58:36.922Z","beat":{"hostname":"XXXXX","name":"XXXXX"},"count":1,"fields":null,"input_type":"log","message":"<U+FEFF>#Software: Microsoft Exchange Server","offset":0,"source":"exchange/MSGTRK20160405-1.LOG","type":"exchange"}

Filebeat config:

filebeat:
  prospectors:
    -
      document_type: exchange
      input_type: log
      paths:
        - exchange/MSGTRK2*.LOG
      encoding: utf-8

output:
  file:
    path: logstash/output
    name: exchange
Hex dump of first line in file:

00000000  ef bb bf 23 53 6f 66 74  77 61 72 65 3a 20 4d 69  |...#Software: Mi|
00000010  63 72 6f 73 6f 66 74 20  45 78 63 68 61 6e 67 65  |crosoft Exchange|
00000020  20 53 65 72 76 65 72 0d  0a                       | Server..|

The first three bytes, EF BB BF (the UTF-8 encoded BOM), are decoded to the Unicode character U+FEFF (which, encoded as the bytes FE FF, also serves as the BOM for UTF-16), and it appears in the message field.
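
To make the decoding behaviour concrete, here is a small Python sketch (not Filebeat code) that uses the bytes from the hex dump above. It only illustrates that a plain UTF-8 decoder keeps the BOM as U+FEFF, while a BOM-aware decoder such as Python's "utf-8-sig" codec drops it; the variable names are made up for this example.

```python
import codecs

# First line of the log file, copied from the hex dump (EF BB BF, then the text).
raw = b"\xef\xbb\xbf#Software: Microsoft Exchange Server\r\n"

# The file really does start with the UTF-8 encoded BOM.
has_bom = raw.startswith(codecs.BOM_UTF8)

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec drops one leading BOM if it is present.
stripped = raw.decode("utf-8-sig")

print(has_bom)
print(repr(plain[0]))
print(repr(stripped))
```

The second decode gives the message text one would expect to see in the event.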

The UTF-8 BOM could also be used to detect UTF-8 encoding.
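
A rough idea of what BOM-based detection could look like, written in Python purely as an illustration. The function name sniff_bom and the fallback default are assumptions of this sketch, and only the UTF-8 and UTF-16 BOMs are checked; it is not how Filebeat handles encodings.

```python
import codecs

# BOM byte sequences mapped to codec names.
# Only UTF-8 and UTF-16 are covered in this sketch.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(head: bytes, default: str = "utf-8"):
    """Return (encoding, bom_length) guessed from the first bytes of a file.

    Hypothetical helper: falls back to `default` with length 0 when no BOM is found.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return default, 0


data = b"\xef\xbb\xbf#Software: Microsoft Exchange Server\r\n"
encoding, skip = sniff_bom(data)
text = data[skip:].decode(encoding)
print(encoding, skip, repr(text))
```

Returning the BOM length along with the encoding lets the caller skip those bytes before decoding, so the BOM never reaches the decoded text.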

steffens · April 7, 2016, 12:33pm

We do not try to detect which encoding we're dealing with (apart from the utf16-bom codec), as the BOM is used very rarely. Still, the BOM should be removed from events. Please report this as a bug on GitHub. The fix may 'just' involve updating the library used for decoding the different character sets.
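
As a sketch of what "removing the BOM from events" might mean in practice, the Python snippet below strips a UTF-8 BOM only from the line at offset 0 and leaves every other line alone. The function read_events and the second sample line are hypothetical; this is an illustration, not the Filebeat implementation or the eventual fix.

```python
import codecs
import io


def read_events(stream, encoding="utf-8"):
    """Yield (offset, message) per line; strip a UTF-8 BOM only at offset 0."""
    offset = 0
    for raw in stream:
        line = raw
        if offset == 0 and line.startswith(codecs.BOM_UTF8):
            # Drop the three BOM bytes before decoding the first line.
            line = line[len(codecs.BOM_UTF8):]
        yield offset, line.decode(encoding).rstrip("\r\n")
        # Offsets count raw bytes, so the BOM still counts towards them.
        offset += len(raw)


# First line as in the hex dump; the second line is made up for the example.
data = b"\xef\xbb\xbf#Software: Microsoft Exchange Server\r\n#Version: 1\r\n"
events = list(read_events(io.BytesIO(data)))
for offset, message in events:
    print(offset, repr(message))
```

Offsets are kept in raw bytes here so that they still point at the right place in the file even though the BOM is missing from the message text.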

prehor · April 7, 2016, 1:11pm

OK, I filed bug report #1349.

steffens · April 7, 2016, 4:19pm

thanks
