Encoding utf-8 doesn't honor BOM


Windows often prepends a UTF-8 BOM to text files, which is legal; see "Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?". In my case it happens in Exchange 2010 Message Tracking logs.

Filebeat should strip the UTF-8 BOM from files read with UTF-8 encoding, but it doesn't, so the BOM appears in the message field for the first line of the file:

{"@timestamp":"2016-04-07T11:58:36.922Z","beat":{"hostname":"XXXXX","name":"XXXXX"},"count":1,"fields":null,"input_type":"log","message":"<U+FEFF>#Software: Microsoft Exchange Server","offset":0,"source":"exchange/MSGTRK20160405-1.LOG","type":"exchange"}

Filebeat config:

document_type: exchange
input_type: log
- exchange/MSGTRK2*.LOG
encoding: utf-8

path: logstash/output
name: exchange

Hex dump of first line in file:

00000000  ef bb bf 23 53 6f 66 74  77 61 72 65 3a 20 4d 69  |...#Software: Mi|
00000010  63 72 6f 73 6f 66 74 20  45 78 63 68 61 6e 67 65  |crosoft Exchange|
00000020  20 53 65 72 76 65 72 0d  0a                       | Server..|

The first three bytes, EF BB BF (the UTF-8 encoded BOM), are decoded to the Unicode character U+FEFF (which also serves as the BOM in UTF-16 encoding), and that character appears in the message field.
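To make the decoding step concrete, here is a small Python sketch (not Filebeat's own Go code) that feeds the bytes from the hex dump above through two standard Python codecs. The variable names are mine; the point is only that a plain UTF-8 decode keeps the BOM as U+FEFF, while a BOM-aware decode drops it.

```python
# Bytes of the first line, copied from the hex dump in this post.
raw = bytes.fromhex(
    "ef bb bf 23 53 6f 66 74 77 61 72 65 3a 20 4d 69 "
    "63 72 6f 73 6f 66 74 20 45 78 63 68 61 6e 67 65 "
    "20 53 65 72 76 65 72 0d 0a"
)

# Plain UTF-8 decoding treats EF BB BF as an ordinary character, U+FEFF.
plain = raw.decode("utf-8")

# Python's "utf-8-sig" codec skips a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")

print(repr(plain[0]))   # the leading U+FEFF character
print(repr(stripped))   # the line without the BOM
```

This is roughly the behavior I would expect from Filebeat for encoding: utf-8, although how it should be implemented there is a separate question.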

The UTF-8 BOM could also be used to detect UTF-8 encoding.
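A minimal sketch of that idea, again in Python and only as an illustration: detect_bom is a hypothetical helper of mine, not anything Filebeat provides. It looks at the first bytes of a file and reports which encoding the BOM suggests and how many bytes to skip. UTF-32 BOMs are deliberately left out, since the UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE one.

```python
import codecs

# Known BOMs and the Python codec name each one suggests.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(head: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM leads `head`."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage: skip the BOM bytes before decoding the rest of the line.
head = b"\xef\xbb\xbf#Software: Microsoft Exchange Server\r\n"
encoding, skip = detect_bom(head)
line = head[skip:].decode(encoding or "utf-8")
```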

We do not try to detect which encoding we're dealing with (apart from the utf16-bom codec), as a BOM is used very rarely. Still, the BOM should be removed from events. Please report this as a bug on GitHub. The fix may involve "just" updating the library used for decoding the different character sets.

OK, I filed bug report #1349.