Cp865 encoded logs

eikeskog · October 12, 2018, 12:41pm

Hi,
Some of the files I am trying to read are encoded in cp865.
I tried to set encoding: "cp865", but it seems that cp865 is not supported.

# Configure the file encoding for reading files with international characters
# following the W3C recommendation for HTML5 (http://www.w3.org/TR/encoding).
# Some sample encodings:
#   plain, utf-8, utf-16be-bom, utf-16be, utf-16le, big5, gb18030, gbk,
#   hz-gb-2312, euc-kr, euc-jp, iso-2022-jp, shift-jis, ...
# encoding: utf-8

I have tried other encodings, but none will output the Nordic characters correctly.
Using filebeat-5.6.2
input_type: log
output.redis

What are my options here?

steffens · October 15, 2018, 11:41pm

Checking our dependencies, IBM865 (CP 865) is defined, but it's not an official HTML5 encoding, which is why it's not included in the tables.

Many other IBM encodings are missing as well. Please open a GitHub issue about missing codecs in filebeat.

Potential workaround. This is quite hacky and might actually not work out (I'd prefer to fix the bug in filebeat):
As filebeat already reads the contents and tries to serialize them to UTF-8, we'd have to use some 8-bit code page that is ASCII compatible for all values up to 0x7f. Reading with the utf-8 codec might combine 2 consecutive bytes into one character, giving you something non-reconstructible (plus, it might insert control characters for invalid code points).

You can configure iso-8859-1 (checking the lib, it's actually windows-1252, but this should be no problem). The official code map of iso-8859-1 does not specify all required mappings, but it seems some code points are defined in the decoder source code. Now you will have some wrong characters.

Next, one can use the translate or ruby filter in Logstash to fix the wrong code points. For example, in CP865 the ø character has the byte value 0x9B (155) and the Unicode code point U+00F8. In the windows-1252 mapping, byte 0x9B has the Unicode code point U+203A. That is, we can add a mapping of U+203A => U+00F8 to our translation table and this way create the correct UTF-8 encoded text.
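To make the translation-table idea concrete, here is a minimal Python sketch (not filebeat or Logstash code). It assumes Python's built-in `cp1252` and `cp865` codecs match the mappings filebeat uses closely enough; the names `build_fix_table` and `fix_text` are made up for illustration. Python's `cp1252` leaves a few bytes undefined, so the sketch falls back to `latin-1` for those, which is an assumption about how the decoder behaves.

```python
def build_fix_table(wrong_codec="cp1252", right_codec="cp865"):
    """Map each character produced by the wrong codec to the intended one."""
    table = {}
    for b in range(0x80, 0x100):  # bytes below 0x80 are plain ASCII in both
        raw = bytes([b])
        try:
            wrong = raw.decode(wrong_codec)
        except UnicodeDecodeError:
            # Assumption: undefined bytes pass through as C1 control characters.
            wrong = raw.decode("latin-1")
        right = raw.decode(right_codec)
        if wrong != right:
            table[ord(wrong)] = right
    return table


# Hypothetical module-level table, built once and reused for every line.
FIX_TABLE = build_fix_table()


def fix_text(text):
    """Repair text that was CP865 on disk but decoded as windows-1252."""
    return text.translate(FIX_TABLE)


if __name__ == "__main__":
    raw = "blåbærsyltetøy".encode("cp865")
    misread = raw.decode("cp1252")
    print(misread, "->", fix_text(misread))
```

The same table could be dumped into a dictionary for the translate filter. Keep in mind that a character which was legitimately in the text and also appears as a key in the table would be rewritten too, so this only makes sense for files known to be CP865.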

eikeskog · October 16, 2018, 11:22am

Thanks for the reply.

I actually solved this by setting encoding in filebeat to cp866 and then just replace the affected characters. Ex: gsub => ["message", "Ы", "ø"]
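For anyone who wants to work out the other replacement pairs, a small Python sketch like the following can list them. It assumes Python's `cp865` and `cp866` codecs agree with what filebeat does, and the character list and the name `gsub_pairs` are only illustrative, not something from my actual setup.

```python
# Characters to recover; extend this list as needed (illustrative selection).
NORDIC = "æøåÆØÅ"


def gsub_pairs(chars=NORDIC):
    """Return (seen, wanted) pairs: what cp866 shows for each CP865 character."""
    pairs = []
    for ch in chars:
        raw = ch.encode("cp865")    # the byte as stored in the log file
        seen = raw.decode("cp866")  # what shows up when read as cp866
        pairs.append((seen, ch))
    return pairs


if __name__ == "__main__":
    for seen, wanted in gsub_pairs():
        print(f'"message", "{seen}", "{wanted}"')
```

Each printed line is one pattern/replacement pair that can go into the gsub array.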

I tried other encodings in filebeat, but they rendered all unknown characters with the same code.

I will open a GitHub issue concerning this.

steffens · October 16, 2018, 9:06pm

I actually solved this by setting encoding in filebeat to cp866 and then just replace the affected characters. Ex: gsub => ["message", "Ы", "ø"]

Didn't expect cp866 to be close enough. But great you found a workaround.

I will open a GitHub issue concerning this.

Thank you!

system · November 13, 2018, 9:06pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
