Cp865 encoded logs


(Svein Tore Eikeskog) #1

Hi,
Some of the files I am trying to read are encoded in cp865.
I tried to set encoding: "cp865", but it seems that cp865 is not supported.

# Configure the file encoding for reading files with international characters
# following the W3C recommendation for HTML5 (http://www.w3.org/TR/encoding).
# Some sample encodings:
#   plain, utf-8, utf-16be-bom, utf-16be, utf-16le, big5, gb18030, gbk,
#   hz-gb-2312, euc-kr, euc-jp, iso-2022-jp, shift-jis, ...
# encoding: utf-8

I have tried other encodings, but none will output the Nordic characters correctly.
Using filebeat-5.6.2
input_type: log
output.redis

What are my options here?


(Steffen Siering) #2

Checking our dependencies, IBM865 (CP 865) is defined, but it's not an official HTML5 encoding, which is why it's not included in the tables :frowning:

Many other IBM encodings are missing as well. Please open a GitHub issue about missing codecs in filebeat.

Potential workaround. This is quite hacky and might actually not work out (I'd prefer to fix the bug in filebeat):
As filebeat already reads the contents and tries to serialize them to UTF-8, we'd have to use some 8-bit code page that is ASCII-compatible for all values up to 0x7f. Reading with the utf-8 codec might combine 2 consecutive bytes into one character, getting you something non-reconstructible (plus, it might insert marker characters for invalid code points).

You can configure iso-8859-1 (checking the lib, it's actually windows-1252, but this should be no problem). The official code map of iso-8859-1 does not specify all required mappings, but it seems some additional code points are defined in the decoder source code. Now you will have some wrong characters. Next, one can use the translate or ruby filter to fix the wrong code points.

For example, in CP865 the ø character is byte 0x9B (155), which is Unicode code point U+00F8. In the windows-1252 mapping, byte 0x9B is Unicode code point U+203A. That is, we can add a mapping of U+203A => U+00F8 to our translation table and this way create the correct UTF-8 encoded text.
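
To make the translation-table idea concrete, here is a rough Python sketch (Python is used only because its standard library ships a cp865 codec; this is not something filebeat or Logstash runs). The helper names `misdecode` and `repair` are made up for illustration, and the latin-1 fallback for the few bytes Python's cp1252 codec leaves undefined is an assumption meant to approximate how a windows-1252 decoder that maps every byte would behave.

```python
def as_windows_1252(byte: int) -> str:
    """Decode one byte as windows-1252, falling back to latin-1 (assumption)
    for the bytes that Python's cp1252 codec leaves undefined."""
    raw = bytes([byte])
    try:
        return raw.decode("cp1252")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# For every high byte: the character the wrong decoder produced
# mapped to the character CP865 actually intended.
TABLE = {
    ord(as_windows_1252(b)): bytes([b]).decode("cp865")
    for b in range(0x80, 0x100)
}


def misdecode(raw: bytes) -> str:
    """Simulate reading CP865 bytes with the windows-1252 decoder."""
    return "".join(as_windows_1252(b) for b in raw)


def repair(text: str) -> str:
    """Undo the mis-decoding by translating each wrong code point."""
    return text.translate(TABLE)


if __name__ == "__main__":
    original = "Tromsø, Ålesund, blåbær"
    garbled = misdecode(original.encode("cp865"))
    print(repr(garbled))
    print(repair(garbled))
```

The entry for U+203A in `TABLE` is the U+203A => U+00F8 mapping described above. Bytes below 0x80 are left alone because both code pages agree with ASCII there.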


(Svein Tore Eikeskog) #3

Thanks for the reply.

I actually solved this by setting the encoding in filebeat to cp866 and then just replacing the affected characters. Ex: gsub => ["message", "Ы", "ø"]

I tried other encodings in filebeat, but they rendered all unknown characters with the same code.
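
For anyone wanting the full list of replacements, the pairs can be derived rather than found by trial and error. The following is a small Python sketch under the assumption that Python's cp865 and cp866 codecs match what the logs and filebeat use; the function name `gsub_pairs` is invented here, and the printed lines only imitate the gsub syntax shown above.

```python
def gsub_pairs(wrong: str = "cp866", right: str = "cp865"):
    """Return (seen, intended) pairs for every high byte where the two
    code pages disagree: 'seen' is what the wrong decoder produced,
    'intended' is what the file actually contained."""
    pairs = []
    for b in range(0x80, 0x100):
        raw = bytes([b])
        seen, intended = raw.decode(wrong), raw.decode(right)
        if seen != intended:
            pairs.append((seen, intended))
    return pairs


if __name__ == "__main__":
    # Only print the Nordic letters; the full list is much longer.
    for seen, intended in gsub_pairs():
        if intended in "æøåÆØÅ":
            print(f'gsub => ["message", "{seen}", "{intended}"]')
```

One thing to watch: with many chained replacements, a character produced by one substitution could in principle be matched again by a later one, so a single-pass translation may be safer than a long gsub list.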

I will open a GitHub issue concerning this.


(Steffen Siering) #4

I actually solved this by setting the encoding in filebeat to cp866 and then just replacing the affected characters. Ex: gsub => ["message", "Ы", "ø"]

Didn't expect cp866 to be close enough. But great you found a workaround.

I will open a GitHub issue concerning this.

Thank you!

