Issue with Character Encoding When Receiving Data from RSYSLOG in Logstash 8.4.3

Hello,

I'm currently using Logstash version 8.4.3 and encountering an issue when processing data received from RSYSLOG. The error message I'm seeing is as follows:

[WARN ][logstash.codecs.plain][main][b162f9e29529bc0184cfb35d26d7bc3a946f5283f15e0938bfcbb4bdce0da719] Received an event that has a different character encoding than you configured. ...omitted... catdesc="フリーウェア\xE3", :expected_charset=>"UTF-8"}

Upon inspecting the logs, the text that should read "フリーウェア・ソフトウェアダウンロード" is truncated partway through, and the error message above follows.

Additionally, I intentionally created binary data that mimics this truncation and sent it through RSYSLOG to Logstash, which resulted in the error being reproduced 100% of the time.

Based on this, am I correct in understanding that Logstash expects string data to be in UTF-8 encoding and that this error is inevitable when the data gets truncated at "\xE3", i.e. in the middle of a multi-byte character?

If that's the case, are there any workarounds or solutions available on the Logstash side to handle this issue? Note that changing the character encoding to us-ascii is not an option due to our requirements.

Thank you for your assistance.

The plain codec has a charset option that allows you to specify one of dozens of encodings other than UTF-8.

Thank you for your prompt reply. I understand that the plain codec has a charset option that allows specifying various encodings other than UTF-8. However, the data being sent is already in UTF-8, and since I need to handle Japanese characters, using an encoding other than UTF-8 is not a viable option for me.

Given these constraints, my question is whether there is a way to prevent or handle the issue of truncated UTF-8 character sequences within Logstash. Is there a mechanism or configuration that allows Logstash to either gracefully handle these incomplete sequences or ensure that the truncation does not lead to errors?

I appreciate your guidance on how to address this specific challenge.

Which input are you using? Some inputs split data into chunks, which can result in breakage at the boundaries between chunks. The data the input is consuming is not valid UTF-8. If the data being sent is UTF-8, then something is breaking it in between.
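To illustrate the chunk-boundary point, here is a minimal Python sketch (not Logstash code; Logstash itself runs on JRuby). It assumes a stream that happens to be split in the middle of a three-byte character. The helper name decode_chunks is hypothetical. An incremental decoder holds incomplete trailing bytes until the next chunk arrives, whereas decoding each chunk on its own would fail.

```python
import codecs


def decode_chunks(chunks):
    """Decode a sequence of byte chunks as one continuous UTF-8 stream."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    # Incomplete trailing bytes are buffered inside the decoder
    # until the following chunk completes the character.
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if the stream ends mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# The split falls after the first byte (E3) of the second character.
chunks = [bytes.fromhex("E38395E3"), bytes.fromhex("83AA")]
print(decode_chunks(chunks))
```

If the first chunk were decoded by itself, the trailing E3 would be reported as invalid, which matches the symptom in the warning above.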

The input codec is the default setting.

What I would like to ask is whether there are any remedies available in the event that data is missing.

For example:
E3 83 95 E3 83 AA
This is loaded normally as "フリ".

But:
E3 83 95 E3 83 AA E3
This case fails, because the trailing byte (E3) indicates that another character follows, but its remaining bytes are missing.

I would like to know if there are any remedies for cases where this fails (for example, due to network problems).
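The two byte sequences above can be checked with a short Python sketch. This is only an illustration of the UTF-8 rules involved, not a Logstash feature; the function name trim_incomplete_utf8_tail is hypothetical, and it assumes the only damage is an incomplete character at the very end of the message. Something like it would have to run before the bytes reach the codec, for example on the sending side.

```python
def trim_incomplete_utf8_tail(data: bytes) -> bytes:
    """Drop a trailing, incomplete UTF-8 sequence; leave other bytes untouched."""
    # Step back over at most 3 continuation bytes (10xxxxxx) to reach the last lead byte.
    i = len(data) - 1
    steps = 0
    while i >= 0 and steps < 3 and (data[i] & 0xC0) == 0x80:
        i -= 1
        steps += 1
    if i < 0:
        return data  # empty input or only continuation bytes: nothing to trim
    lead = data[i]
    if lead >= 0xF0:
        needed = 4
    elif lead >= 0xE0:
        needed = 3  # E3 falls here: a three-byte character
    elif lead >= 0xC0:
        needed = 2
    else:
        needed = 1  # ASCII, or a stray continuation byte that is left as-is
    present = len(data) - i
    return data[:i] if present < needed else data


complete = bytes.fromhex("E38395E383AA")
truncated = bytes.fromhex("E38395E383AAE3")

print(complete.decode("utf-8"))                              # decodes cleanly
print(trim_incomplete_utf8_tail(truncated).decode("utf-8"))  # trailing E3 removed first
print(truncated.decode("utf-8", errors="replace"))           # E3 becomes U+FFFD instead
```

Trimming loses the partial character, and errors="replace" marks it with U+FFFD; neither recovers the bytes that were cut off, so the truncation itself still has to be fixed at its source.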
