Issues with encoding

I came across a major issue( atleast in my case ). The issue is as explained below and I am assuming it has to do with some sort of encoding.

The logs in my case which are being parsed contain special characters and such. See below the actual line from my log:

START,Gooseberry-SA,0x0001FFC100DB7646,3068140336,GMT+05:30-Calcutta,11/20/2015,11:56:51.0,0,0,0,VoIP,IP-TO-IP,DEFAULT,,,9977554432,,0,,0,,0,,RL_BELUR_EXT_EMS,1,GOOSEBERRY-SA:BELUR_EXT_EMS,10.54.26.102,10.54.80.143,BELUR_INT_EMS,,10.54.24.101:38262/10.54.80.140:6000,,10.54.26.101:38134/10.54.80.143:6000,0,,,0x0080023A,,,,2,"SIP,1-32415@10.54.80.140,%22sipp%22;tag=32415SIPpTag001,%22sut%22;tag=gK0cfd65f5,0,,,,sip:9977554432@10.54.24.102:5060,,,,sip:sipp@10.54.80.140:5555,sip:9977554432@10.54.24.102:5060,,,,,,,0,0,,0,0,,,,,,,,1,0,0,0,,,,,,,,0,,,,,,,,,0,,,,,,,,,,,",12,12,0,5,,,0x0a,9977554432,1,1,,BELUR_EXT_EMS,"SIP,474775912_58694532@10.54.26.102,%22sipp%22;tag=gK0c7d662e,;tag=4829SIPpTag0114382283,0,,,,sip:+19977554432@10.54.80.143:6781;user=phone,,,,sip:sipp@10.54.26.102:5060,sip:10.54.80.143:6781;transport=UDP,,,,,,,0,0,,0,0,,,,,,,,1,0,0,0,,,,,,,,0,,,,,,,,,0,,,,,,,,,,,",,110,,,1,1,,,2,0x1C4C8168,0,,,,,,0,,,,1,,,,,,,6,,,,"sipp",2,1,1,1,1,,0,,,1,7,1,,,10.54.24.102,10.54.80.140,60121,16,8,,,,,,,,,0,,,TANDEM,,,,,,,13,1,,,,,,,,,,,,,,,,0,,,,,,,,0,,,"2,1,0,3",0,,,,,,,,,,,,,,,,,,,,,,,,,,

See below the output received on logstash/redis:

"{"@timestamp":"2015-11-20T11:56:53.525Z","beat":{"hostname":"Gooseberry","name":"Gooseberry"},"count":1,"fields":{"sbc_version":"5.0.0"},"input_type":"log","message":"0,,,1,1,,,2,0x1C4C8168,0,,,,,,0,,,,1,,,,,,,6,,,,\"sipp\",2,1,1,1,1,,0,,,1,7,1,,,10.54.24.102,10.54.80.140,60121,16,8,,,,,,,,,0,,,TANDEM,,,,,,,13,1,,,,,,,,,,,,,,,,0,,,,,,,,0,,,\"2,1,0,3\",0,,,,,,,,,,,,,,,,,,,,,,,,,,\n143,BELUR_INT_EMS,,10.54.24.101:38262/10.54.80.140:6000,,10.54.26.101:38134/10.54.80.143:6000,0,,,0x0080023A,,,,2,\"SIP,1-32415@10.54.80.140,%22sipp%22\u003csip:sipp@10.54.80.140:5555\u003e;tag=32415SIPpTag001,%22sut%22\u003csip:9977554432@10.54.24.102:5060\u003e;tag=gK0cfd65f5,0,,,,sip:9977554432@10.54.24.102:5060,,,,sip:sipp@10.54.80.140:5555,sip:9977554432@10.54.24.102:5060,,,,,,,0,0,,0,0,,,,,,,,1,0,0,0,,,,,,,,0,,,,,,,,,0,,,,,,,,,,,\",12,12,0,5,,,0x0a,9977554432,1,1,,BELUR_EXT_EMS,\"SIP,474775912_58694532@10.54.26.102,%22sipp%22\u003csip:sipp@10.54.26.102:5060\u003e;tag=gK0c7d662e,\u003csip:+19977554432@10.54.80.143:6781;user=phone\u003e;tag=4829SIPpTag0114382283,0,,,,sip:+19977554432@10.54.80.143:6781;user=phone,,,,sip:sipp@10.54.26.102:5060,sip:10.54.80.143:6781;transport=UDP,,,,,,,0,0,,0,0,,,,,,,,1,0,0,0,,,,,,,,0,,,,,,,,,0,,,,,,,,,,,\",,110,,,1,1,,,2,0x1C4C8168,0,,,,,,0,,,,1,,,,,,,6,,,,\"sipp\",2,1,1,1,1,,0,,,1,7,1,,,10.54.24.102,10.54.80.140,60121,16,8,,,,,,,,,0,,,TANDEM,,,,,,,13,1,,,,,,,,,,,,,,,,0,,,,,,,,0,,,\"2,1,0,3\",0,,,,,,,,,,,,,,,,,,,,,,,,,,","offset":1897501,"source":"/var/log/sonus/sbx/evlog/1004A6C.ACT","type":"log"}"

As seen above in the output message it has skipped some portion of the input from beginning and another concerning part is that the input message is not in the actual sequence. It has stitched the message string in some random fashion. Also the output message is in encoded form.I tried encoding set to plain, utf-8, utf-16be-bom etc but in vain. All I want is that the input message parsed to be returned as it is in the output. How could I do that?

So would you describe as lines are partially truncated? We've identified an issue in rc2 with very long log lines being incomplete. Updating to most recent nightlies should solve it.

Link for nightlies: https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/

@vinod8427 Can you post your config here? What is the encoding of your files?

The config used is as below:

filebeat:
prospectors:
path:
- /var/log/sonus/sbx/evlog/*.ACT
encoding: plain # Also tried utf-8 & utf-16be-bom
ignore_older: 20s
scan_frequency: 5s
harvester_buffer_size: 2000000
output
logstash:
hosts: ["localhost:5044"]

I have taken the input message and have tried to explain how the output message has been formed from it (See comments section in the image).

so utf-16be-bom is not supported anymore. You can either try utf-16be (default utf16 over network) or utf-16le (default utf16 on windows) .

Having the original log file it would be somewhat easier. Which operating system are you running on? Who is producing this log file (csv SIP CDRs?) ? Which locale is configured?

digging into the lines I found this:

original:

SIP,1-32415@10.54.80.140,%22sipp%22;tag=32415SIPpTag001,%22sut%22;tag=gK0cfd65f5,0,,,,sip:9977554432@10.54.24.102:5060,,,,sip:sipp@10.54.80.140:5555

now 'truncated' one:

SIP,1-32415@10.54.80.140,%22sipp%22\u003csip:sipp@10.54.80.140

Interestingly the ';' in first line is replaced by '\u003c' in unicode. The replacement is kinda correct, as '\u003c' stands for '<'. But no idea why string contains '<' character. From this point everything get's lost until the address: sip:sipp@10.54.80.140.

If you use plain, the bytes are read as is. It is the json encoder escaping bad byte sequences. The decoder (configured by encoding) is responsible for reading files and translating all symbols into valid utf-8 for the json encoder not escaping wrong sequences.

I am using Linux. The log files are SIP CDR's and generated by running some scripts. Locale used is IST. BTW I tried using utf-16be and its the same result I notice. Accordingly I guess I must be using plain encoding but as you mentioned the json encoder is messing things up. Logstash shipper used to handle this fine but I would like to migrate to filebeat instead. Please let me know if you require any further information from my end.

@vinod8427 Can you send me one of these log files?

So I created some tests locally about reading and writing said line and tests with multiple lines of random length, but can not reproduce the issue. Also json encoder seems to be fine with the line. Just press 'Run' on the sample encoding your line to json: http://play.golang.org/p/Jee4wQst5t

Maybe it's some interplay with other lines in your log file?

can you get us the output of

$ locale

?

locale output is as below:
LANG=en_US
LANGUAGE=
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=

Shortly will be sending a mail with the attached log file. Sorry for the delay.

Sent a mail to Nicolas with the log attached. Steffen, could you message me your mail id?

@vinod8427 Thanks, got the e-mail. I forwarded it to Steffen. We will have a look and get back to you.

@vinod8427 I'm currently trying to reproduce this with the log file. I'm using FB -> LS -> ES. What is your setup? You list Redis above? Can you post your LS config?

I tried with both redis and LS and both received the same output. My LS config is pretty simple for the time being which is as below:

input {
#redis { host =>"localhost" port =>"6379" data_type => "list" key =>"filebeat" threads => 2 }
beats {
port => 5044
}
}
filter {
grok { match => { "message" => "^START|^STOP|^ATTEMPT|^INTERMEDIATE|^REBOOT" } }
}
output {
stdout { codec=>rubydebug }
}

BTW my current config is FB->LS. If it goes well it would be FB->LS->ES

Can you also post your exact operating system? I think I have to test it on the same OS.

Does this happen for all events (means as soon as you start filebeat) or does it only happen after a certain time?

Its Debian OS 7.8. It happens everytime. Some events has slightly different outputs and never it is correct but the problem statement still holds the same for most of it.

Forgot to add to the list above: Exact LS and ES version.