Management of illegal characters

Hello Everyone,
I am in the process of converting an "in-house" Delphi application that receives an XML string over a UDP port and sends it to an Elasticsearch instance, converting the XML to JSON and sending it through the _bulk API.

Now I am trying to receive the XML string with the UDP input plugin, and it turns out that this XML string contains illegal characters, such as #0 (the NUL string terminator).
I am using this configuration:

input {
	udp {
		port => 517
	}
}
filter {
	xml {
		force_array => false
		source => "message"
		target => "myxml"
	}
}
output {
	file {
		path => "/log_streaming/my_app/records/log-%{+yyyy-MM-dd_HH.mm.ss.SSS}.log"	
		codec => line { format => "%{myxml}" }
	}
}

When everything is well formed, I receive data in this format:

{
    "APPVERSION": "1.0.1.11",
    "EVENTDATETIME": "04/14/2025 18:38:20:203",
    "EVENTNAME": "TestEvent\n2\n04/14/2025 18:38:20:38",
    "APPLICATION": "TESTUDPLOGGER",
    "HOST": "FRANCESCOE-RMT",
    "EVENTINFO": "04/14/2025 18:38:20:38",
    "LINENO": "1",
    "INSTANCEID": "BD2051FB-525E-49CD-BEDB-3DEF967ADCFB",
    "SEVERITY": "0",
    "THREADID": "13852",
    "EVENTSEQNO": "1"
}

But if a #0 is received, I get this error:

Illegal character "\u0000" in raw string "04/15/2025 11:55:55:55\u0000ben\u0000frank\u0000sue"

Writing what I receive to disk (with the filter section removed), I get:

{"@timestamp":"2025-04-14T17:52:37.590740500Z","event":{"original":"<EVENT><HOST>FRANCESCOE-RMT</HOST><INSTANCEID>65C3FEC1-B288-437F-B0C3-8CA3EB1956EC</INSTANCEID><APPLICATION>TESTUDPLOGGER</APPLICATION><THREADID>7080</THREADID><APPVERSION>1.0.1.11</APPVERSION><LINENO>1</LINENO><EVENTSEQNO>1</EVENTSEQNO><EVENTDATETIME>04/14/2025 13:52:37:587</EVENTDATETIME><SEVERITY>0</SEVERITY><EVENTNAME>TestEvent\r\n1\r\n04/14/2025 13:52:37:52</EVENTNAME><EVENTINFO>04/14/2025 13:52:37:52\u0000ben\u0000frank\u0000sue</EVENTINFO></EVENT>"},"host":{"ip":"127.0.0.1"},"@version":"1","message":"<EVENT><HOST>FRANCESCOE-RMT</HOST><INSTANCEID>65C3FEC1-B288-437F-B0C3-8CA3EB1956EC</INSTANCEID><APPLICATION>TESTUDPLOGGER</APPLICATION><THREADID>7080</THREADID><APPVERSION>1.0.1.11</APPVERSION><LINENO>1</LINENO><EVENTSEQNO>1</EVENTSEQNO><EVENTDATETIME>04/14/2025 13:52:37:587</EVENTDATETIME><SEVERITY>0</SEVERITY><EVENTNAME>TestEvent\r\n1\r\n04/14/2025 13:52:37:52</EVENTNAME><EVENTINFO>04/14/2025 13:52:37:52\u0000ben\u0000frank\u0000sue</EVENTINFO></EVENT>"}

As you can see, it was converted to "\u0000". I need to convert #0, #13#10, #13 and #10 to a single space character. How can I do that?

Sorry everyone, it was simply:

input {
	udp {
		port => 517
	}
}
filter {
	mutate { gsub => [ "message", "\u0000", "[0x00]" ] }
	mutate { gsub => [ "message", "\r\n", "[0x01]" ] }
	mutate { gsub => [ "message", "\r", "[0x02]" ] }
	mutate { gsub => [ "message", "\n", "[0x03]" ] }
	xml {
		force_array => false
		source => "message"
		target => "myxml"
	}
}
output {
	file {
		path => "/log_streaming/my_app/records/log-%{+yyyy-MM-dd_HH.mm.ss.SSS}.log"	
		codec => line { format => "%{myxml}" }
	}
}
You can do that with a single gsub:

	mutate { gsub => [ "message", "[\r\n^@]", " " ] }

That ^@ is a literal NUL character. In vim I can type it using Ctrl/v Ctrl/Shift/2
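If you would rather not embed a literal control character in the configuration, the same replacement should also work with the \u0000 escape inside the pattern, as in your own gsubs above, for example:

	mutate { gsub => [ "message", "[\r\n]|\u0000", " " ] }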

I then get

        "EVENTNAME" => "TestEvent1 04/14/2025 14:11:24:11",
        "EVENTINFO" => "04/14/2025 14:11:24:11 ben frank sue"

using the example from your other thread.

Okay, that's good, @Badger. I have used my own formatting to stay consistent with what I had before in my Delphi application, but your solution works.

{
    "APPVERSION": "1.0.1.12",
    "EVENTDATETIME": "04/15/2025 15:00:41:955",
    "EVENTNAME": "TestEvent[0x01]9[0x01]04/15/2025 15:00:41:00",
    "APPLICATION": "TESTUDPLOGGER",
    "EVENTINFO": "04/15/2025 15:00:41:00[0x00]ben[0x01]frank[0x02]sue[0x03]john",
    "HOST": "FRANCESCOE-RMT",
    "LINENO": "1",
    "INSTANCEID": "CA61547F-3154-48A9-A50D-E650D9243F8E",
    "SEVERITY": "0",
    "THREADID": "16912",
    "EVENTSEQNO": "1"
}

This is my file now.

If I want to send it to Elasticsearch, shouldn't it be enough to enable the elasticsearch output plugin like this?

input {
	udp {
		port => 517
	}
}
filter {
	mutate { gsub => [ "message", "\u0000", "[0x00]" ] }
	mutate { gsub => [ "message", "\r\n", "[0x01]" ] }
	mutate { gsub => [ "message", "\r", "[0x02]" ] }
	mutate { gsub => [ "message", "\n", "[0x03]" ] }
	xml {
		force_array => false
		source => "message"
		target => "myxml"
	}
}
output {
	file {
		path => "/log_streaming/my_app/records/log-%{+yyyy-MM-dd_HH.mm.ss.SSS}.log"
		codec => line { format => "%{myxml}" }
	}
	elasticsearch {
		hosts => "https://myhost:9200"
		user => "myuser"
		password => "mypassword"
		ssl_certificate_verification => "false"
		index => "delphi-processlog_write"
		codec => line { format => "%{myxml}" }
	}
}

I ask because I see my data indexed into fields named "myxml.<my field>".

Every output lets you specify the codec option because it is defined in the base output class that they all extend. But that doesn't mean they use the codec when sending the event to the destination.

The elasticsearch output ignores the codec option and formats the event as a JSON string, because that's what the _bulk API requires.
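In practice that means the codec option on the elasticsearch output can simply be dropped, since it has no effect there, while the file output can keep it. A sketch of the output section, reusing your settings:

output {
	file {
		path => "/log_streaming/my_app/records/log-%{+yyyy-MM-dd_HH.mm.ss.SSS}.log"
		codec => line { format => "%{myxml}" }
	}
	elasticsearch {
		hosts => "https://myhost:9200"
		user => "myuser"
		password => "mypassword"
		ssl_certificate_verification => "false"
		index => "delphi-processlog_write"
	}
}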

If you use store_xml you have to specify a target. If you want to move the fields back up to the top level of the event, see this thread.
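One common way to do that promotion (a sketch, not necessarily what the linked thread does) is a ruby filter placed after the xml filter that copies every key of [myxml] to the top level and then removes the container field:

	ruby {
		code => '
			xml = event.get("myxml")
			if xml.is_a?(Hash)
				# copy each parsed XML field up to the top level of the event
				xml.each { |k, v| event.set(k, v) }
				# drop the now-redundant container field
				event.remove("myxml")
			end
		'
	}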

It works perfectly, thanks! Last question, @Badger: I receive a date in this format:

04/15/2025 16:28:09:361

That gives me this error:

Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"delphi-processlog_write", :routing=>nil}, {"eventseqno"=>"1", "exeversion"=>"1.0.1.12", "severity"=>"0", "@timestamp"=>2025-04-15T20:28:09.363306700Z, "apphost"=>"FRANCESCOE-RMT", "exename"=>"TESTUDPLOGGER", "eventname"=>"TestEvent[0x01]5[0x01]04/15/2025 16:28:09:28", "logeventdate"=>"04/15/2025 16:28:09:361", "eventinfo"=>"04/15/2025 16:28:09:28[0x00]ben[0x01]frank[0x02]sue[0x03]john", "instanceid"=>"0518D8D7-28CD-4AF3-99DC-8D4CE30CCC6E", "threadid"=>"23476"}], :response=>{"index"=>{"status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [logeventdate] of type [date] in document with id 'GkcjO5YBP4xpwjYARoME'. Preview of field's value: '04/15/2025 16:28:09:361'", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"failed to parse date field [04/15/2025 16:28:09:361] with format [yyyy/MM/dd HH:mm:ss.SSS||yyyy-MM-dd HH:mm:ss.SSS||strict_date_optional_time||epoch_millis]", "caused_by"=>{"type"=>"date_time_parse_exception", "reason"=>"date_time_parse_exception: Failed to parse with all enclosed parsers"}}}}}}

How can I make Elasticsearch accept this format, or how can I convert it?

Your elasticsearch index is configured to expect a logeventdate in one of the formats

yyyy/MM/dd HH:mm:ss.SSS||yyyy-MM-dd HH:mm:ss.SSS||strict_date_optional_time||epoch_millis

Your field actually has the format yyyy/MM/dd HH:mm:ss:SSS. I don't know if you can add an additional date format to an existing index (obviously you can update the index template so that future indexes will accept that format).

Otherwise, change the colon before the milliseconds to a full stop using

    mutate { gsub => [ "EVENTDATETIME", "(\d{2}):(\d{3})$", "\1.\2" ] }

Unfortunately, my format turns out to be

MM/dd/yyyy HH:mm:ss:SSS

So that's not enough... :frowning:

You could parse it using a date filter. I believe that will get sent to elasticsearch in an acceptable format.
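For example, a sketch (the field name logeventdate is taken from the error above, and the pattern assumes MM/dd/yyyy HH:mm:ss:SSS):

	date {
		match => [ "logeventdate", "MM/dd/yyyy HH:mm:ss:SSS" ]
		target => "logeventdate"
	}

Without a target the parsed value goes to @timestamp; with target => "logeventdate" the field is rewritten as an ISO 8601 timestamp, which matches the strict_date_optional_time format the index already accepts. Note that the filter assumes the Logstash host's local timezone unless you set the timezone option.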

If not, you can reformat it using ruby. See this thread.
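As a sketch of that approach (not necessarily what the linked thread shows), a ruby filter could rewrite the string into the yyyy-MM-dd HH:mm:ss.SSS form the index already accepts; again, the field name logeventdate is taken from the error above:

	ruby {
		init => "require 'time'"
		code => '
			v = event.get("logeventdate")
			if v
				begin
					# parse MM/dd/yyyy HH:mm:ss:SSS and re-emit in a format the index accepts
					t = Time.strptime(v, "%m/%d/%Y %H:%M:%S:%L")
					event.set("logeventdate", t.strftime("%Y-%m-%d %H:%M:%S.%L"))
				rescue ArgumentError
					event.tag("_logeventdateparsefailure")
				end
			end
		'
	}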

You could also reformat it using a more complex gsub, but that doesn't feel right to me.

mutate { gsub => [ "someField", "(\d{2})/(\d{2})/(\d{4}) (\d{2}:\d{2}:\d{2}):(\d{3})", "\3/\1/\2 \4.\5" ] }

Just looking at that makes my eyes bleed!