Global tag from XML header

I'd like to apply a global tag to every message imported from the same log file, based on information found in the first five lines.

Background: I'm parsing XML files that were recorded on a remote system but are imported from a local source. Each XML file begins with five lines that declare the XML version, the stylesheet, and relevant information about the original source (name, ID, software version). After those five lines, the real messages begin. The main messages are being parsed perfectly, but I'd like to apply tags to them that reflect the source name, ID, and software version.

Ideally, each message would either have 3 additional fields for Laboratory, System ID, and Software Version, or 3 additional tags reflecting those values. I'm not concerned about the xml version or stylesheet info. At present, my config file will import the first five lines as separate messages, rather than as header info for the remainder of the log file. I've included the config file below for reference.

I've been bumbling about with "if" statements and grok filters with minimal success for hours now, so I'd appreciate any suggestions you might have.

Example XML log:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="automationErrorLog.xsl"?>
<!-- Laboratory: Dexters Lab -->
<!-- System ID: Brainalyzer 4000.1.26 -->
<!-- Software Version: v3.0 -->

<ErrorLog>
<Error timestamp="20180528090858" nodeID="4" nodetype="LAZR" errorcode="1234"  
moreinfo="Brain size too small ">
<Error timestamp="20180528090858" nodeID="7" nodetype="TNT" errorcode="5678" 
moreinfo="Deedee pushed self destruct">
</ErrorLog>

Logstash config:

input { 
file {
	path => "D:/Data/Kibana Imports/*/*/*.xml"
	start_position => "beginning"
	sincedb_path => "/dev/null"
}
}
filter {
if [path] =~ "ERROR" {
	xml {
		source => "message"
		target => "Error"
		xpath => ["/Error/@timestamp","ERRORtimestampString"]
	}
	mutate{
		replace => { "ERRORtimestampString" => "%{ERRORtimestampString[0]}" }
		remove_field => "host"
		remove_field => "@version"
	}
	date {
		match => ["ERRORtimestampString", "yyyyMMddHHmmss"]
		target => "@timestamp"
	}
	mutate{
		remove_field => "ERRORtimestampString"
	}
}
}
output {
    stdout {
        codec => rubydebug
    }
}

The file input reads input files line by line, so your xml filter depends on the exact formatting of the input. You should process each XML document in one piece. Examples of that have been posted in the past, but the short version (sketched below) is:

  • Use a multiline codec to join all lines of each input file into a single event.
  • Use one or more xml filters to extract the interesting parts of the XML document into fields. One of the resulting fields should be an array of objects that look like the kind of documents you want.
  • Use the split filter to process the aforementioned array and put each array element into an event of its own.
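
Something along these lines (untested against your files; the pattern and the "logentry" field name are just placeholders to illustrate the shape):

input { 
file {
	path => "D:/Data/Kibana Imports/*/*/*.xml"
	start_position => "beginning"
	sincedb_path => "/dev/null"
	# Keep appending lines to the same event until the closing root tag appears
	codec => multiline {
		pattern => "</ErrorLog>"
		negate => "true"
		what => "previous"
	}
}
}
filter {
# Pull every <Error> element out of the joined document into an array field
xml {
	source => "message"
	store_xml => "false"
	xpath => ["/ErrorLog/Error","logentry"]
}
# Emit one event per array element
split {
	field => "logentry"
}
}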

Hi magnusbaeck,

Thanks for the quick response. I had attempted to use the multiline codec before, but was not aware of the split filter. I have since resolved my issue. The only remaining concern is how timestamps are represented by stdout { codec => rubydebug } (see the end of this post).

Posting the code below for other future problem solvers' benefit:

[Path] is set to import XML logs from a local directory called "Kibana Imports", with XML files located two folders deep; Logstash will import all XML files from all folders and subfolders in this directory. The multiline codec concatenates every entry in an XML file into one message until it sees "</ErrorLog>", which occurs at the end of each XML document. The first xml filter parses the message contents into a useful field array, grok filters parse the XML comments into separate fields, and a second xml filter then parses the individual error entries. The result is that the commented fields are applied to every error entry.

Logstash config:

input { 
file {
	path => "D:/Data/Kibana Imports/*/*/*.xml"
	start_position => "beginning"
	sincedb_path => "/dev/null"
	codec => multiline {
		pattern => "</ErrorLog>"
		negate => "true"
		what => "previous"
	}	
}
}
filter {
# Extract every <Error> element from the joined document into the "logentry" array
xml {
	source => "message"
	store_xml => "false"
	xpath => ["/ErrorLog/Error","logentry"]
}

# Parse the XML header comments into separate fields
grok {
	break_on_match => "false"
	match => { "message" => "<!-- Laboratory: %{DATA:Laboratory} -->" }
	match => { "message" => "<!-- System ID: %{DATA:systemID} -->" }
	match => { "message" => "<!-- Software Version: %{DATA:version} -->" }
}
# Reuse the built-in @version and host fields for the software version and lab name
mutate {
	copy => { "version" => "@version" }
	copy => { "Laboratory" => "host" }
	remove_field => ["Laboratory", "version", "message", "path"]
}

# One event per error entry, then parse the individual entry
split {
	field => "[logentry]"
}
xml {
	source => "logentry"
	target => "Error"
	xpath => ["/Error/@timestamp","ERRORtimestampString"]
}

# xpath results are arrays; flatten the timestamp to a plain string before the date filter
mutate {
	copy => { "ERRORtimestampString[0]" => "ERRORtimestampString" }
}
date {
	match => ["ERRORtimestampString", "yyyyMMddHHmmss","ISO8601"]
	target => "@timestamp"
}
# Clean up working fields
mutate {
	remove_field => ["ERRORtimestampString", "logentry", "tags"]
}
}

output {
stdout {
	codec => rubydebug
}
}

Result:

[2018-07-10T08:44:47,101][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
{
          "host" => "Dexters Lab",
         "Error" => {
        "timestamp" => "20180528090858",
        "errorcode" => "1234",
         "moreinfo" => "Brain size too small",
         "Off-line" => "No",
         "nodetype" => "LAZR",
           "nodeID" => "4"
    },
      "@version" => "v3.0",
    "@timestamp" => 2018-05-28T13:08:58.000Z,
      "systemID" => "Brainalyzer 4000.1.26"
}
{
          "host" => "Dexters Lab",
         "Error" => {
        "timestamp" => "20180528090858",
        "errorcode" => "5678",
         "moreinfo" => "Deedee pushed self destruct",
         "Off-line" => "No",
         "nodetype" => "TNT",
           "nodeID" => "7"
    },
      "@version" => "v3.0",
    "@timestamp" => 2018-05-28T13:08:58.000Z,
      "systemID" => "Brainalyzer 4000.1.26"
}

However, one thing I would note is that the date does not seem to be represented accurately by the stdout { codec => rubydebug } command. The original timestamp for the first entry was 20180528090858. This should have been imported as 2018-05-28T09:08:58.000Z, but was represented as 2018-05-28T13:08:58.000Z. That said, I believe it may have been accurately parsed, because it is correct in Kibana:

@timestamp: May 28th 2018, 09:08:58.000

This behavior is expected. The date filter always converts timestamps to UTC, and Kibana by default adjusts UTC timestamps to the browser's timezone.
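
If the source system's clock were not in the same timezone as the machine running Logstash, the date filter's timezone option could be used to say how the raw value should be interpreted before the conversion. A minimal sketch, assuming (purely for illustration) that the source records local time in America/New_York:

date {
	match => ["ERRORtimestampString", "yyyyMMddHHmmss", "ISO8601"]
	target => "@timestamp"
	# Illustrative assumption: treat the raw value as America/New_York local time
	# before storing it in @timestamp as UTC. Use the source system's actual zone.
	timezone => "America/New_York"
}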

I am running into a slight issue using this approach.

The config works very well for smaller log files, such as the example shown above, and will even parse a variety of different log files (note the slight change to my pattern). However, the real XML files often contain upwards of 5,000 to 10,000 entries (the files are each limited to 1 MB in size). Initially, Logstash would parse some of the logs but then start adding the tag "multiline_codec_max_lines_reached". As such, I added "max_lines => 10000" to my multiline codec after the "what => "previous"" entry, as the biggest log file I've come across manually had 8,500 lines:

input { 
file {
	path => "D:/Data/Kibana Imports/*/*/*.xml"
	start_position => "beginning"
	sincedb_path => "/dev/null"
	codec => multiline {
		pattern => "(</ControlLog>|</ErrorLog>)"
		negate => "true"
		what => "previous"
		max_lines => 10000
	}	
}
}

That got rid of the tag, but now Logstash seems to stop parsing entries at random. I can't post the error output, because no error is posted. It is also not consistent: if I start Logstash again, it will stop at a completely different point, despite the "sincedb_path => "/dev/null"" option. It will also not move on to the next file. It seems like it should be doing something, because my CPU usage is spiking and my memory usage sits at a steady 7.3 GB out of 8 GB total, but nothing is printed to stdout{}.

Do you have any recommendations?

If you are running on Windows, which your path parameter suggests you are, then you should use 'sincedb_path => "NUL"'.
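
Applied to the input you posted, only the sincedb path changes:

input { 
file {
	path => "D:/Data/Kibana Imports/*/*/*.xml"
	start_position => "beginning"
	# NUL is the Windows equivalent of /dev/null
	sincedb_path => "NUL"
	codec => multiline {
		pattern => "(</ControlLog>|</ErrorLog>)"
		negate => "true"
		what => "previous"
		max_lines => 10000
	}
}
}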

If the log isn't giving you enough information to diagnose the problem then perhaps increasing the logging level will help.

Hi Badger,

Thanks for your input. I am using Windows and have changed the setting to sincedb_path => "NUL" per your recommendation; however, that has not resolved the issue.

I let it run for an hour and eventually it posted the following error:

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid3556.hprof ...
Heap dump file created [2864134638 bytes in 35.838 secs]
[2018-07-13T14:02:52,278][ERROR][org.logstash.Logstash    ] 
java.lang.OutOfMemoryError: Java heap space

I attempted to increase the heap size, but I don't know if I did it properly. Initially I altered the jvm.options file to read -Xms4g and -Xmx4g. When that didn't work, another post suggested Windows doesn't use the jvm.options file, so I put the following into my setup.bat file:

if "%LS_HEAP_SIZE%" == "" (
    set LS_HEAP_SIZE=4g
)

The result was java.lang.OutOfMemoryError again.

My initial configuration file did not experience this issue, but also did not globally apply the host, version, and systemID from the first few lines of the log file to the remaining entries. This newer configuration parses the logs correctly, but crashes with heap space errors.

Do you have any suggestions?

If you would like assistance with a different configuration it would be helpful to show what the new configuration looks like.

Is it possible there are files that end in neither </ControlLog> nor </ErrorLog>? Your file input would tail them forever, waiting for one of those patterns to end the event and accumulating ever more data. Could you try adding

auto_flush_interval => 600

to the multiline codec so that after 10 minutes it gives up on a file?
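
Applied to the codec from your last post, that would look roughly like this:

codec => multiline {
	pattern => "(</ControlLog>|</ErrorLog>)"
	negate => "true"
	what => "previous"
	max_lines => 10000
	# Flush whatever has accumulated if no closing tag arrives within 600 seconds
	auto_flush_interval => 600
}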
