Problems parsing XML log files with the Logstash XML filter

Hello everyone,

I have tried multiple approaches for parsing my logs, which are XML files, into JSON with Logstash.
A log file looks like this:

<log level="INFO" time="Wed Sep 09 09:18:48 EDT 2015" timel="1441804728245" id="123456789" cat="COMMUNICATION" comp="" host="127.0.0.0.1" req="" app="" usr="" thread="" origin="">
	<msg>
		<![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[3540], ntCoent-Length=[6660], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:426/CSI:-/Me:0/Total:426]]>
	</msg>
	<info>
	</info>
	<excp>
	</excp>
</log>
<log level="INFO" time="Wed Sep 09 09:18:48 EDT 2015" timel="1441804728245" id="123456789" cat="COMMUNICATION" comp="" host="127.0.0.0.1" req="" app="" usr="" thread="" origin="">
	<msg>
		<![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[3540], ntCoent-Length=[6660], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:426/CSI:-/Me:0/Total:426]]>
	</msg>
	<info>
	</info>
	<excp>
	</excp>
</log>
<log level="INFO" time="Wed Sep 09 09:18:48 EDT 2015" timel="1441804728245" id="123456789" cat="COMMUNICATION" comp="" host="127.0.0.0.1" req="" app="" usr="" thread="" origin="">
	<msg>
		<![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[3540], ntCoent-Length=[6660], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:426/CSI:-/Me:0/Total:426]]>
	</msg>
	<info>
	</info>
	<excp>
	</excp>
</log>

Each log file contains multiple log entries (three in this example).
I tried this configuration:

input {
  file {
    path => "/path/to/file.log.*"
    start_position => "beginning"
  }
}

filter {
  multiline {
    pattern => "<log"
    negate => "true"
    what => "previous"
  }
  xml {
    store_xml => "false"
    source => "message"
    xpath => [
      "/log/at.level", "level",
      "/log/at.time", "time",
      "/log/at.timel", "timel",
      "/log/at.id", "id",
      "/log/at.cat", "cat",
      "/log/at.comp", "comp",
      "/log/at.host", "host",
      "/log/at.req", "req",
      "/log/at.app", "app",
      "/log/at.usr", "usr",
      "/log/at.thread", "thread",
      "/log/at.origin", "origin",
      "/log/msg/text()", "msg_txt"
    ]
  }
}

output {
  elasticsearch {
    hosts => "localhost:9200"
  }
}

Of course, each "at." placeholder must be replaced with an actual at sign (@).
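For reference, with real at signs the first few xpath entries would look like this (abbreviated to three entries):

```
xpath => [
  "/log/@level", "level",
  "/log/@time", "time",
  "/log/msg/text()", "msg_txt"
]
```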
But when I run Logstash, it produces really weird output, and as a consequence Elasticsearch cannot read it.
Do you have any suggestions as to what I am missing?

Best regards,
Oemer

But when I run Logstash, it produces really weird output.

Could you be more specific? To start with, is the multiline filter producing correctly joined lines?
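A quick way to check is to temporarily swap the elasticsearch output for a stdout output and inspect the events; each printed event should then contain one complete log entry (from <log ...> to </log>) in its message field:

```
output {
  stdout {
    codec => rubydebug
  }
}
```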

Hello Magnus,
thank you for your reply.

The output looks like this (excerpt):

Defaulting filter worker threads to 1 because there are some filters that might not work with multiple worker threads {:count_was=>4, :filters=>["multiline"], :level=>:warn}
Logstash startup completed
2015-12-09T21:11:24.687Z host T, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[4218], ntCoent-Length=[7592], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:453/CSI:-/Me:1/Total:454]]></msg><info></info><excp></excp></log>
2015-12-09T21:12:54.906Z host <log level="INFO" time="Tue Sep 08 17:41:27 EDT 2015" timel="1441748487311" id="123456789" cat="COMMUNICATION" comp="CNGW (WEB)" host="Test" req="" app="" usr="" thread="" origin=""><msg><![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[10270], ntCoent-Length=[19922], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:644/CSI:-/Me:0/Total:644]]></msg><info></info><excp></excp></log>
2015-12-09T21:12:54.908Z host TestTest<log level="INFO" time="Tue Sep 08 17:41:27 EDT 2015" timel="1441748487460" id="123456789" cat="COMMUNICATION" comp="CNGW (WEB)" host="Test" req="" app="" usr="" thread="" origin=""><msg><![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[4424], ntCoent-Length=[7944], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:561/CSI:-/Me:0/Total:561]]></msg><info></info><excp></excp></log>

After a certain time the output looks like this:
TestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTest

Maybe my understanding is not correct.
I use multiline because each file contains multiple log tags, meaning one event should be created per log entry (in relational-database terms, one row).

Best regards

Where does "Test" come from? That string doesn't appear in your configuration, AFAICT.

Test comes from here:
<![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[3540], ntCoent-Length=[6660], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:426/CSI:-/Me:0/Total:426]]>
It is the value of the first attribute within the URL, "provider".

The question is also whether I need to specify a type or create a mapping in Elasticsearch, because at the moment I haven't configured anything on the Elasticsearch side.

The question is also whether I need to specify a type or create a mapping in Elasticsearch, because at the moment I haven't configured anything on the Elasticsearch side.

You don't have to, but often the default mappings aren't a perfect fit.
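If you do want control over the mappings, one option is to let the elasticsearch output install a custom index template for you (the file path below is just a placeholder for a template JSON file you would write yourself):

```
output {
  elasticsearch {
    hosts => "localhost:9200"
    manage_template => true
    template => "/path/to/my-template.json"  # placeholder path
    template_overwrite => true
  }
}
```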
