Problems parsing XML log files with the Logstash XML filter


(Ömer Uludağ) #1

Hello everyone,

I have tried several approaches to parsing my logs, which are XML files, into JSON with Logstash.
A log file looks like this:

<log level="INFO" time="Wed Sep 09 09:18:48 EDT 2015" timel="1441804728245" id="123456789" cat="COMMUNICATION" comp="" host="127.0.0.0.1" req="" app="" usr="" thread="" origin="">
	<msg>
		<![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[3540], ntCoent-Length=[6660], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:426/CSI:-/Me:0/Total:426]]>
	</msg>
	<info>
	</info>
	<excp>
	</excp>
</log>
<log level="INFO" time="Wed Sep 09 09:18:48 EDT 2015" timel="1441804728245" id="123456789" cat="COMMUNICATION" comp="" host="127.0.0.0.1" req="" app="" usr="" thread="" origin="">
	<msg>
		<![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[3540], ntCoent-Length=[6660], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:426/CSI:-/Me:0/Total:426]]>
	</msg>
	<info>
	</info>
	<excp>
	</excp>
</log>
<log level="INFO" time="Wed Sep 09 09:18:48 EDT 2015" timel="1441804728245" id="123456789" cat="COMMUNICATION" comp="" host="127.0.0.0.1" req="" app="" usr="" thread="" origin="">
	<msg>
		<![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[3540], ntCoent-Length=[6660], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:426/CSI:-/Me:0/Total:426]]>
	</msg>
	<info>
	</info>
	<excp>
	</excp>
</log>

A single log file contains multiple log entries (three in this example).
I tried this configuration:

input {
  file {
    path => "/path/to/file.log.*"
    start_position => "beginning"
  }
}
filter {
  multiline {
    pattern => "<log"
    negate => "true"
    what => "previous"
  }
  xml {
    store_xml => "false"
    source => "message"
    xpath => [
      "/log/at.level", "level",
      "/log/at.time", "time",
      "/log/at.timel", "timel",
      "/log/at.id", "id",
      "/log/at.cat", "cat",
      "/log/at.comp", "comp",
      "/log/at.host", "host",
      "/log/at.req", "req",
      "/log/at.app", "app",
      "/log/at.usr", "usr",
      "/log/at.thread", "thread",
      "/log/at.origin", "origin",
      "/log/msg/text()", "msg_txt"
    ]
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
  }
}

Of course, each "at." in the XPath expressions must be replaced with the at sign (@).
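For illustration only, here is a shortened sketch of the xml filter with the at sign written out. It keeps just two of the attribute entries from the configuration above and changes nothing else.

```conf
filter {
  xml {
    source => "message"
    store_xml => "false"
    # "/log/@name" selects an attribute of the <log> element;
    # "/log/msg/text()" selects the text content of <msg>
    xpath => [
      "/log/@level", "level",
      "/log/@time", "time",
      "/log/msg/text()", "msg_txt"
    ]
  }
}
```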
But when I start Logstash, it produces really weird output, and as a consequence Elasticsearch cannot read it.
Do you have any suggestions about what I am missing?

Best regards,
Oemer


(Magnus Bäck) #2

But when I start Logstash, it produces really weird output.

Could you be more specific? To start with, is the multiline filter producing correctly joined lines?
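One way to check, as a minimal sketch: temporarily swap the elasticsearch output for stdout with the rubydebug codec, which prints every event with all of its fields. If the multiline filter joins the lines as intended, the message field of each printed event should hold one complete <log>...</log> element.

```conf
output {
  # Debugging aid only: print each event in full instead of indexing it
  stdout { codec => rubydebug }
}
```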


(Ömer Uludağ) #3

Hello Magnus,
thank you for your reply.

The output looks like this (excerpt):

Defaulting filter worker threads to 1 because there are some filters that might not work with multiple worker threads {:count_was=>4, :filters=>["multiline"], :level=>:warn}
Logstash startup completed
2015-12-09T21:11:24.687Z host T, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[4218], ntCoent-Length=[7592], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:453/CSI:-/Me:1/Total:454]]></msg><info></info><excp></excp></log>
2015-12-09T21:12:54.906Z host <log level="INFO" time="Tue Sep 08 17:41:27 EDT 2015" timel="1441748487311" id="123456789" cat="COMMUNICATION" comp="CNGW (WEB)" host="Test" req="" app="" usr="" thread="" origin=""><msg><![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[10270], ntCoent-Length=[19922], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:644/CSI:-/Me:0/Total:644]]></msg><info></info><excp></excp></log>
2015-12-09T21:12:54.908Z host TestTest<log level="INFO" time="Tue Sep 08 17:41:27 EDT 2015" timel="1441748487460" id="123456789" cat="COMMUNICATION" comp="CNGW (WEB)" host="Test" req="" app="" usr="" thread="" origin=""><msg><![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[4424], ntCoent-Length=[7944], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:561/CSI:-/Me:0/Total:561]]></msg><info></info><excp></excp></log>

After a certain time the output looks like this:
TestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTestTest

Maybe my understanding is not correct.
I use multiline because my file contains multiple log tags. That is, one event should be created for each log (or, in relational database terms, one row).
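The one-event-per-log intent could also be sketched with the multiline codec on the file input instead of the multiline filter. This is an untested variant that reuses the pattern from the filter configuration. Because it needs no multiline filter, it would not trigger the filter-worker warning at the top of the output.

```conf
input {
  file {
    path => "/path/to/file.log.*"
    start_position => "beginning"
    # Lines that do not match "<log" are appended to the previous line,
    # so a new event starts at every line that contains "<log"
    codec => multiline {
      pattern => "<log"
      negate => true
      what => "previous"
    }
  }
}
```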

Best regards


(Magnus Bäck) #4

Where does "Test" come from? That string doesn't appear in your configuration AFAICT.


(Ömer Uludağ) #5

Test comes from here:
<![CDATA[Method=GET URL=http://test.de/24dsdf3=0TReq(provider=Test, Decoding_Feat=[], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[3540], ntCoent-Length=[6660], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:426/CSI:-/Me:0/Total:426]]>
It is the first attribute in the URL: "provider"


(Ömer Uludağ) #6

Test comes from the Provider attribute within the URL


(Ömer Uludağ) #7

Another question is whether I need to specify a type or create a mapping in Elasticsearch. At the moment, I haven't configured anything in Elasticsearch.


(Magnus Bäck) #8

Another question is whether I need to specify a type or create a mapping in Elasticsearch. At the moment, I haven't configured anything in Elasticsearch.

You don't have to, but often the default mappings aren't a perfect fit.

