I have an XML file that doesn't parse correctly unless I first prettify it with something like xmllint. The file is written line by line with no indentation, and for whatever reason my Logstash config doesn't parse it properly, yet it works once I format the file with xmllint. Is there a way to handle this natively in Logstash?
Thanks
What does the original XML look like? What does it look like after being processed by xmllint?
Thanks for replying, Magnus. The data before xmllint looks like the following, where some opening XML elements are all on the same line and nothing is spaced properly:
<?xml version="1.0" encoding="UTF-8"?> <system><Processes><threadinfo> <threadinfoIndex>1</threadinfoIndex> <threadinfoId>3552</threadinfoId> <Version>2.20.1</Version> <Description>threadinfo tracker</Description> <mods> <mod> <Timestamp>131341024052549184</Timestamp> <BaseAddress>0x12d0000</BaseAddress> <Size>2077768</Size> <Path>blah.exe</Path> <Version>2.20.1</Version> <Company></Company> <Description>threadinfo tracker</Description> </mod> </mods> </threadinfo> </Processes> </system>
After xmllint formatting, it looks like this:
<?xml version="1.0" encoding="UTF-8"?> <system> <Processes> <threadinfo> <threadinfoIndex>1</threadinfoIndex> <threadinfoId>3552</threadinfoId> <Version>2.20.1</Version> <Description>threadinfo tracker</Description> <mods> <mod> <Timestamp>131341024052549184</Timestamp> <BaseAddress>0x12d0000</BaseAddress> <Size>2077768</Size> <Path>blah.exe</Path> <Version>2.20.1</Version> <Company></Company> <Description>threadinfo tracker</Description> </mod> </mods> </threadinfo> </Processes> </system>
For the most part, the formatted version processes okay with my multiline/xpath splitting, although I'm having problems ignoring the xpaths containing mod or mods; that's probably best raised in another thread after I've tinkered with it some more. So, is there a way to get Beats or Logstash to read in the unformatted lines without needing to format the file first? Does XPath require the elements to be pretty-printed?
Thanks
Whitespace between elements is not significant in XML, so it's very hard to believe that Logstash's XML parser is having problems with the first XML document. What undesired behavior are you seeing? What does your configuration look like?
Thanks again, Magnus, I appreciate your help. Sorry for the week-long delay in responding. My config looks like the following:
input {
  file {
    path => "/tmp/single-unformatted.xml"
    start_position => beginning
    sincedb_path => "NUL"
    codec => multiline {
      pattern => "^<\?threadinfo .*\>"
      negate => true
      what => "previous"
    }
  }
}
filter {
  if [message] == "<system>" or [message] == "<Processes>" {
    drop {}
  }
  xml {
    source => "message"
    target => "xml_parsed"
    store_xml => "false"
    xpath => ["/system/Processes/threadinfo", "threadinfo"]
  }
  mutate {
    remove_field => ["message", "xml_parsed"]
  }
  split {
    field => "[threadinfo]"
  }
  xml {
    source => "threadinfo"
    store_xml => "false"
    xpath => [
      "/threadinfo/threadinfoIndex/text()", "threadinfoIndex",
      "/threadinfo/threadinfoId/text()", "threadinfoId",
      "/threadinfo/Version/text()", "Version",
      "/threadinfo/Description/text()", "Description"
    ]
  }
  mutate {
    convert => {
      "threadinfoIndex" => "integer"
      "threadinfoId" => "integer"
    }
    remove_field => ["threadinfo"]
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "logstash-"
    document_type => "threads"
  }
  stdout { codec => rubydebug }
}
If I run it on the whitespace-formatted file, I get the following:
{ "_index": "logstash-", "_type": "threads", "_id": "AVsX05PMqINox8YUDLc3", "_score": null, "_source": { "path": "/tmp/single.xml", "@timestamp": "2017-03-29T02:09:53.346Z", "Description": [ "threadinfo tracker" ], "threadinfoId": [ 3552 ], "Version": [ "2.20.1" ], "threadinfoIndex": [ 1 ], "@version": "1", "host": "computer", "tags": [ "multiline" ] }, "fields": { "@timestamp": [ 1490753393346 ] }, "sort": [ 1490753393346 ] }
When I run it on the unformatted file, I get the following:
{ "_index": "logstash-", "_type": "threads", "_id": "AVsX0Ag_qINox8YUDLc2", "_score": null, "_source": { "path": "/tmp/single-unformatted.xml", "@timestamp": "2017-03-29T02:06:01.020Z", "@version": "1", "host": "computer", "tags": [ "multiline", "_split_type_failure" ] }, "fields": { "@timestamp": [ 1490753161020 ] }, "sort": [ 1490753161020 ] }
I am only interested in the threadinfo elements, not their sub-elements like mods and mod. Am I splitting and parsing this incorrectly? I assume so, given the _split_type_failure tag, but I can't figure out the proper way to tackle the parsing.
Thanks again!
I don't understand what you're trying to do with the multiline codec, and I suspect that's where the problem is. I don't see how the ^<\?threadinfo .*\> pattern can match anything in your XML file. If I change the input file and the multiline pattern to make sure the XML document is emitted as expected, things work just fine:
$ cat test.config
# input { stdin {} }
input {
  file {
    path => "/tmp/trash.e4Tt/single-unformatted.xml"
    start_position => beginning
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^<\?xml version"
      negate => true
      what => "previous"
    }
  }
}
output { stdout { codec => rubydebug } }
filter {
  xml {
    source => "message"
    store_xml => false
    xpath => [
      "/system/Processes/threadinfo/threadinfoIndex/text()", "threadinfoIndex",
      "/system/Processes/threadinfo/threadinfoId/text()", "threadinfoId",
      "/system/Processes/threadinfo/Version/text()", "Version",
      "/system/Processes/threadinfo/Description/text()", "Description"
    ]
  }
}
$ cat single-unformatted.xml
<?xml version="1.0" encoding="UTF-8"?>
<system><Processes><threadinfo>
<threadinfoIndex>1</threadinfoIndex>
<threadinfoId>3552</threadinfoId>
<Version>2.20.1</Version>
<Description>threadinfo tracker</Description>
<mods>
<mod>
<Timestamp>131341024052549184</Timestamp>
<BaseAddress>0x12d0000</BaseAddress>
<Size>2077768</Size>
<Path>blah.exe</Path>
<Version>2.20.1</Version>
<Company></Company>
<Description>threadinfo tracker</Description>
</mod>
</mods>
</threadinfo>
</Processes>
</system>
<?xml version="1.0" encoding="UTF-8"?>
$ /opt/logstash/bin/logstash -f test.config
Settings: Default pipeline workers: 8
Pipeline main started
{
"@timestamp" => "2017-03-29T05:24:17.015Z",
"message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<system><Processes><threadinfo>\n<threadinfoIndex>1</threadinfoIndex>\n<threadinfoId>3552</threadinfoId>\n<Version>2.20.1</Version>\n<Description>threadinfo tracker</Description>\n<mods>\n<mod>\n<Timestamp>131341024052549184</Timestamp>\n<BaseAddress>0x12d0000</BaseAddress>\n<Size>2077768</Size>\n<Path>blah.exe</Path>\n<Version>2.20.1</Version>\n<Company></Company>\n<Description>threadinfo tracker</Description>\n</mod>\n</mods>\n</threadinfo>\n</Processes>\n</system>",
"@version" => "1",
"tags" => [
[0] "multiline"
],
"path" => "/tmp/trash.e4Tt/single-unformatted.xml",
"host" => "lnxolofon",
"threadinfoIndex" => [
[0] "1"
],
"threadinfoId" => [
[0] "3552"
],
"Version" => [
[0] "2.20.1"
],
"Description" => [
[0] "threadinfo tracker"
]
}
^CSIGINT received. Shutting down the agent. {:level=>:warn}
stopping pipeline {:id=>"main"}
{
"@timestamp" => "2017-03-29T05:24:38.430Z",
"message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
"@version" => "1",
"path" => "/tmp/trash.e4Tt/single-unformatted.xml",
"host" => "lnxolofon"
}
Pipeline main has been shutdown
Thanks, Magnus, and apologies for the confusion on my end and for not being clear. I initially had a similar config, but I added the ^<\?threadinfo .*\> pattern because some of my input files have more than one threadinfo element, and I would like to treat each one as a separate event/record. When I have a file with multiple threadinfo elements, it is inserted as one event. For example:
..snip..
"_source": {
"path": "/tmp/multiple-unformatted.xml",
"@timestamp": "2017-03-29T14:53:08.492Z",
"Description": [
"threadinfo tracker",
"threadinfo tracker 2"
],
"threadinfoId": [
"3552",
"4444"
],
"Version": [
"2.20.1",
"2.20.1"
],
"threadinfoIndex": [
"1",
"2"
],
..snip..
Input file example:
<?xml version="1.0" encoding="UTF-8"?>
<system><Processes><threadinfo>
<threadinfoIndex>1</threadinfoIndex>
<threadinfoId>3552</threadinfoId>
<Version>2.20.1</Version>
<Description>threadinfo tracker</Description>
<mods>
<mod>
<Timestamp>131341024052549184</Timestamp>
<BaseAddress>0x12d0000</BaseAddress>
<Size>2077768</Size>
<Path>blah.exe</Path>
<Version>2.20.1</Version>
<Company></Company>
<Description>threadinfo tracker</Description>
</mod>
</mods>
</threadinfo>
<threadinfo>
<threadinfoIndex>2</threadinfoIndex>
<threadinfoId>4444</threadinfoId>
<Version>2.20.1</Version>
<Description>threadinfo tracker 2</Description>
<mods>
<mod>
<Timestamp>131341026052549184</Timestamp>
<BaseAddress>0x12d0000</BaseAddress>
<Size>2077768</Size>
<Path>blah.exe</Path>
<Version>2.20.1</Version>
<Company></Company>
<Description>threadinfo tracker</Description>
</mod>
</mods>
</threadinfo>
</Processes>
</system>
I also tried configs like:
input {
  file {
    path => "/tmp/multiple-unformatted.xml"
    start_position => beginning
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^\<threadinfo\>"
      negate => true
      what => "previous"
    }
  }
}
filter {
  xml {
    source => "message"
    store_xml => false
    xpath => [
      "/system/Processes/threadinfo/threadinfoIndex/text()", "threadinfoIndex",
      "/system/Processes/threadinfo/threadinfoId/text()", "threadinfoId",
      "/system/Processes/threadinfo/Version/text()", "Version",
      "/system/Processes/threadinfo/Description/text()", "Description",
      "/threadinfo/threadinfoIndex/text()", "threadinfoIndex",
      "/threadinfo/threadinfoId/text()", "threadinfoId",
      "/threadinfo/Version/text()", "Version",
      "/threadinfo/Description/text()", "Description"
    ]
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "logstash-"
    document_type => "threads"
  }
  stdout { codec => rubydebug }
}
Unfortunately, the last one doesn't work in all my test cases, and I haven't been able to determine all the causes.
How can I read in only certain elements and make each one an event/record, while ignoring certain elements like <mod>? I thought that if I split those xpath elements apart, fed the ones I needed into an array, dropped the rest, and split that array, it would work, but I am likely overthinking this.
Thanks again, and sorry for not being clearer earlier.
You want to parse the whole file as a single document, which isn't too easy with Logstash. There's a separate input plugin for that, but I don't believe it's ready yet. You need to find a multiline pattern that joins every single line together until the end of the file is reached. Perhaps this works?
codec => multiline {
  pattern => "</system>"
  what => "next"
  negate => true
}
That is, unless the line contains </system>, join it with the next line of input. Then you should get the whole document, and you can extract elements from it as you please using the xml filter. You may need to use a ruby filter to clean things up afterwards, but ignore that for now.
Thanks for all the help Magnus. I will keep messing with it and post back if I find something that is working for my data.
The following Logstash config works with your unlinted, multi-threadinfo input file example.
input {
  stdin {
    codec => multiline {
      pattern => "</system>"
      what => "next"
      negate => true
    }
  }
}
filter {
  xml {
    source => "message"
    store_xml => false
    xpath => [ "/system/Processes/threadinfo", "Process" ]
  }
  mutate {
    remove_field => ["message"]
  }
  split {
    field => "Process"
  }
  xml {
    source => "Process"
    target => "@metadata[xml_content]"
    force_array => false
  }
  # Copy the parsed XML content to top-level fields on the event
  ruby {
    code => '
      event.get("@metadata[xml_content]").each do |key, value|
        event.set(key, value)
      end
    '
  }
  mutate {
    remove_field => ["Process", "@metadata"]
  }
}
output {
  stdout {
    codec => rubydebug
  }
  stdout {
    codec => json_lines
  }
  elasticsearch {
    index => "test"
    document_type => "test"
  }
}
I invoked Logstash like this, redirecting the input XML file from stdin:
logstash -f system.conf < system.xml
I used the stdin input because, when I tried using file, Logstash would wait until I pressed Ctrl+C. @magnusbaeck, can you help? I've not used the file input before. I was using the multiline settings that you cited.
Output
Two events: one for each threadinfo in the input XML document.
Note that I tweaked the first threadinfo element in your example XML: I added a second mod child element to the mods element (for "blah2.exe", with a slightly different timestamp), just to see what the output would look like. I'm guessing that's why you have a mods element: because it's possible to have more than one mod per threadinfo? I'm not thrilled with the result (an array when there are two mod child elements, but no array when there's only one), but I've run out of time tonight to come up with a fix (no promises, but I think I probably can).
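If I do get to it, a minimal, untested sketch along these lines might work; the [mods][mod] field path is taken from the output below, and wrapping a lone hash in an array is just my own suggestion, not anything from the config above:
ruby {
  code => '
    # If only one mod element was present, the xml filter stores [mods][mod]
    # as a single hash; wrap it in an array so the shape is always the same.
    mod = event.get("[mods][mod]")
    event.set("[mods][mod]", [mod]) if mod && !mod.is_a?(Array)
  '
}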
For now, I’ve deliberately preserved the complete structure of your input XML. And I’ve not yet worried about trying to set @timestamp
based on your Timestamp
field (could be more than one per threadinfo
, right?).
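If I were to take a stab at it, a rough, untested sketch might look like the following. It assumes (purely a guess from the magnitude of the values) that Timestamp is a Windows FILETIME, i.e. a count of 100-nanosecond ticks since 1601-01-01 UTC, and that [mods][mod] has already been normalized to an array as in the earlier sketch:
ruby {
  code => '
    # Guess: treat Timestamp as a Windows FILETIME (100ns ticks since
    # 1601-01-01 UTC) and use the first mod entry as the event timestamp.
    ft = event.get("[mods][mod][0][Timestamp]")
    if ft
      epoch = (ft.to_i / 10_000_000) - 11_644_473_600
      event.set("@timestamp", LogStash::Timestamp.at(epoch))
    end
  '
}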
Here’s the output in JSON format:
{
"mods": {
"mod": [
{
"Path": "blah.exe",
"Description": "threadinfo tracker",
"Version": "2.20.1",
"Size": "2077768",
"Timestamp": "131341024052549184",
"BaseAddress": "0x12d0000"
},
{
"Path": "blah2.exe",
"Description": "threadinfo tracker",
"Version": "2.20.1",
"Size": "2077768",
"Timestamp": "131341024052549185",
"BaseAddress": "0x12d0000"
}
]
},
"@timestamp": "2017-03-31T14:35:09.341Z",
"Description": "threadinfo tracker",
"threadinfoId": "3552",
"Version": "2.20.1",
"threadinfoIndex": "1",
"@version": "1",
"host": "58a3fe88f636",
"tags": [
"multiline"
]
}
{
"mods": {
"mod": {
"Path": "blah.exe",
"Description": "threadinfo tracker",
"Version": "2.20.1",
"Size": "2077768",
"Timestamp": "131341026052549184",
"BaseAddress": "0x12d0000"
}
},
"@timestamp": "2017-03-31T14:35:09.341Z",
"Description": "threadinfo tracker 2",
"threadinfoId": "4444",
"Version": "2.20.1",
"threadinfoIndex": "2",
"@version": "1",
"host": "58a3fe88f636",
"tags": [
"multiline"
]
}
Let me know what you think.
I wonder if I can set the multiline pattern to an EOF character, and if that will solve this "hanging" (I'm guessing: waiting for more input) problem. I might try that tomorrow, unless @magnusbaeck chips in with advice in the meantime.
In the past, when I’ve had an XML document containing data that I needed to forward to Elasticsearch, I’ve written an XSLT style sheet that transforms the XML into JSON that I send directly to Elasticsearch via the HTTP bulk API, bypassing Logstash.
This is one reason your topic caught my eye: my XML documents also contained multiple elements that I wanted to split into individual events in Elasticsearch, selecting only some elements from the original XML. That was easy enough to do in XSLT; I’d wondered how straightforward it would be in Logstash. Your topic has given me an excuse to find out.
> I wonder if I can set the multiline pattern to an EOF character,

No, you can't do that.

> and if that will solve this "hanging" (I'm guessing: waiting for more input) problem.

Yeah, it's most likely tailing the log file. The start_position, sincedb_path, and ignore_older file input options can help resolve the problem.
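For example, something like this untested sketch (the path is hypothetical; on Windows you'd use "NUL" instead of "/dev/null"):
input {
  file {
    path => "/tmp/system.xml"        # hypothetical path
    start_position => beginning      # read pre-existing files from the top
    sincedb_path => "/dev/null"      # don't remember read positions between runs
    codec => multiline {
      pattern => "</system>"
      what => "next"
      negate => true
    }
  }
}
Keep in mind that the file input tails files by design, so Logstash will keep running after the document has been read; the event itself should be flushed as soon as the </system> line arrives.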