I have an XML file that doesn't parse correctly unless I first prettify it with something like xmllint. The file is written line by line with no indentation, and for whatever reason my Logstash config doesn't parse it properly, yet it works once I format the file with xmllint. Is there a way to handle this natively in Logstash?
Thanks
What does the original XML look like? What does it look like after being processed by xmllint?
Thanks for replying, Magnus. The data before xmllint looks like the following, where some opening XML elements are all on the same line and nothing is spaced properly:
<?xml version="1.0" encoding="UTF-8"?> <system><Processes><threadinfo> <threadinfoIndex>1</threadinfoIndex> <threadinfoId>3552</threadinfoId> <Version>2.20.1</Version> <Description>threadinfo tracker</Description> <mods> <mod> <Timestamp>131341024052549184</Timestamp> <BaseAddress>0x12d0000</BaseAddress> <Size>2077768</Size> <Path>blah.exe</Path> <Version>2.20.1</Version> <Company></Company> <Description>threadinfo tracker</Description> </mod> </mods> </threadinfo> </Processes> </system>
After xmllint formatting, it looks like this:
<?xml version="1.0" encoding="UTF-8"?> <system> <Processes> <threadinfo> <threadinfoIndex>1</threadinfoIndex> <threadinfoId>3552</threadinfoId> <Version>2.20.1</Version> <Description>threadinfo tracker</Description> <mods> <mod> <Timestamp>131341024052549184</Timestamp> <BaseAddress>0x12d0000</BaseAddress> <Size>2077768</Size> <Path>blah.exe</Path> <Version>2.20.1</Version> <Company></Company> <Description>threadinfo tracker</Description> </mod> </mods> </threadinfo> </Processes> </system>
For the most part, the formatted version processes okay with my multiline/xpath splitting, although I'm having problems ignoring the xpaths containing mod or mods; that's probably best raised in another thread after I've tinkered with it some more. So, is there a way to get Beats or Logstash to read in the unformatted lines without needing to format the file first? Does XPath require the elements to be pretty-printed?
Thanks
Whitespace between elements is not significant in XML, so it's very hard to believe that Logstash's XML parser is having problems with the first XML document. What undesired behavior are you seeing? What does your configuration look like?
Thanks again, Magnus, I appreciate your help. Sorry for the week-long delay in responding. My config looks like the following:
input {
  file {
    path => "/tmp/single-unformatted.xml"
    start_position => beginning
    sincedb_path => "NUL"
    codec => multiline {
      pattern => "^<\?threadinfo .*\>"
      negate => true
      what => "previous"
    }
  }
}
filter {
  if [message] == "<system>" or [message] == "<Processes>" {
    drop {}
  }
  xml {
    source => "message"
    target => "xml_parsed"
    store_xml => "false"
    xpath => ["/system/Processes/threadinfo", "threadinfo"]
  }
  mutate {
    remove_field => ["message", "xml_parsed"]
  }
  split {
    field => "[threadinfo]"
  }
  xml {
    source => "threadinfo"
    store_xml => "false"
    xpath => [
      "/threadinfo/threadinfoIndex/text()", "threadinfoIndex",
      "/threadinfo/threadinfoId/text()", "threadinfoId",
      "/threadinfo/Version/text()", "Version",
      "/threadinfo/Description/text()", "Description"
    ]
  }
  mutate {
    convert => {
      "threadinfoIndex" => "integer"
      "threadinfoId" => "integer"
    }
    remove_field => ["threadinfo"]
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "logstash-"
    document_type => "threads"
  }
  stdout { codec => rubydebug }
}
If I run it on the whitespace-formatted file, I get the following:
{ "_index": "logstash-", "_type": "threads", "_id": "AVsX05PMqINox8YUDLc3", "_score": null, "_source": { "path": "/tmp/single.xml", "@timestamp": "2017-03-29T02:09:53.346Z", "Description": [ "threadinfo tracker" ], "threadinfoId": [ 3552 ], "Version": [ "2.20.1" ], "threadinfoIndex": [ 1 ], "@version": "1", "host": "computer", "tags": [ "multiline" ] }, "fields": { "@timestamp": [ 1490753393346 ] }, "sort": [ 1490753393346 ] }
When I run it on the unformatted file, I get the following:
{ "_index": "logstash-", "_type": "threads", "_id": "AVsX0Ag_qINox8YUDLc2", "_score": null, "_source": { "path": "/tmp/single-unformatted.xml", "@timestamp": "2017-03-29T02:06:01.020Z", "@version": "1", "host": "computer", "tags": [ "multiline", "_split_type_failure" ] }, "fields": { "@timestamp": [ 1490753161020 ] }, "sort": [ 1490753161020 ] }
I am only interested in the threadinfo elements, not their sub-elements like mods and mod. Am I splitting and parsing this incorrectly? I assume so, given the _split_type_failure tag, but I can't figure out the proper way to tackle the parsing.
Thanks again!
I don't understand what you're trying to do with the multiline codec, and I suspect that's where the problem is. I don't see how the ^<\?threadinfo .*\> pattern can match anything in your XML file. If I change the input file and the multiline pattern to make sure the XML document is emitted as expected, things work just fine:
$ cat test.config
# input { stdin {} }
input {
  file {
    path => "/tmp/trash.e4Tt/single-unformatted.xml"
    start_position => beginning
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^<\?xml version"
      negate => true
      what => "previous"
    }
  }
}
output { stdout { codec => rubydebug } }
filter {
  xml {
    source => "message"
    store_xml => false
    xpath => [
      "/system/Processes/threadinfo/threadinfoIndex/text()", "threadinfoIndex",
      "/system/Processes/threadinfo/threadinfoId/text()", "threadinfoId",
      "/system/Processes/threadinfo/Version/text()", "Version",
      "/system/Processes/threadinfo/Description/text()", "Description"
    ]
  }
}
$ cat single-unformatted.xml
<?xml version="1.0" encoding="UTF-8"?>
<system><Processes><threadinfo>
<threadinfoIndex>1</threadinfoIndex>
<threadinfoId>3552</threadinfoId>
<Version>2.20.1</Version>
<Description>threadinfo tracker</Description>
<mods>
<mod>
<Timestamp>131341024052549184</Timestamp>
<BaseAddress>0x12d0000</BaseAddress>
<Size>2077768</Size>
<Path>blah.exe</Path>
<Version>2.20.1</Version>
<Company></Company>
<Description>threadinfo tracker</Description>
</mod>
</mods>
</threadinfo>
</Processes>
</system>
<?xml version="1.0" encoding="UTF-8"?>
$ /opt/logstash/bin/logstash -f test.config
Settings: Default pipeline workers: 8
Pipeline main started
{
"@timestamp" => "2017-03-29T05:24:17.015Z",
"message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<system><Processes><threadinfo>\n<threadinfoIndex>1</threadinfoIndex>\n<threadinfoId>3552</threadinfoId>\n<Version>2.20.1</Version>\n<Description>threadinfo tracker</Description>\n<mods>\n<mod>\n<Timestamp>131341024052549184</Timestamp>\n<BaseAddress>0x12d0000</BaseAddress>\n<Size>2077768</Size>\n<Path>blah.exe</Path>\n<Version>2.20.1</Version>\n<Company></Company>\n<Description>threadinfo tracker</Description>\n</mod>\n</mods>\n</threadinfo>\n</Processes>\n</system>",
"@version" => "1",
"tags" => [
[0] "multiline"
],
"path" => "/tmp/trash.e4Tt/single-unformatted.xml",
"host" => "lnxolofon",
"threadinfoIndex" => [
[0] "1"
],
"threadinfoId" => [
[0] "3552"
],
"Version" => [
[0] "2.20.1"
],
"Description" => [
[0] "threadinfo tracker"
]
}
^CSIGINT received. Shutting down the agent. {:level=>:warn}
stopping pipeline {:id=>"main"}
{
"@timestamp" => "2017-03-29T05:24:38.430Z",
"message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
"@version" => "1",
"path" => "/tmp/trash.e4Tt/single-unformatted.xml",
"host" => "lnxolofon"
}
Pipeline main has been shutdown
Thanks, Magnus, and apologies for the confusion on my end and for not being clear. I initially had a similar config, but I added the ^<\?threadinfo .*\> pattern because some of my input files have more than one threadinfo element, and I would like to treat each one as a separate event/record. When I have a file with multiple threadinfo elements, it is inserted as one event. For example:
..snip..
"_source": {
"path": "/tmp/multiple-unformatted.xml",
"@timestamp": "2017-03-29T14:53:08.492Z",
"Description": [
"threadinfo tracker",
"threadinfo tracker 2"
],
"threadinfoId": [
"3552",
"4444"
],
"Version": [
"2.20.1",
"2.20.1"
],
"threadinfoIndex": [
"1",
"2"
],
..snip..
Input file example:
<?xml version="1.0" encoding="UTF-8"?>
<system><Processes><threadinfo>
<threadinfoIndex>1</threadinfoIndex>
<threadinfoId>3552</threadinfoId>
<Version>2.20.1</Version>
<Description>threadinfo tracker</Description>
<mods>
<mod>
<Timestamp>131341024052549184</Timestamp>
<BaseAddress>0x12d0000</BaseAddress>
<Size>2077768</Size>
<Path>blah.exe</Path>
<Version>2.20.1</Version>
<Company></Company>
<Description>threadinfo tracker</Description>
</mod>
</mods>
</threadinfo>
<threadinfo>
<threadinfoIndex>2</threadinfoIndex>
<threadinfoId>4444</threadinfoId>
<Version>2.20.1</Version>
<Description>threadinfo tracker 2</Description>
<mods>
<mod>
<Timestamp>131341026052549184</Timestamp>
<BaseAddress>0x12d0000</BaseAddress>
<Size>2077768</Size>
<Path>blah.exe</Path>
<Version>2.20.1</Version>
<Company></Company>
<Description>threadinfo tracker</Description>
</mod>
</mods>
</threadinfo>
</Processes>
</system>
I also tried configs like:
input {
  file {
    path => "/tmp/multiple-unformatted.xml"
    start_position => beginning
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^\<threadinfo\>"
      negate => true
      what => "previous"
    }
  }
}
filter {
  xml {
    source => "message"
    store_xml => false
    xpath => [
      "/system/Processes/threadinfo/threadinfoIndex/text()", "threadinfoIndex",
      "/system/Processes/threadinfo/threadinfoId/text()", "threadinfoId",
      "/system/Processes/threadinfo/Version/text()", "Version",
      "/system/Processes/threadinfo/Description/text()", "Description",
      "/threadinfo/threadinfoIndex/text()", "threadinfoIndex",
      "/threadinfo/threadinfoId/text()", "threadinfoId",
      "/threadinfo/Version/text()", "Version",
      "/threadinfo/Description/text()", "Description"
    ]
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "logstash-"
    document_type => "threads"
  }
  stdout { codec => rubydebug }
}
Unfortunately, the last one doesn't work in all my test cases, and I haven't been able to determine all the causes.
How can I read in only certain elements and make each one an event/record, while ignoring certain elements like <mod>? I thought that if I split those xpath elements apart, fed the ones I needed into an array, dropped the rest, and split that array, it would work, but I am likely overthinking this.
Thanks again, and sorry for not being clearer earlier.
You want to parse the whole file as a single document, which isn't too easy with Logstash. There's a separate input plugin for that, but I don't believe it's ready yet. You need to find a multiline pattern that joins every single line together until the end of the file is reached. Perhaps this works?
codec => multiline {
  pattern => "</system>"
  what => "next"
  negate => true
}
That is, unless the line contains </system>, join it with the next line of input. Then you should get the whole document, and you can extract elements from it as you please using the xml filter. You may need to use a ruby filter to clean things up afterwards, but ignore that for now.
Thanks for all the help Magnus. I will keep messing with it and post back if I find something that is working for my data.
The following Logstash config works with your unlinted, multi-threadinfo input file example.
input {
  stdin {
    codec => multiline {
      pattern => "</system>"
      what => "next"
      negate => true
    }
  }
}
filter {
  xml {
    source => "message"
    store_xml => false
    xpath => [ "/system/Processes/threadinfo", "Process" ]
  }
  mutate {
    remove_field => ["message"]
  }
  split {
    field => "Process"
  }
  xml {
    source => "Process"
    target => "@metadata[xml_content]"
    force_array => false
  }
  # Copy the parsed XML content to top-level fields on the event
  ruby {
    code => '
      event.get("@metadata[xml_content]").each do |key, value|
        event.set(key, value)
      end
    '
  }
  mutate {
    remove_field => ["Process", "@metadata"]
  }
}
output {
  stdout {
    codec => rubydebug
  }
  stdout {
    codec => json_lines
  }
  elasticsearch {
    index => "test"
    document_type => "test"
  }
}
I invoked Logstash like this, redirecting the input XML file from stdin:
logstash -f system.conf < system.xml
I used the stdin input because, when I tried using file, Logstash would wait until I pressed Ctrl+C. @magnusbaeck, can you help? I've not used the file input before. I was using the multiline settings that you cited.
Output
Two events: one for each threadinfo in the input XML document.
Note that I tweaked the first threadinfo element in your example XML: I added a second mod child element to the mods element (for "blah2.exe", with a slightly different timestamp), just to see what the output would look like. I'm guessing that's why you have a mods element: because it's possible to have more than one mod per threadinfo? I'm not thrilled with the result (an array when there are two mod child elements, but no array when there's only one), but I've run out of time tonight to come up with a fix (no promises, but I think I probably can).
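If I do get to it, a minimal, untested sketch along these lines might work; the [mods][mod] field path is taken from the output below, and wrapping a lone hash in an array is just my own suggestion, not anything from the config above:
ruby {
  code => '
    # If only one mod element was present, the xml filter stores [mods][mod]
    # as a single hash; wrap it in an array so the shape is always the same.
    mod = event.get("[mods][mod]")
    event.set("[mods][mod]", [mod]) if mod && !mod.is_a?(Array)
  '
}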
For now, I’ve deliberately preserved the complete structure of your input XML. And I’ve not yet worried about trying to set @timestamp
based on your Timestamp
field (could be more than one per threadinfo
, right?).
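If I were to take a stab at it, a rough, untested sketch might look like the following. It assumes (purely a guess from the magnitude of the values) that Timestamp is a Windows FILETIME, i.e. a count of 100-nanosecond ticks since 1601-01-01 UTC, and that [mods][mod] has already been normalized to an array as in the earlier sketch:
ruby {
  code => '
    # Guess: treat Timestamp as a Windows FILETIME (100ns ticks since
    # 1601-01-01 UTC) and use the first mod entry as the event timestamp.
    ft = event.get("[mods][mod][0][Timestamp]")
    if ft
      epoch = (ft.to_i / 10_000_000) - 11_644_473_600
      event.set("@timestamp", LogStash::Timestamp.at(epoch))
    end
  '
}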
Here’s the output in JSON format:
{
"mods": {
"mod": [
{
"Path": "blah.exe",
"Description": "threadinfo tracker",
"Version": "2.20.1",
"Size": "2077768",
"Timestamp": "131341024052549184",
"BaseAddress": "0x12d0000"
},
{
"Path": "blah2.exe",
"Description": "threadinfo tracker",
"Version": "2.20.1",
"Size": "2077768",
"Timestamp": "131341024052549185",
"BaseAddress": "0x12d0000"
}
]
},
"@timestamp": "2017-03-31T14:35:09.341Z",
"Description": "threadinfo tracker",
"threadinfoId": "3552",
"Version": "2.20.1",
"threadinfoIndex": "1",
"@version": "1",
"host": "58a3fe88f636",
"tags": [
"multiline"
]
}
{
"mods": {
"mod": {
"Path": "blah.exe",
"Description": "threadinfo tracker",
"Version": "2.20.1",
"Size": "2077768",
"Timestamp": "131341026052549184",
"BaseAddress": "0x12d0000"
}
},
"@timestamp": "2017-03-31T14:35:09.341Z",
"Description": "threadinfo tracker 2",
"threadinfoId": "4444",
"Version": "2.20.1",
"threadinfoIndex": "2",
"@version": "1",
"host": "58a3fe88f636",
"tags": [
"multiline"
]
}
Let me know what you think.
I wonder if I can set the multiline pattern to an EOF character, and if that will solve this "hanging" (I'm guessing: waiting for more input) problem. I might try that tomorrow, unless @magnusbaeck chips in with advice in the meantime.
In the past, when I’ve had an XML document containing data that I needed to forward to Elasticsearch, I’ve written an XSLT style sheet that transforms the XML into JSON that I send directly to Elasticsearch via the HTTP bulk API, bypassing Logstash.
This is one reason your topic caught my eye: my XML documents also contained multiple elements that I wanted to split into individual events in Elasticsearch, selecting only some elements from the original XML. That was easy enough to do in XSLT; I’d wondered how straightforward it would be in Logstash. Your topic has given me an excuse to find out.
> I wonder if I can set the multiline pattern to an EOF character,

No, you can't do that.

> and if that will solve this "hanging" (I'm guessing: waiting for more input) problem.

Yeah, it's most likely tailing the log file. The start_position, sincedb_path, and ignore_older file input options can help resolve the problem.
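For example, something like this untested sketch (the path is hypothetical; on Windows you'd use "NUL" instead of "/dev/null"):
input {
  file {
    path => "/tmp/system.xml"        # hypothetical path
    start_position => beginning      # read pre-existing files from the top
    sincedb_path => "/dev/null"      # don't remember read positions between runs
    codec => multiline {
      pattern => "</system>"
      what => "next"
      negate => true
    }
  }
}
Keep in mind that the file input tails files by design, so Logstash will keep running after the document has been read; the event itself should be flushed as soon as the </system> line arrives.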