Manipulating XML before hitting XML Filter

Alright, I have a couple changes I want to make prior to an XML file hitting the XML parser.

  1. Strip versioning/encoding statement from the file
  2. Ensure tags <feedback> and </feedback> are placed on their own lines if they're not already there.

The XML reports I receive come in a variety of line formats, and it seems like the XML filter has issues if all the tags reside on the same line. I'm already using a multiline pattern to fold everything together, so I can't use that method. Any ideas?

<?xml version='1.0' encoding='utf-8'?>
<feedback><field1><field2></field1></field2></feedback>

or

<?xml version='1.0' encoding='utf-8'?>
<feedback>
<field1>
<field2>
</field1>
</field2>
</feedback>

or

<?xml version='1.0' encoding='utf-8'?>
<feedback>
<field1><field2></field1></field2>
</feedback>

Here's my input/filter config

input {
  file {
    id => "Ingest"
    path => "C:/DMARC/*.xml"
    discover_interval => 5
    close_older => 5
    codec => multiline {
      negate => true
      pattern => "<record>"
      what => "previous"
    }
  }
}
filter {
  xml {
    id => "Parse"
    force_array => true
    store_xml => false
    source => "message"
    xpath => [
      "feedback/report_metadata/org_name/text()", "Reporting Org",
      "feedback/report_metadata/email/text()", "Org Contact",
      "feedback/report_metadata/report_id/text()", "Report ID",
      "feedback/report_metadata/date_range/begin/text()", "Start Date",
      "feedback/report_metadata/date_range/end/text()", "End Date",
      "feedback/policy_published/domain/text()", "Policy Domain",
      "feedback/policy_published/aspf/text()", "SPF Mode",
      "feedback/policy_published/adkim/text()", "DKIM Mode",
      "feedback/policy_published/p/text()", "DMARC Policy Action",
      "feedback/policy_published/sp/text()", "DMARC Sub-Domain Action",
      "feedback/policy_published/pct/text()", "Application Percentage",
      "record/row/source_ip/text()", "Sender IP",
      "record/row/count/text()", "Message Count",
      "record/row/policy_evaluated/disposition/text()", "Policy Disposition",
      "record/row/policy_evaluated/spf/text()", "SPF Disposition",
      "record/identifiers/header_from/text()", "Message Header",
      "record/auth_results/dkim/domain/text()", "DKIM Domain",
      "record/auth_results/dkim/result/text()", "DKIM Result",
      "record/auth_results/spf/domain/text()", "SPF Domain",
      "record/auth_results/spf/scope/text()", "SPF Scope",
      "record/auth_results/spf/result/text()", "SPF Result"
    ]
  }
}
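
For reference, the closest I got on my own was a mutate/gsub along these lines (an untested sketch, and I'm not sure it's the right approach):

filter {
  mutate {
    # strip the <?xml version='1.0' encoding='utf-8'?> declaration
    gsub => [ "message", "<\?xml[^>]*\?>", "" ]
  }
  # breaking <feedback> and </feedback> onto their own lines seems to
  # need newlines in the gsub replacement (config.support_escapes) or
  # a ruby filter, which is where I got stuck
}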

The XML reports I receive come in a variety of line formats, and it seems like the XML filter has issues if all the tags reside on the same line.

I find that very hard to believe. What makes you think that?

This is the original format of one of the XML reports I am working on:

<?xml version='1.0' encoding='utf-8'?>
<feedback><report_metadata><org_name>Mail.Ru</org_name><email>dmarc_support@corp.mail.ru</email><extra_contact_info>http://help.mail.ru/mail-help</extra_contact_info><report_id>37256247916566362691518220800</report_id><date_range><begin>1518220800</begin><end>1518307200</end></date_range></report_metadata><policy_published><domain>192.168.1.1</domain><adkim>r</adkim><aspf>r</aspf><p>none</p><sp>none</sp><pct>100</pct></policy_published><record><row><source_ip>148.163.159.153</source_ip><count>1</count><policy_evaluated><disposition>none</disposition><dkim>fail</dkim><spf>pass</spf></policy_evaluated></row><identifiers><header_from>192.168.1.1</header_from></identifiers><auth_results><spf><domain>192.168.1.1</domain><scope>mfrom</scope><result>pass</result></spf></auth_results></record></feedback>

These are the two records generated from it:

If I reformatted the data to:

<feedback>
<report_metadata><org_name>Mail.Ru</org_name><email>dmarc_support@corp.mail.ru</email><extra_contact_info>http://help.mail.ru/mail-help</extra_contact_info><report_id>37256247916566362691518220800</report_id><date_range><begin>1518220800</begin><end>1518307200</end></date_range></report_metadata><policy_published><domain>192.168.1.1</domain><adkim>r</adkim><aspf>r</aspf><p>none</p><sp>none</sp><pct>100</pct></policy_published>
<record><row><source_ip>148.163.159.153</source_ip><count>1</count><policy_evaluated><disposition>none</disposition><dkim>fail</dkim><spf>pass</spf></policy_evaluated></row><identifiers><header_from>192.168.1.1</header_from></identifiers><auth_results><spf><domain>192.168.1.1</domain><scope>mfrom</scope><result>pass</result></spf></auth_results>
</record>
</feedback>

Then I get the desired output:


And what does your configuration look like?

What do you mean, my pipeline? That's up in my original post.

Right, sorry. I'm pretty sure it's your multiline configuration. Isn't the goal to slurp the whole file into a single event? Then why involve the <record> element? Flushing the current event whenever a line contains that element doesn't make any sense to me.

No, each entry inside of the record XML tags is a single event. These XMLs are reports sent daily by remote mail servers, as part of DMARC compliance.

Regardless of what final representation you want, you need to make sure the whole XML document is stored in a field, intact. Having <record> in the multiline configuration is incorrect. Ignore the xml filter and the rest of the configuration until you're able to produce one event per XML document.
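
For example, a multiline codec along these lines (a sketch, not tested against your files) should fold each XML file into a single event, since only the line with the XML declaration starts a new event and auto_flush_interval pushes out the last event without waiting for another file to arrive:

codec => multiline {
  pattern => "^<\?xml"
  negate => true
  what => "previous"
  auto_flush_interval => 2
}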

There IS one event per line using that multiline config, it's the repeated record field. The additional information that I would like to be included on the same line only occurs once per file.

There IS one event per line using that multiline config, it's the repeated record field.

Yes, and that's what's making this so difficult.

The additional information that I would like to be included on the same line only occurs once per file.

If you have one XML document per file you'll want to get the whole file into a single Logstash event. Once you've reached that state we can talk about parsing the XML, consolidating records with feedback (or whatever), and eventually perhaps splitting things into multiple events if you want a multi-record document to result in multiple Logstash events.

I'm sorry, there's an obvious skill/knowledge gap here: I'm a network/system administrator, not a programmer/developer. I'm REALLY trying to understand how it functions and get the desired result, because this stands to save many organizations a lot of money and would only result in the Elastic Stack gaining broader use.

Based on observations, it looks like Logstash reads the file and, for each line, creates an event that is recorded. Because all the data that constitutes a single event is spread across multiple lines, I'm assuming (wrongly?) that we want to collapse all the data for a single event onto a single line; that, again, is my assumed purpose of the multiline codec.

If we were to collapse all the data into a single long line, how do we instruct Logstash to identify the end of one event and the beginning of another? I understand xpath can assist with that when there is a known number of elements to target, but these files can have any number of elements (I have one report with over 1,400 events).

Please don't misunderstand me, I'm not trying to be contradictory or argumentative in any way, I'm just ignorant to a lot of this and, while I have a strong desire to learn it and do it, I still have my day job and sleep to attend to, lol.

If you want sample data to see what I mean/play with I can definitely offer it up.

Unfortunately, I couldn't come up with a way to do this within Logstash that would keep the process operating-system agnostic. I came up with a PowerShell script that extracts all the files, appends the data where I want it to go in each XML file, and then saves the modified files to the directory Logstash monitors.

Based on observations, it looks like Logstash reads the file and, for each line, creates an event that is recorded. Because all the data that constitutes a single event is spread across multiple lines, I'm assuming (wrongly?) that we want to collapse all the data for a single event onto a single line; that, again, is my assumed purpose of the multiline codec.

That's correct.

If we were to collapse all the data into a single long line, how do we instruct Logstash to identify the end of one event and the beginning of another? I understand xpath can assist with that when there is a known number of elements to target, but these files can have any number of elements (I have one report with over 1,400 events).

A ruby filter could help out with that.

If your input consists of XML files you'll want to map each such file into exactly one Logstash event (which the multiline codec could do). Splitting that event into multiple events (possibly after rearranging the fields resulting from parsing the XML document) can be done with Logstash.

I think you have a misunderstanding of what these XMLs are. A single DMARC report comes from a remote email server. These reports contain an aggregated summary for each IP the remote server has had contact with. Each of these IPs needs to be considered a single event, not the file. Each IP, and the relevant information for the contact, is contained inside of a record XML tag. Each of these records needs to be seen as a single event. With this in mind, placing the entire file on a single line and considering it a single event doesn't make sense.

Regardless, I abandoned trying to get this done in the pipeline and am using PowerShell to do the appropriate changes. Unfortunately, this means my solution isn't compatible with Linux based setups.

I think you have a misunderstanding of what these XMLs are. A single DMARC report comes from a remote email server. These reports contain an aggregated summary for each IP the remote server has had contact with. Each of these IPs needs to be considered a single event, not the file. Each IP, and the relevant information for the contact, is contained inside of a record XML tag. Each of these records needs to be seen as a single event. With this in mind, placing the entire file on a single line and considering it a single event doesn't make sense.

I repeat: Splitting that event into multiple events (possibly after rearranging the fields resulting from parsing the XML document) can be done with Logstash.

Attempting to optimize the process by using regular expressions to only extract the record elements from the original XML document will result in a fragile solution, never mind if you want to piece together or correlate different parts of the document.

If you can provide an example then I'd love to see it. Over the course of a few threads I've provided lots of sample data, and there are more samples on the internet. I'm not saying I don't believe you, but I can't figure it out, no matter how many times I've been told it can be done.

I've noticed a chronic issue on these forums where people ask for assistance and they're told to go read such and such. There's only so far someone can go reading documentation that, seemingly, is written by developers, for developers, with zero standardization on how examples are displayed, if displayed at all. Am I saying that if someone asks how to do something you should do the work for them? Definitely not, but giving something more than a reference link and a push out the door would go a long way towards giving the internet a repository of information for working with your product, as well as helping build experience for people wanting to use it.

Then again, if the business strategy is to keep that knowledge within the confines of paid license users, in an attempt to push more people to pay for the product, continue on.

If you can provide an example then I'd love to see it.

If you have an XML document with multiple record elements and parse that with the xml filter you'll end up with a JSON document looking something like this:

{
  ...
  "record": [
    {
      "a": "b",
      ...
    },
    {
      "a": "c",
      ...
    },
    ...
  ],
  ...
}

Adding a

split {
  field => "record"
}

filter will split the event in question into multiple events, like this:

{"a": "b", ...}
{"a": "c", ...}

Before the split filter it might be necessary to rearrange the original event somewhat.
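
As an illustration of that rearranging (the "doc" target name here is hypothetical): if the xml filter was given store_xml => true and target => "doc", the record array ends up nested under that target, and you could move it to the top level so the split filter above can work on "record":

mutate {
  # move the parsed record array out of the xml filter's target field
  rename => { "[doc][record]" => "record" }
}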

I've noticed a chronic issue on these forums where people ask for assistance and they're told to go read such and such.

Yes, and that might very well be a reasonable response to a question like "how do I do X" if the documentation covers X. On the other hand, "I've read the documentation about X but there are a few things I don't understand, ..." should elicit a more detailed and exact answer.

As I don't work for Elastic I can't comment on their prioritization of this support channel.

I see...I may tinker with this tonight then and see where I get. At present, it seems like I am adding complexity to the pipe by adding another file processing stage but I'll see where it gets me. Thank you for the example.

Though I am left with an immediate question: the split filter documentation says it creates a clone of the data, leaving the original data intact. Am I correct to think that multiline would put everything onto a single line in field a, then split would turn around and clone everything into field b, and then the xml filter would process b into fields c-z? If that's the case, outside of debugging, would there be any value in retaining the original multiline field? It seems like there wouldn't be.

The xml filter would come before the split filter. The latter doesn't touch any fields except the one named in its field option ("record" in my previous example). Most of the other fields can probably be deleted, either before or after the split filter. A prune filter can be helpful if you don't want to enumerate all the possible names of fields you want to get rid of.
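
For example (field names purely illustrative), a prune whitelist keeps only the fields you list and drops everything else, including the original multiline message:

prune {
  whitelist_names => [ "^record$", "^Reporting Org$", "^Report ID$", "^@timestamp$" ]
}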