Manipulating XML before hitting XML Filter

Alright, I have a couple changes I want to make prior to an XML file hitting the XML parser.

  1. Strip versioning/encoding statement from the file
  2. Ensure tags <feedback> and </feedback> are placed on their own lines if they're not already there.

The XML reports I receive come in a variety of line formats, and it seems like the XML filter has issues if all the tags reside on the same line. I'm already using a multiline pattern to fold everything together, so I can't use that method. Any ideas?

<?xml version='1.0' encoding='utf-8'?>
<feedback><field1><field2></field1></field2></feedback>

or

<?xml version='1.0' encoding='utf-8'?>
<feedback>
<field1>
<field2>
</field1>
</field2>
</feedback>

or

<?xml version='1.0' encoding='utf-8'?>
<feedback>
<field1><field2></field1></field2>
</feedback>

Here's my input/filter config

input {
  file {
    id => "Ingest"
    path => "C:/DMARC/*.xml"
    discover_interval => 5
    close_older => 5
    codec => multiline {
      negate => true
      pattern => "<record>"
      what => "previous"
    }
  }
}
filter {
  xml {
    id => "Parse"
    force_array => true
    store_xml => false
    source => "message"
    xpath => [
      "feedback/report_metadata/org_name/text()", "Reporting Org",
      "feedback/report_metadata/email/text()", "Org Contact",
      "feedback/report_metadata/report_id/text()", "Report ID",
      "feedback/report_metadata/date_range/begin/text()", "Start Date",
      "feedback/report_metadata/date_range/end/text()", "End Date",
      "feedback/policy_published/domain/text()", "Policy Domain",
      "feedback/policy_published/aspf/text()", "SPF Mode",
      "feedback/policy_published/adkim/text()", "DKIM Mode",
      "feedback/policy_published/p/text()", "DMARC Policy Action",
      "feedback/policy_published/sp/text()", "DMARC Sub-Domain Action",
      "feedback/policy_published/pct/text()", "Application Percentage",
      "record/row/source_ip/text()", "Sender IP",
      "record/row/count/text()", "Message Count",
      "record/row/policy_evaluated/disposition/text()", "Policy Disposition",
      "record/row/policy_evaluated/spf/text()", "SPF Disposition",
      "record/identifiers/header_from/text()", "Message Header",
      "record/auth_results/dkim/domain/text()", "DKIM Domain",
      "record/auth_results/dkim/result/text()", "DKIM Result",
      "record/auth_results/spf/domain/text()", "SPF Domain",
      "record/auth_results/spf/scope/text()", "SPF Scope",
      "record/auth_results/spf/result/text()", "SPF Result"
    ]
  }
}
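
For reference, the closest I got on my own was a mutate/gsub along these lines (an untested sketch, and I'm not sure it's the right approach):

filter {
  mutate {
    # strip the <?xml version='1.0' encoding='utf-8'?> declaration
    gsub => [ "message", "<\?xml[^>]*\?>", "" ]
  }
  # breaking <feedback> and </feedback> onto their own lines seems to
  # need newlines in the gsub replacement (config.support_escapes) or
  # a ruby filter, which is where I got stuck
}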

The XML reports I receive come in a variety of line formats, and it seems like the XML filter has issues if all the tags reside on the same line.

I find that very hard to believe. What makes you think that?

This is the original format of one of the XML reports I am working on:

<?xml version='1.0' encoding='utf-8'?>
<feedback><report_metadata><org_name>Mail.Ru</org_name><email>dmarc_support@corp.mail.ru</email><extra_contact_info>http://help.mail.ru/mail-help</extra_contact_info><report_id>37256247916566362691518220800</report_id><date_range><begin>1518220800</begin><end>1518307200</end></date_range></report_metadata><policy_published><domain>192.168.1.1</domain><adkim>r</adkim><aspf>r</aspf><p>none</p><sp>none</sp><pct>100</pct></policy_published><record><row><source_ip>148.163.159.153</source_ip><count>1</count><policy_evaluated><disposition>none</disposition><dkim>fail</dkim><spf>pass</spf></policy_evaluated></row><identifiers><header_from>192.168.1.1</header_from></identifiers><auth_results><spf><domain>192.168.1.1</domain><scope>mfrom</scope><result>pass</result></spf></auth_results></record></feedback>

These are the two records generated from it:

If I reformatted the data to:

<feedback>
<report_metadata><org_name>Mail.Ru</org_name><email>dmarc_support@corp.mail.ru</email><extra_contact_info>http://help.mail.ru/mail-help</extra_contact_info><report_id>37256247916566362691518220800</report_id><date_range><begin>1518220800</begin><end>1518307200</end></date_range></report_metadata><policy_published><domain>192.168.1.1</domain><adkim>r</adkim><aspf>r</aspf><p>none</p><sp>none</sp><pct>100</pct></policy_published>
<record><row><source_ip>148.163.159.153</source_ip><count>1</count><policy_evaluated><disposition>none</disposition><dkim>fail</dkim><spf>pass</spf></policy_evaluated></row><identifiers><header_from>192.168.1.1</header_from></identifiers><auth_results><spf><domain>192.168.1.1</domain><scope>mfrom</scope><result>pass</result></spf></auth_results>
</record>
</feedback>

Then I get the desired output:


And what does your configuration look like?

What do you mean, my pipeline? That's up in my original post.

Right, sorry. I'm pretty sure it's your multiline configuration. Isn't the goal to slurp the whole file into a single event? Then why involve the <record> element? Flushing the current event whenever a line contains that element doesn't make any sense to me.

No, each entry inside of the record XML tags is a single event. These XMLs are reports sent daily by remote mail servers, as part of DMARC compliance.

Regardless of what final representation you want, you need to make sure the whole XML document is stored in a field, intact. Having <record> in the multiline configuration is incorrect. Ignore the xml filter and the rest of the configuration until you're able to produce one event per XML document.
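
For example, a multiline codec along these lines (a sketch, not tested against your files) should fold each XML file into a single event, since only the line with the XML declaration starts a new event and auto_flush_interval pushes out the last event without waiting for another file to arrive:

codec => multiline {
  pattern => "^<\?xml"
  negate => true
  what => "previous"
  auto_flush_interval => 2
}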

There IS one event per line using that multiline config, it's the repeated record field. The additional information that I would like to be included on the same line only occurs once per file.

There IS one event per line using that multiline config, it's the repeated record field.

Yes, and that's what's making this so difficult.

The additional information that I would like to be included on the same line only occurs once per file.

If you have one XML document per file you'll want to get the whole file into a single Logstash event. Once you've reached that state we can talk about parsing the XML, consolidating records with feedback (or whatever), and eventually perhaps splitting things into multiple events if you want a multi-record document to result in multiple Logstash events.

I'm sorry, there's an obvious skill/knowledge gap here: I'm a network/system administrator, not a programmer/developer. I'm REALLY trying to understand how it functions and get the desired result, because this stands to save many organizations a lot of money and would only result in the Elastic Stack gaining broader use.

Based on observations, it looks like Logstash reads the file and, for each line, creates an event that is recorded. Because all the data that constitutes a single event is spread across multiple lines, I'm assuming (wrongly?) that we want to collapse all the data for a single event onto a single line; that, again, is my assumed purpose of the multiline codec.

If we were to collapse all the data into a single long line, how do we instruct Logstash to identify the end of one event and the beginning of another? I understand xpath can assist with that when there is a known number of elements to target, but these files can have any number of elements (I have one report with over 1,400 events).

Please don't misunderstand me, I'm not trying to be contradictory or argumentative in any way, I'm just ignorant to a lot of this and, while I have a strong desire to learn it and do it, I still have my day job and sleep to attend to, lol.

If you want sample data to see what I mean/play with I can definitely offer it up.

Unfortunately, I couldn't come up with a way to do this within Logstash that would keep the process operating-system agnostic. I came up with a PowerShell script that extracts all the files, appends the data where I want it to go in each XML file, and then saves the modified files to the directory Logstash monitors.

Based on observations, it looks like Logstash reads the file and, for each line, creates an event that is recorded. Because all the data that constitutes a single event is spread across multiple lines, I'm assuming (wrongly?) that we want to collapse all the data for a single event onto a single line; that, again, is my assumed purpose of the multiline codec.

That's correct.

If we were to collapse all the data into a single long line, how do we instruct Logstash to identify the end of one event and the beginning of another? I understand xpath can assist with that when there is a known number of elements to target, but these files can have any number of elements (I have one report with over 1,400 events).

A ruby filter could help out with that.

If your input consists of XML files you'll want to map each such file into exactly one Logstash event (which the multiline codec could do). Splitting that event into multiple events (possibly after rearranging the fields resulting from parsing the XML document) can be done with Logstash.

I think you have a misunderstanding of what these XMLs are. A single DMARC report comes from a remote email server. These reports contain an aggregated summary for each IP the remote server has had contact with. Each of these IPs needs to be considered a single event, not the file. Each IP, and the relevant information for the contact, is contained inside of a record XML tag. Each of these records needs to be seen as a single event. With this in mind, placing the entire file on a single line and considering it a single event doesn't make sense.

Regardless, I abandoned trying to get this done in the pipeline and am using PowerShell to do the appropriate changes. Unfortunately, this means my solution isn't compatible with Linux based setups.

I think you have a misunderstanding of what these XMLs are. A single DMARC report comes from a remote email server. These reports contain an aggregated summary for each IP the remote server has had contact with. Each of these IPs needs to be considered a single event, not the file. Each IP, and the relevant information for the contact, is contained inside of a record XML tag. Each of these records needs to be seen as a single event. With this in mind, placing the entire file on a single line and considering it a single event doesn't make sense.

I repeat: Splitting that event into multiple events (possibly after rearranging the fields resulting from parsing the XML document) can be done with Logstash.

Attempting to optimize the process by using regular expressions to only extract the record elements from the original XML document will result in a fragile solution, never mind if you want to piece together or correlate different parts of the document.

If you can provide an example then I'd love to see it. Over the course of a few threads I've provided lots of sample data, and there are more samples on the internet. I'm not saying I don't believe you, but I can't figure it out, no matter how many times I've been told it can be done.

I've noticed a chronic issue on these forums where people ask for assistance and they're told to go read such and such. There's only so far someone can go reading documentation that, seemingly, is written by developers, for developers, with zero standardization on how examples are displayed, if displayed at all. Am I saying that if someone asks how to do something you should do the work for them? Definitely not, but giving something more than a reference link and a push out the door would go a long way towards giving the internet a repository of information for working with your product, as well as helping build experience for people wanting to use it.

Then again, if the business strategy is to keep that knowledge within the confines of paid license users, in an attempt to push more people to pay for the product, continue on.

If you can provide an example then I'd love to see it.

If you have an XML document with multiple record elements and parse that with the xml filter you'll end up with a JSON document looking something like this:

{
  ...
  "record": [
    {
      "a": "b",
      ...
    },
    {
      "a": "c",
      ...
    },
    ...
  ],
  ...
}

Adding a

split {
  field => "record"
}

filter will split the event in question into multiple events, like this:

{"a": "b", ...}
{"a": "c", ...}

Before the split filter it might be necessary to rearrange the original event somewhat.
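
As an illustration of that rearranging (the "doc" target name here is hypothetical): if the xml filter was given store_xml => true and target => "doc", the record array ends up nested under that target, and you could move it to the top level so the split filter above can work on "record":

mutate {
  # move the parsed record array out of the xml filter's target field
  rename => { "[doc][record]" => "record" }
}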

I've noticed a chronic issue on these forums where people ask for assistance and they're told to go read such and such.

Yes, and that might very well be a reasonable response to a question like "how do I do X" if the documentation covers X. On the other hand, "I've read the documentation about X but there are a few things I don't understand, ..." should elicit a more detailed and exact answer.

As I don't work for Elastic I can't comment on their prioritization of this support channel.

I see...I may tinker with this tonight then and see where I get. At present, it seems like I am adding complexity to the pipe by adding another file processing stage but I'll see where it gets me. Thank you for the example.

Though I am left with an immediate question: the split filter documentation says it creates a clone of the data, leaving the original data intact. Am I correct to think that multiline would put everything onto a single line in field a, then split would turn around and clone everything into field b, and then the xml filter would process b into fields c-z? If that's the case, outside of debugging, would there be any value in retaining the original multiline field? It seems like there wouldn't be.

The xml filter would come before the split filter. The latter doesn't touch any fields except the one named in its field option ("record" in my previous example). Most of the other fields can probably be deleted, either before or after the split filter. A prune filter can be helpful if you don't want to enumerate all the possible names of fields you want to get rid of.
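
For example (field names purely illustrative), a prune whitelist keeps only the fields you list and drops everything else, including the original multiline message:

prune {
  whitelist_names => [ "^record$", "^Reporting Org$", "^Report ID$", "^@timestamp$" ]
}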