Xml parse files help

timj123 · May 31, 2017, 3:35pm

Everyone -
I am really new to ELK and I need to parse a very large (~1.9M line) XML file.
From this file I want to capture two tag values and create a timestamp field that all of the events that follow will use.
Each event in the file is enclosed in tags, and the events vary in length (number of lines).

I've tried for about a week, over multiple attempts, to parse this file, without success.

The two fields I'm trying to capture for the timestamp are directly under the root element of the XML file.

TAGS:
ReportStartDate
ReportStartTime

I want to combine the two fields above with a "T" between them, so that the timestamp will look like:
2017-05-30T12:15:00+00:00
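
As a minimal sketch of the combination described above, written in Python rather than Logstash: the helper name `combine_report_timestamp` is hypothetical, and the only assumption is that both values arrive as plain strings in the formats shown in the sample.

```python
from datetime import datetime


def combine_report_timestamp(report_date: str, report_time: str) -> str:
    """Join ReportStartDate and ReportStartTime with a literal 'T' between them."""
    stamp = f"{report_date.strip()}T{report_time.strip()}"
    # Parse once so malformed input fails early; fromisoformat (Python 3.7+)
    # accepts the "+00:00" offset form used in the sample file.
    datetime.fromisoformat(stamp)
    return stamp


print(combine_report_timestamp("2017-05-30", "12:15:00+00:00"))
# prints 2017-05-30T12:15:00+00:00
```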

Then I need to create events, using the timestamp above, from the data between the tags <measInfo measInfoId="PNODE"> and </measInfo>.

Below is a very small sample of the data I'm trying to parse.

<?xml version="1.0" encoding="UTF-8"?>
<NODES>
  <ReportStartDate>2017-05-30</ReportStartDate>
  <ReportStartTime>12:15:00+00:00</ReportStartTime>
  <OriginalFile>PM201705301232+000048NODE.xml</OriginalFile>
  <measData>
    <managedElement/>
    <measInfo measInfoId="PNODE">
      <granPeriod duration="PT900S" endTime="2017-05-30T12:30:00+00:00"/>
      <ResultType>"PNODE-1"</ResultType>
      <Mif500RespRecRegCtr>0</Mif500RespRecRegCtr>
      <SipOrigInviteRecCntr>146</SipOrigInviteRecCntr>
      <Mif487RespSentInvCtr>46</Mif487RespSentInvCtr>
      ...
      <SrtpE2aeEnforceCtr>0</SrtpE2aeEnforceCtr>
      <RxASAnsSentCntr>0</RxASAnsSentCntr>
      <Mif404RespSentInvCtr>0</Mif404RespSentInvCtr>
      <MsrpTlsE2aeFailCtr>0</MsrpTlsE2aeFailCtr>
    </measInfo>
    <measInfo measInfoId="PNODE">
      <granPeriod duration="PT900S" endTime="2017-05-30T12:30:00+00:00"/>
      <ResultType>"PPNODE-2"</ResultType>
      <Mif500RespRecRegCtr>0</Mif500RespRecRegCtr>
      <SipOrigInviteRecCntr>1971</SipOrigInviteRecCntr>
      <Mif487RespSentInvCtr>468</Mif487RespSentInvCtr>
      ...
      <SrtpE2aeEnforceCtr>0</SrtpE2aeEnforceCtr>
      <RxASAnsSentCntr>0</RxASAnsSentCntr>
      <Mif404RespSentInvCtr>0</Mif404RespSentInvCtr>
      <MsrpTlsE2aeFailCtr>0</MsrpTlsE2aeFailCtr>
    </measInfo>
  </measData>
</NODES>

Please let me know what I need to provide for help.
Thanks!

timj123 · May 31, 2017, 10:31pm

Here's the config I'm trying to use to parse this file.
I still haven't resolved my issue. Any help would be appreciated.

input {

  file {
    path => "/home/nodelogs/data-xml/20170530.121500.twoRecords.short.ns"   
    start_position => "beginning"
    sincedb_path => "/dev/null"
    type => "nodedata"
    codec => multiline {
      pattern => "<\/measData>"  
      negate => "true"
      what => "previous"
      multiline_tag => "test_multiTag"
      max_lines => 1000
      auto_flush_interval => 1

    } 

  }
} 

filter {
 if [type] == "nodedata" {

  xml {
    source => "message"
    target => "parsed"
    xpath => [
       "/NODES/ReportStartDate", "StartDate",
       "/NODES/ReportStartTime", "StartTime"

    ]   
  } 

  date {
    match => ["endTime", "yyyy-MM-dd HH:mm:ss", "ISO8601"]

  }     
 }      
}       


output {
 if [type] == "nodedata" {
  stdout {codec => rubydebug}
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nodedata-%{+YYYY.MM.dd}"
    document_type => "nodedata"                  
  }
 }      
}

magnusbaeck · June 4, 2017, 1:48pm

I suspect Logstash might not deal with such large XML files in a good way. If you can parse the whole file in one swoop then it's technically pretty easy to do what you want (though getting the multiline config right can be tricky). It would be nice if you could parse the measInfo elements one by one but then you won't be able to pick up the timestamp correctly.

Here are a few options to explore:

1. Try to parse the file in one swoop.
2. Use another program for parsing the file and rewriting it to a more convenient format.
3. Write a custom plugin for parsing the timestamps and making them accessible as you process the measInfo elements.
4. Use Logstash to parse the file twice: once to extract the timestamp and store it in the name of the output file (but otherwise don't try to parse the XML), then another file input that reads those files and uses a multiline codec to extract the measInfo elements into events, stamping them with the timestamp found in the input filename.

Many of these options involve not parsing the files as XML but rather taking regexp shortcuts. Beware.
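
To illustrate the "use another program" option, here is a rough Python sketch that streams the file with a real XML parser instead of regexps and emits one JSON object per measInfo element, each stamped with the report timestamp. The function name `iter_meas_events`, the `@timestamp` key, and the flat JSON layout are assumptions made for this example, not anything prescribed by Logstash. It also assumes ReportStartDate and ReportStartTime appear before the first measInfo, as they do in the sample.

```python
import io
import json
import xml.etree.ElementTree as ET


def iter_meas_events(source):
    """Yield one dict per <measInfo>; `source` is a filename or a binary file object."""
    report_date = report_time = None
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ReportStartDate":
            report_date = (elem.text or "").strip()
        elif elem.tag == "ReportStartTime":
            report_time = (elem.text or "").strip()
        elif elem.tag == "measInfo":
            if not report_date or not report_time:
                raise ValueError("measInfo seen before ReportStartDate/ReportStartTime")
            event = {
                "@timestamp": f"{report_date}T{report_time}",
                "measInfoId": elem.get("measInfoId"),
            }
            for child in elem:
                if child.tag == "granPeriod":
                    event["duration"] = child.get("duration")
                    event["endTime"] = child.get("endTime")
                else:
                    # Strip the literal quotes seen around ResultType values.
                    text = (child.text or "").strip().strip('"')
                    event[child.tag] = int(text) if text.isdecimal() else text
            yield event
            # Drop this element's children and text; the emptied element
            # itself stays attached to its parent.
            elem.clear()


# Trimmed-down copy of the sample from the question, for demonstration only.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<NODES>
  <ReportStartDate>2017-05-30</ReportStartDate>
  <ReportStartTime>12:15:00+00:00</ReportStartTime>
  <measData>
    <measInfo measInfoId="PNODE">
      <granPeriod duration="PT900S" endTime="2017-05-30T12:30:00+00:00"/>
      <ResultType>"PNODE-1"</ResultType>
      <SipOrigInviteRecCntr>146</SipOrigInviteRecCntr>
    </measInfo>
  </measData>
</NODES>
"""

for event in iter_meas_events(io.BytesIO(SAMPLE)):
    print(json.dumps(event))
```

Passing a real path instead of the `io.BytesIO` object and redirecting the output to a file would give one JSON document per line, which Logstash could presumably then read without any multiline handling (for example with a JSON codec). That follow-on setup is an assumption and is not shown here.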

timj123 · June 5, 2017, 11:58am

Awesome! Thanks very much for the options. I should be able to figure something out from your suggestions.

Thanks again!

timj123 · June 27, 2017, 7:36pm

I'm still stuck on this.
How could I parse the file in one swoop, using the same timestamp for all sections of the file?
Would I have to "loop" through each section of the file, inserting the timestamp into each one?

timj123 · June 28, 2017, 9:09pm

Does anyone have any insight on this? I'm stuck in a bad way.

timj123 · July 3, 2017, 5:40pm

Can anyone help with this?

system · July 31, 2017, 5:41pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
