XML file parsing help

Everyone -
I am really new to ELK and I have a requirement to parse a very large (~1.9M line) XML file.
In this XML file I want to capture two tag fields and create a timestamp field that all the events that follow will use.
The events in this file are each enclosed in tags and vary in length (number of lines).

I've spent about a week trying to parse this file, without success.

The two fields I'm trying to capture for the timestamp sit directly under the root element of the XML file.

TAGS:
ReportStartDate
ReportStartTime

I want to combine the two fields above with a "T" between them, so that the timestamp will look like:
2017-05-30T12:15:00+00:00
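
As a quick sanity check of the target format, here is a minimal Python sketch that joins the two values with a "T" and confirms the result parses as an ISO 8601 timestamp. The function name `combine_timestamp` is hypothetical and only used for illustration; it is not part of Logstash.

```python
from datetime import datetime


def combine_timestamp(report_date: str, report_time: str) -> str:
    """Join the date and time strings with a 'T' and validate the result."""
    combined = f"{report_date.strip()}T{report_time.strip()}"
    # fromisoformat raises ValueError if the combined string is malformed.
    datetime.fromisoformat(combined)
    return combined


print(combine_timestamp("2017-05-30", "12:15:00+00:00"))
```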

Then I need to create events that use the timestamp above, built from the data between the tags <measInfo measInfoId="PNODE"> and </measInfo>.

Below is a very small sample of the data I'm trying to parse.

<?xml version="1.0" encoding="UTF-8"?>
<NODES>
  <ReportStartDate>2017-05-30</ReportStartDate>
  <ReportStartTime>12:15:00+00:00</ReportStartTime>
  <OriginalFile>PM201705301232+000048NODE.xml</OriginalFile>
  <measData>
    <managedElement/>
    <measInfo measInfoId="PNODE">
      <granPeriod duration="PT900S" endTime="2017-05-30T12:30:00+00:00"/>
      <ResultType>"PNODE-1"</ResultType>
      <Mif500RespRecRegCtr>0</Mif500RespRecRegCtr>
      <SipOrigInviteRecCntr>146</SipOrigInviteRecCntr>
      <Mif487RespSentInvCtr>46</Mif487RespSentInvCtr>
      ...
      <SrtpE2aeEnforceCtr>0</SrtpE2aeEnforceCtr>
      <RxASAnsSentCntr>0</RxASAnsSentCntr>
      <Mif404RespSentInvCtr>0</Mif404RespSentInvCtr>
      <MsrpTlsE2aeFailCtr>0</MsrpTlsE2aeFailCtr>
    </measInfo>
    <measInfo measInfoId="PNODE">
      <granPeriod duration="PT900S" endTime="2017-05-30T12:30:00+00:00"/>
      <ResultType>"PPNODE-2"</ResultType>
      <Mif500RespRecRegCtr>0</Mif500RespRecRegCtr>
      <SipOrigInviteRecCntr>1971</SipOrigInviteRecCntr>
      <Mif487RespSentInvCtr>468</Mif487RespSentInvCtr>
      ...
      <SrtpE2aeEnforceCtr>0</SrtpE2aeEnforceCtr>
      <RxASAnsSentCntr>0</RxASAnsSentCntr>
      <Mif404RespSentInvCtr>0</Mif404RespSentInvCtr>
      <MsrpTlsE2aeFailCtr>0</MsrpTlsE2aeFailCtr>
    </measInfo>
  </measData>
</NODES>

Please let me know what else I should provide to get help.
Thanks!

Here's the config that I'm trying to use to parse this file.
I still haven't resolved my issue. Any help would be appreciated.

input {

  file {
    path => "/home/nodelogs/data-xml/20170530.121500.twoRecords.short.ns"   
    start_position => "beginning"
    sincedb_path => "/dev/null"
    type => "nodedata"
    codec => multiline {
      pattern => "<\/measData>"  
      negate => "true"
      what => "previous"
      multiline_tag => "test_multiTag"
      max_lines => 1000
      auto_flush_interval => 1

    } 

  }
} 

filter {
 if [type] == "nodedata" {

  xml {
    source => "message"
    target => "parsed"
    xpath => [
       "/NODES/ReportStartDate", "StartDate",
       "/NODES/ReportStartTime", "StartTime"

    ]   
  } 

  date {
    match => ["endTime", "yyyy-MM-dd HH:mm:ss", "ISO8601"]

  }     
 }      
}       


output {
 if [type] == "nodedata" {
  stdout {codec => rubydebug}
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nodedata-%{+YYYY.MM.dd}"
    document_type => "nodedata"                  
  }
 }      
}

I suspect Logstash might not handle such large XML files well. If you can parse the whole file in one swoop then it's technically pretty easy to do what you want (though getting the multiline config right can be tricky). It would be nice if you could parse the measInfo elements one by one, but then you won't be able to pick up the timestamp correctly, because ReportStartDate and ReportStartTime sit outside the measInfo elements.

Here are a few options to explore:

  • Try to parse the file in one swoop.
  • Use another program for parsing the file and rewriting it to a more convenient format.
  • Write a custom plugin for parsing the timestamps and making them accessible as you process the measInfo elements.
  • Use Logstash to parse the file twice; once to extract the timestamp and store it in the name of the output file (but otherwise don't try to parse the XML), then another file input that reads those files and uses a multiline codec to extract the measInfo elements into events, stamping them with the timestamp found in the input filename.

Many of these options involve not parsing the files as XML but rather taking regexp shortcuts. Beware.
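
To illustrate the "use another program" option without regexp shortcuts, here is a minimal Python sketch, not a tested production tool. It streams the file with the standard library's `xml.etree.ElementTree.iterparse`, remembers the two report fields, and emits one JSON object per measInfo element. The function name `meas_info_events`, the output field name `report_timestamp`, and the trimmed `SAMPLE` document are assumptions made for the example.

```python
import io
import json
import xml.etree.ElementTree as ET

# Trimmed-down sample modelled on the data in the question (assumption).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<NODES>
  <ReportStartDate>2017-05-30</ReportStartDate>
  <ReportStartTime>12:15:00+00:00</ReportStartTime>
  <measData>
    <measInfo measInfoId="PNODE">
      <granPeriod duration="PT900S" endTime="2017-05-30T12:30:00+00:00"/>
      <ResultType>"PNODE-1"</ResultType>
      <SipOrigInviteRecCntr>146</SipOrigInviteRecCntr>
    </measInfo>
    <measInfo measInfoId="PNODE">
      <granPeriod duration="PT900S" endTime="2017-05-30T12:30:00+00:00"/>
      <ResultType>"PPNODE-2"</ResultType>
      <SipOrigInviteRecCntr>1971</SipOrigInviteRecCntr>
    </measInfo>
  </measData>
</NODES>
"""


def meas_info_events(source):
    """Yield one dict per <measInfo>, stamped with the report timestamp.

    `source` is a filename or a binary file object.
    """
    start_date = start_time = None
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ReportStartDate":
            start_date = (elem.text or "").strip()
        elif elem.tag == "ReportStartTime":
            start_time = (elem.text or "").strip()
        elif elem.tag == "measInfo":
            if not start_date or not start_time:
                raise ValueError("measInfo seen before the report date/time")
            event = {
                "report_timestamp": f"{start_date}T{start_time}",
                "measInfoId": elem.get("measInfoId"),
            }
            for child in elem:
                if child.tag == "granPeriod":
                    event["endTime"] = child.get("endTime")
                else:
                    text = (child.text or "").strip()
                    # Store plain digit strings as integers, keep the rest as text.
                    event[child.tag] = int(text) if text.isdigit() else text
            yield event
            # Drop the element's children so memory use stays low on big files.
            elem.clear()


if __name__ == "__main__":
    for event in meas_info_events(io.BytesIO(SAMPLE)):
        print(json.dumps(event))
```

Each output line is a standalone JSON document that already carries the timestamp, so a later Logstash pipeline should not need the multiline codec or the xml filter for these events; a date filter on `report_timestamp` could then set the event time.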

Awesome! Thanks much on the options. I should be able to figure something out with your suggestions.

Thanks again!

I'm still stuck on this.
How could I parse the file in one swoop using the timestamp for all sections of this file?
Would I have to "loop" through each section of the file, inserting the timestamp into each one?
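
For what it's worth, the "loop" idea can be sketched outside Logstash in a few lines of Python. This is only an illustration of the logic, assuming the whole document fits in memory: read the two report fields once from the root, then copy the combined timestamp into every measInfo section. The function name `stamp_sections` and the shortened `DOC` sample are made up for the example.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the real file (assumption).
DOC = b"""<NODES>
  <ReportStartDate>2017-05-30</ReportStartDate>
  <ReportStartTime>12:15:00+00:00</ReportStartTime>
  <measData>
    <measInfo measInfoId="PNODE">
      <ResultType>"PNODE-1"</ResultType>
      <Mif487RespSentInvCtr>46</Mif487RespSentInvCtr>
    </measInfo>
    <measInfo measInfoId="PNODE">
      <ResultType>"PPNODE-2"</ResultType>
      <Mif487RespSentInvCtr>468</Mif487RespSentInvCtr>
    </measInfo>
  </measData>
</NODES>
"""


def stamp_sections(xml_bytes):
    """Parse the whole document at once and stamp every measInfo section."""
    root = ET.fromstring(xml_bytes)
    # Read the timestamp parts once; they live directly under the root.
    date_part = root.findtext("ReportStartDate").strip()
    time_part = root.findtext("ReportStartTime").strip()
    timestamp = f"{date_part}T{time_part}"

    stamped = []
    for info in root.iter("measInfo"):
        section = {"timestamp": timestamp}
        for child in info:
            section[child.tag] = child.text
        stamped.append(section)
    return stamped


for section in stamp_sections(DOC):
    print(section)
```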

Does anyone have any insight on this? I'm badly stuck.

Can anyone help with this?
