Everyone -
I am really new to ELK and I have a requirement to parse a very large (~1.9M line) XML file.
In this XML file I want to capture two tag fields and create a timestamp field that all the events that follow will use.
The events in this file are surrounded by tags and span different numbers of lines.
I've been trying for about a week to parse this file, without success.
The two fields I'm trying to capture for the timestamp are directly under the root element of the XML file.
TAGS:
ReportStartDate
ReportStartTime
I want to combine the two fields above with a "T" between them, so that the timestamp will look like:
2017-05-30T12:15:00+00:00
Then I need to create events stamped with the timestamp above, one per block of data between the tags <measInfo measInfoId="PNODE"> and </measInfo>.
Below is a very small sample of the data I'm trying to parse.
I suspect Logstash might not cope well with such large XML files. If you can parse the whole file in one swoop then it's technically pretty easy to do what you want (though getting the multiline config right can be tricky). It would be nice if you could parse the measInfo elements one by one, but then you won't be able to pick up the timestamp correctly, because it appears only once at the top of the file and not inside each measInfo element.
Here are a few options to explore:
Try to parse the file in one swoop.
Use another program for parsing the file and rewriting it to a more convenient format.
Write a custom plugin for parsing the timestamps and making them accessible as you process the measInfo elements.
Use Logstash to parse the file twice; once to extract the timestamp and store it in the name of the output file (but otherwise don't try to parse the XML), then another file input that reads those files and uses a multiline codec to extract the measInfo elements into events, stamping them with the timestamp found in the input filename.
Many of these options involve not parsing the files as XML but rather taking regexp shortcuts. Beware.
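To make the second option (another program that rewrites the file) more concrete, here is a minimal Python sketch. It is not a Logstash feature, just one way such a preprocessor could look. Since the sample data isn't shown, the document layout is an assumption: ReportStartDate holds something like 2017-05-30, ReportStartTime holds something like 12:15:00+00:00, and both appear before the first measInfo element. The names iter_events and write_ndjson, the output field names, and the SAMPLE document are made up for illustration. It streams the file with iterparse rather than loading it all at once, and it uses a real XML parser instead of regexps.

```python
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real layout may differ.
SAMPLE = """<report>
  <ReportStartDate>2017-05-30</ReportStartDate>
  <ReportStartTime>12:15:00+00:00</ReportStartTime>
  <measInfo measInfoId="PNODE"><measValue>1</measValue></measInfo>
  <measInfo measInfoId="PNODE"><measValue>2</measValue></measInfo>
</report>"""


def iter_events(source):
    """Yield one dict per <measInfo> element, stamped with the report timestamp."""
    date = time = None
    # "end" events fire once an element has been read completely.
    for _, elem in ET.iterparse(source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop an XML namespace, if any
        if tag == "ReportStartDate":
            date = (elem.text or "").strip()
        elif tag == "ReportStartTime":
            time = (elem.text or "").strip()
        elif tag == "measInfo":
            if date is None or time is None:
                # Assumption violated: timestamp tags must come first.
                raise ValueError("measInfo seen before the report timestamp")
            yield {
                "@timestamp": f"{date}T{time}",
                "measInfoId": elem.get("measInfoId"),
                "raw": ET.tostring(elem, encoding="unicode").strip(),
            }
            elem.clear()  # drop the element's children to limit memory use


def write_ndjson(source, out):
    """Write each event as one JSON object per line."""
    for event in iter_events(source):
        out.write(json.dumps(event) + "\n")
```

Calling write_ndjson(open("report.xml", "rb"), sys.stdout) (with a hypothetical file name, and sys imported) would emit one JSON line per measInfo element. A file like that should be much easier for Logstash to read line by line, for example with a JSON codec, than the original XML.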
I'm still stuck on this.
How could I parse the file in one swoop using the timestamp for all sections of this file?
Would I have to "loop" through each section of the file, inserting the timestamp into each one?
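On the "loop" question, the idea is the same whichever tool does it: read the timestamp once, then attach that one value to every measInfo section. Here is a rough Python sketch of that logic (not a Logstash config), assuming the whole document fits in memory and that ReportStartDate and ReportStartTime are direct children of the root, as described above. The function name stamp_all and the sample document are hypothetical, and the date and time formats are guessed from the desired 2017-05-30T12:15:00+00:00 result.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical miniature document; the real layout may differ.
SAMPLE_DOC = """<report>
  <ReportStartDate>2017-05-30</ReportStartDate>
  <ReportStartTime>12:15:00+00:00</ReportStartTime>
  <measInfo measInfoId="PNODE"><measValue>1</measValue></measInfo>
  <measInfo measInfoId="PNODE"><measValue>2</measValue></measInfo>
</report>"""


def stamp_all(xml_text):
    """Parse the whole document once and pair every measInfo with the timestamp."""
    root = ET.fromstring(xml_text)
    # The two tags are read a single time, directly under the root.
    date = root.findtext("ReportStartDate").strip()
    time = root.findtext("ReportStartTime").strip()
    # Parsing validates that the combined string is a real ISO 8601 timestamp.
    stamp = datetime.fromisoformat(f"{date}T{time}")
    # One (timestamp, section) pair per measInfo; the timestamp is reused.
    return [
        (stamp.isoformat(), ET.tostring(info, encoding="unicode").strip())
        for info in root.iter("measInfo")
    ]
```

So yes, there is a loop over the sections, but the timestamp is extracted only once and reused for each one. With ~1.9M lines, holding the whole tree in memory may or may not be acceptable, which is the same trade-off as the "one swoop" option in Logstash.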