Get the XML node from unformatted XML

I have an XML file that does not parse correctly unless I prettify it. A tag can start on a new line or on the same line as the previous end tag, and there is no indentation. When I prettified the XML it worked fine. Can we achieve this without prettifying, or do we need to prettify the XML before we process it with Logstash? Is there a way to handle this in Logstash?

Below is my XML:

<?xml version="1.0" encoding="utf-8"?><batch_upload><batches><batch><id>00010</id><title>Batch Title</title><description><![CDATA[We're one of the largest Membership organizations in the country, but we’re so much more than our legendary roadside service. We call our club's vision, mission, values, and supporting pillars "Our House" because they are the foundation for all that we do.  We're working to transform life by unleashing the innovative spirit of our Team Members. We're community minded, and celebrate the growth, development and successes of our diverse Team Members. .]]></description><city>Anchorage</city><state>AK</state><zipcode>99503</zipcode><country>USA</country><parameters /><groups><group type="1">10</group><group type="1">11</group><classification type="1">12</group><group type="2" />212</group><requirements /><person><name>Person one</name><methods><method type="online">Method 1</method></methods></person></batch></batches></batch_upload>

Below is my Logstash config:

    input {
        file {
            path => "batches.xml"
            start_position => beginning
            sincedb_path => "NUL"
            type => "xml"
            codec => multiline {
                pattern => "|</batches"
                negate => true
                what => "previous"
                auto_flush_interval => 1
            }
        }
    }

    filter {
        if [message] == "" or [message] == "" or [message] == "\r" {
            drop {}
        }
        xml {
            store_xml => false
            source => "message"
            target => "message.parsed"
            xpath => [
                "/batch/id/text()", batch_id,
                "/batch/title/text()", batch_title
            ]
            force_array => false
        }
    }

You do not need to prettify XML before parsing it. Please edit your post and use </> in the tool bar above the edit pane to preserve the formatting of your XML and configuration.

That is not valid XML. The groups element is never closed. The final group is closed using both /> and </group>, and the classification element is also closed using </group>. (The space in <requirements /> and <parameters /> is not a problem; XML allows whitespace before />.)

Once you fix all that, you need to use the complete path in each xpath expression.

    xpath => {
        "/batch_upload/batches/batch/id/text()" => batch_id
        "/batch_upload/batches/batch/title/text()" => batch_title
    }

Thank you, Badger, for the reply.

It was a typo.

Here is the original XML we get:

<?xml version="1.0" encoding="utf-8"?><batch_upload><batches><batch><id>0101</id><title>Some Title 1</title><description><![CDATA[We're one of the largest Membership organizations in the country, but we’re so much more than our legendary roadside service. We call our club's vision, mission, values, and supporting pillars "Our House" because they are the foundation for all that we do. We're working to transform AAA for the next century with a mission to create Members

  • Sells International & Domestic vacation packages, cruises, tours, hotel, car rental, rail and air travel

  • Owns the relationship with the member from beginning to end of each travel related transaction

  • Researches, evaluates and compares appropriate AAA Travel Partner packages to match up with member needs for the purpose of “delivering exceptional member experiences” in every transaction

for life by unleashing the innovative spirit of our Team Members. We're community minded, and celebrate the growth, development and successes of our diverse Team Members.]]></description><city>Anchorage</city><state>AK</state><zipcode>99503</zipcode><country>USA</country><dateacquired>2018-12-19T08:19:40-05:00</dateacquired><parameters /><groups><group type="1">222</group><group type="1">333</group><group type="1">444</group><group type="2" /></groups><requirements /><application><person>Person One</person><methods><method type="online">5252klg</method></methods></application></batch><batch><id>0202</id><title>Some Title 2</title><description><![CDATA[We're one of the largest Membership organizations in the country, but we’re so much more than our legendary roadside service. We call our club's vision, mission, values, and supporting pillars

  • Sells International & Domestic vacation packages, cruises, tours, hotel, car rental, rail and air travel

  • Owns the relationship with the member from beginning to end of each travel related transaction

  • Researches, evaluates and compares appropriate AAA Travel Partner packages to match up with member needs for the purpose of “delivering exceptional member experiences” in every transaction

"Our House" because they are the foundation for all that we do. We're working to transform AAA for the next century with a mission to create Members for life by unleashing the innovative spirit of our Team Members.]]></description><city>Anchorage</city><state>AK</state><zipcode>99503</zipcode><country>USA</country><dateacquired>2018-12-19T23:30:11-05:00</dateacquired><parameters /><groups><group type="1">545454</group><group type="1">4545</group><group type="1">7878</group><group type="2" /></groups><requirements /><application><person>Person two</person><methods><method type="online">213234sdf</method></methods></application></batch><batch><id>0303</id><title>Some Title 3</title><description><![CDATA[We are s -

  • Sells International & Domestic vacation packages, cruises, tours, hotel, car rental, rail and air travel

  • Owns the relationship with the member from beginning to end of each travel related transaction

  • Researches, evaluates and compares appropriate AAA Travel Partner packages to match up with member needs for the purpose of “delivering exceptional member experiences” in every transaction

an industry leader in the sales and lease-to-own retailer known for quality brand names and superior customer service. We provide our team members the opportunity to reach their full potential in a team-oriented, high-energy, recognition-based environment with competitive pay and benefits. This is much more than a batch – It is a career with purpose]]></description><city>Anchorage</city><state>AK</state><zipcode>99503</zipcode><country>USA</country><dateacquired>2018-12-05T03:13:44-05:00</dateacquired><parameters /><groups><group type="1">454545</group><group type="1">7778</group><group type="1">45555</group><group type="2" /></groups><requirements /><application><person>Person three</person><methods><method type="online">asdfsafwer23</method></methods></application></batch></batches></batch_upload>

Thanks again, Badger. I appreciate your help.

I want you to know that we do have multiple batches, and the file is huge. Can you help with reading each batch as a separate event, rather than all the batches in a single event?

OK, so this is less about XML and more about how to create events using a file input. There are a couple of options.

You say the file is huge, but do not quantify that. Huge is different things in different circumstances. For example, if you have a 64-bit JVM on a box with 2 TB of memory you could possibly ingest 100 GB files as single events. I'd say if the file size is more than 10% of the heap it is unlikely to work.

If you are going to use a multiline codec to handle each batch then you need to prettify it enough for the start of a batch to be on a new line, so that you can use a pattern that matches the end of a batch.

Then you will need some mutate+gsub to get rid of the xml and batch_upload elements.

Does a batch_upload element ever contain more than one batches element?
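
As a rough sketch of the multiline-plus-gsub approach described above (not a tested configuration): it assumes the file has already been preprocessed so that every <batch> start tag begins a line. The path is a hypothetical placeholder, the field names batch_id and batch_title are borrowed from the config earlier in the thread, and the gsub patterns are only a guess at what needs stripping from this particular file.

```
input {
    file {
        path => "/path/to/batches.xml"    # hypothetical absolute path
        start_position => "beginning"
        sincedb_path => "NUL"
        codec => multiline {
            # Any line that does not start with <batch> is appended to the
            # previous line, so each event begins at a <batch> start tag.
            pattern => "^<batch>"
            negate => true
            what => "previous"
            # Flushes the last batch, which has no following <batch> line.
            auto_flush_interval => 1
        }
    }
}

filter {
    # Strip the XML declaration and the wrapper tags so that only a single
    # <batch>...</batch> element remains in the message.
    mutate {
        gsub => [
            "message", "<[?]xml[^>]*>", "",
            "message", "</?batch_upload>", "",
            "message", "</?batches>", ""
        ]
    }
    # The leading event holds only the wrapper tags, so nothing is left to parse.
    if [message] !~ /<batch>/ {
        drop {}
    }
    xml {
        source => "message"
        store_xml => false
        # After the split, batch is the root element of each event.
        xpath => {
            "/batch/id/text()" => "batch_id"
            "/batch/title/text()" => "batch_title"
        }
    }
}
```

Because each event now holds one batch, the xpath expressions in this sketch are rooted at /batch rather than at /batch_upload/batches/batch.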

Thank you, Badger.

It was very helpful!
I am working on the XML file and will get back to you if I hit any obstacle.

As you suggested, we are formatting the XML file before processing it with Logstash. Below is the process we are following. We found that the number of documents indexed into ES is different on every run.

  1. We validated the XML.

  2. We counted the batches so that we can check that every document is processed.

  3. FYI, the file we are processing holds 1849702 batches.

  4. The first run produced 1849840 documents; the next run produced a different number...

  5. We formatted the XML so that every element starts on a new line, i.e. we add Environment.NewLine in C#. If we inspect the output, there is a \r\n before every XML element.

  6. Below is my Logstash config.

Input

Please note the max_lines setting. We need it because the description lines of an object can number between 7000 and 10000 on average.

    input {
        file {
            path => "batches.xml"
            start_position => beginning
            sincedb_path => "NUL"
            type => "xml"
            codec => multiline {
                pattern => "<batch>|<batch>\n|<batch>\r\n|<batch>\r"
                negate => true
                what => "previous"
                max_lines => 10000
                auto_flush_interval => 1
            }
        }
    }

Filters

    filter {
        if [message] == "<batches>" or [message] == "</batches>" or [message] =~ "<?xml version" or [message] =~ "batches>" or [message] =~ "batch_upload>" {
            drop {}
        }
        xml {
            store_xml => false
            source => "message"
            target => "message.parsed"
            xpath => [
                "/job/id/text()", batch_id,
                "/job/title/text()", batch_title,
                "/job/zipcode/text()", zipcode,
                "/job/application/country/text()", country
            ]
            force_array => false
        }

        mutate {
            remove_field => [ "path","host","type","tags"]
        }
        fingerprint {
            target => "uuid"
            method => "UUID"
        }
    }

Output

    output {
        elasticsearch {
            index => "batchesindex"
            document_type => "batches"
            hosts => "10.10.10.3:9200"
            manage_template => true
            template => "/data/test/config/elasticsearch-template1.json"
            template_overwrite => "true"
            document_id => "%{@timestamp}%{uuid}"
        }
    }

With this configuration I can load all the batches without any issue if I break the input down into small files holding 20 to 30 batches each, following the steps above and using the same config.

But when I try to load the whole file, I see that some events are not processed properly, i.e. the message starts at the <id>00010</id>... tag. In this case extra batch documents are indexed into ES.

Can you please help us with this issue? How can we make sure the batches are processed completely, without any data leaks, duplicates, or broken events?

@Badger / @magnusbaeck, can you please suggest something?
