How to handle XML file "Last ## unconsumed characters" error

Logstash has an issue parsing XML. It returns the error "Error parsing xml with XmlSimple...." and also "Last ## unconsumed characters". The Logstash config file is as follows:

input {
  file {
    path => "filepath"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  xml {
    source => "message"
    target => "theXML"
  }
}
output {
  elasticsearch {
    hosts => ["hostserver:9200"]
    index => "crtest"
  }
}

<?xml version="1.0" encoding="utf-8"?>
<PlanRequests>
   <Header export_date="25-Jun-2018 18:00" query="test_query extents"/>
   <ChgTestRev test_id="TD-TD-00000004" test_rev="001" type="TT-TD" status="TEst" url="https://test.some.com/test/#com.company.more.testfx.test.write.showObject;nid=tgERGEsdEWR" last_modified_date="21-Mar-2017 12:07">
      <Property name="test_data">TestData, Test Data</Property>
      <Property name="test_data2">TestData, Test Data/002</Property>
      <Property name="TDTest">TestData</Property>
      <Property name="TDTest2">TestData</Property>
      <Property name="TDTEst3">TestData</Property>
      <Property name="TDTest4"></Property>
      <Property name="test_data3"></Property>
      <Property name="test_data4">TestData</Property>
      <Property name="test_data5">TestData, TestData (testt)</Property>
      <Property name="test_data6">TestData, TestData (twett)</Property>
      <Property name="test_data7">TestData, TestData (ttett)</Property>
      <Property name="test_data8"></Property>
      <Property name="test_data9">TestData/TestData</Property>
      <Property name="test_data10"><![CDATA[<p>testingtest test test</p>

<p>Make sure <em><strong><span style="background-color:rgb(64, 224, 208)">spell check </span></strong></em>works</p>]]></Property>
      <Property name="test_data57"><![CDATA[<p>test test <span style="color:rgb(128, 0, 0)">make sure </span>spell check works, bolding et.&nbsp; -- ty</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h2 style="font-style: italic;">test again for <u><strong>allot of words </strong></u>and test</h2>]]></Property>
      <Property name="test_data11">TestData TestData</Property>
      <Property name="test_data12"></Property>
      <Property name="test_data13"></Property>
      <Property name="test_data14"></Property>
      <Property name="test_data15">TestData</Property>
      <Property name="test_data16">test.data@testing.com</Property>
      <Property name="test_data17">Test A. Testing</Property>
      <Property name="test_data18">Test test, test, test</Property>
      <Property name="test_data19"></Property>
      <Property name="test_data20">Test Data</Property>
      <Property name="test_data21">gsdgge</Property>
      <Property name="test_data22"></Property>
      <Property name="test_data23">TestData</Property>
      <Property name="test_data24"></Property>
      <Property name="test_data25">test, test, test, test</Property>
      <Property name="test_data26">test, test, test, test</Property>
      <Property name="test_data27"></Property>
      <Property name="test_data28">2003</Property>
      <Property name="test_data29">2005</Property>
      <Property name="test_data30"></Property>
      <Property name="test_data31"></Property>
      <Property name="test_data32">TestData, TestData (testt)</Property>
      <Property name="test_data33"></Property>
      <Property name="test_data34"></Property>
      <Property name="test_data35"></Property>
      <Property name="test_data36">TestData, TestData (testt)</Property>
      <Property name="test_data37"></Property>
      <Property name="test_data38">TEstDATa200</Property>
      <Property name="test_data39">test what happens if make this very long does it wrapt appropriately - spel chek does not work:test what happens if make this very long does it wrapt appropriately - spel chek does not work:test what happens if make this very long does it wrapt appropriately - spel chek does not work: test what happens if make this very long does it wrapt appropriately - spel chek does not work:test what happens if make this very long does it wrapt appropriately - spel chek does not work:test what happens if make this very long does it wrapt appropriately - spel chek does not work</Property>
      <Property name="test_data40"></Property>
      <Property name="test_data41"></Property>
      <Property name="test_data42"></Property>
      <Property name="test_data43"></Property>
      <Property name="test_data44"></Property>
      <Property name="test_data45"><![CDATA[25.25 - Test Data, 24.57 - Test Data, 24.57 - Test Data, 24.57 - Test Data, 25.25 - Test Data]]></Property>
      <Property name="test_data46"></Property>
      <Property name="test_data47"></Property>
      <TestImpact test_data_48="23-Mar-2017 12:07">
         <Program test_data49="TestData" test_data50="TestData" test_data51="1998" test_data52="1994" test_data53="Yes" test_data54="TDE" test_data55="Red" test_data56=""/>
      </TestImpact>
   </ChgTestRev>
</PlanRequests>

The above is an example of part of the XML file being ingested. The field names would be the "test_data" or "TestData" names, and the values would be either the quoted attribute values or the text between the opening and closing tags.

A file input in tail mode (the default) will consume a file one line at a time. You could change the file input to read mode, making sure you understand the default file_completion_action.
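As a rough sketch of the read-mode option, the input below switches the file input to read mode and overrides the completion action. The default `file_completion_action` is `delete`, which removes the source file once it has been read; `log` records the path of the finished file instead. The log path is a placeholder chosen for illustration, not something from the original post.

```conf
input {
  file {
    path => "filepath"
    mode => "read"
    # The default action in read mode is "delete"; "log" keeps the file
    # and writes its path to the completion log instead.
    file_completion_action => "log"
    # Placeholder path (an assumption); needed when the action is "log".
    file_completion_log_path => "/tmp/logstash_completed_files.log"
    sincedb_path => "/dev/null"
  }
}
```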

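Alternatively, use a multiline codec with a pattern that never matches, so that every line is appended to the previous one and the whole file becomes a single event. For example: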

codec => multiline {
    pattern => "^Spalanzani"
    negate => true
    what => previous
    auto_flush_interval => 1
}
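For context, the codec is an option of the file input rather than a filter. A minimal sketch of where it would sit, reusing the input settings from the original question:

```conf
input {
  file {
    path => "filepath"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    # No line starts with "Spalanzani", so with negate => true every line
    # is treated as a continuation of the previous one.
    codec => multiline {
      pattern => "^Spalanzani"
      negate => true
      what => previous
      # Emit the accumulated event after 1 second without new lines.
      auto_flush_interval => 1
    }
  }
}
```

For large files it may be worth checking the codec's `max_lines` and `max_bytes` settings, since the defaults (500 lines and 10 MiB, as far as I know) can cut an event short.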

Hi @Badger,

Sorry, I have a few questions. The codec option seemed to work; however, everything is being pulled in as one whole document, when it should be pulled in as multiple objects (not for this file, but for one with more objects like the one above), if that makes sense. Is there a way to split or separate them?

Also, is it possible to not include the message field, since it pretty much repeats the whole doc over again? It would just be wasted space, since the data is already being pulled into those fields. The store_xml option didn't work.

The Properties also come in as an array of name (field) and content (value) pairs. Is it possible to have these pulled in not as an array of pairs, but with each name as a field of the doc and its content as the corresponding value?

Does the pattern control which doc or part of the XML file is included? In other words, is there a way to filter out objects in the XML document that one does not want included, based on a field?

You really need to provide an example with your questions. Otherwise it is unclear exactly what you are asking.

When you say more than one object, an xml filter cannot parse an event like

<a>1</a><a>2</a>

It will fail with "attempted adding second root element to document". If you have XML like

<a><b>1</b><b>2</b></a>

Then the xml filter will create an array which you can split.

    xml { source => "message" store_xml => true target => "theXML" remove_field => [ "message" ] }
    split { field => "[theXML][b]" }

results in

{
          "path" => "/home/user/foo.txt",
    "@timestamp" => 2019-07-03T17:02:06.239Z,
      "@version" => "1",
        "theXML" => {
        "b" => "1"
    }
}
{
          "path" => "/home/user/foo.txt",
    "@timestamp" => 2019-07-03T17:02:06.239Z,
      "@version" => "1",
        "theXML" => {
        "b" => "2"
    }
}
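Applied to the sample file from the original question, the same approach might look like the sketch below. It assumes the whole file arrives as a single event with one `<PlanRequests>` root that contains repeated `<ChgTestRev>` elements, and that the xml filter drops the root element from the target, as it does with `<a>` in the example above.

```conf
filter {
  xml {
    source => "message"
    store_xml => true
    target => "theXML"
    remove_field => [ "message" ]
  }
  # One event per <ChgTestRev> element found under the root.
  split { field => "[theXML][ChgTestRev]" }
}
```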

So if you ask a more specific question I will try to answer it.

So as an example:

 (1)  <PlanRequests>
         ....<field1>
         ....<field2>
         ....<field3>
         ....<field4>
   </PlanRequests>
 (2)  <PlanRequests>
         ....<field1>
         ....<field2>
         ....<field3>
         ....<field4>
   </PlanRequests>

I mean splitting these up, with (1) as one object, (2) as another, and so on. They have the same fields but are separate objects. I'm trying to get each of these in as its own doc within an index.

If you have a file that looks like

<?xml version="1.0" encoding="utf-8"?>
<PlanRequests>
   <Header export_date="25-Jun-2018 18:00" query="test_query extents"/>
</PlanRequests>
<PlanRequests>
   <Header export_date="26-Jun-2018 18:00" query="foo"/>
</PlanRequests>

then use a multiline codec to read each object

codec => multiline { pattern => "</PlanRequests>" negate => true what => next auto_flush_interval => 1 }
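Put together with the xml filter, a pipeline along these lines should produce one event per `<PlanRequests>` block. This is a sketch that reuses the input settings from the original question, not a tested configuration.

```conf
input {
  file {
    path => "filepath"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    # Lines that do not contain </PlanRequests> are held and joined with the
    # following line; the line with the closing tag completes the event.
    codec => multiline {
      pattern => "</PlanRequests>"
      negate => true
      what => next
      auto_flush_interval => 1
    }
  }
}
filter {
  xml {
    source => "message"
    target => "theXML"
    # Only applied when parsing succeeds, so failed events keep their message.
    remove_field => [ "message" ]
  }
}
```

With the sample file above, the first event would hold the XML declaration plus the first block, and the second event would hold the second block.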

Hi Badger,

It's working and pulling data way better than before. Thank you. There is only one more issue: the beginning <?xml version...> line and an extra line or two that are part of the XML, but not part of the separate objects, create errors and pull in an extra object with the tag "xmlParseFailure". When these lines are removed, the errors don't occur and all the data pulled in is valid. Removing these lines manually would be tedious, since it would need to be done often. Is there any way to remove or ignore these lines when ingesting this or any XML file?

The '<?xml version...' does not cause a parsing problem in the example I gave. If you have extra text that does cause a problem then you could use mutate+gsub to remove it.

The lines include an opening tag with a different name at the beginning of the XML file and a closing tag at the end, plus an extra header that is not part of any of the objects. mutate+gsub takes field names, right? So I wouldn't be able to remove those 4 lines with it, would I?

I would expect mutate+gsub to be able to do it. For example, to remove the <?xml version... you could use

    mutate { gsub => [ "message", "^<\?xml[^
]+
", "" ] }

Use literal newlines inside the pattern. The pattern matches the start of a line, followed by <?xml, followed by one or more non-newline characters, followed by a newline.
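The same idea could be extended to the wrapper tags, as in the sketch below. The element name `Export` is hypothetical, since the thread never names the real wrapper, and the extra header line would need its own pattern. The mutate has to run before the xml filter so that the parser only sees the cleaned text.

```conf
filter {
  mutate {
    # "Export" is a hypothetical wrapper element name; substitute the real one.
    # Removes both the opening <Export ...> tag and the closing </Export> tag.
    gsub => [ "message", "</?Export[^>]*>", "" ]
  }
  # The closing wrapper tag ends up in an event of its own; once it is
  # stripped only whitespace remains, so that event is discarded.
  if [message] =~ /\A\s*\z/ {
    drop { }
  }
  xml {
    source => "message"
    target => "theXML"
  }
}
```

The conditional uses `\A` and `\z` rather than `^` and `$` so that it only matches events that are empty as a whole, not events that merely contain a blank line.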