Elastic has an issue parsing XML. It returns the error "Error parsing xml with XmlSimple...." and also "Last ## unconsumed characters". The Logstash config file is as follows:
<?xml version="1.0" encoding="utf-8"?>
<PlanRequests>
<Header export_date="25-Jun-2018 18:00" query="test_query extents"/>
<ChgTestRev test_id="TD-TD-00000004" test_rev="001" type="TT-TD" status="TEst" url="https://test.some.com/test/#com.company.more.testfx.test.write.showObject;nid=tgERGEsdEWR" last_modified_date="21-Mar-2017 12:07">
<Property name="test_data">TestData, Test Data</Property>
<Property name="test_data2">TestData, Test Data/002</Property>
<Property name="TDTest">TestData</Property>
<Property name="TDTest2">TestData</Property>
<Property name="TDTEst3">TestData</Property>
<Property name="TDTest4"></Property>
<Property name="test_data3"></Property>
<Property name="test_data4">TestData</Property>
<Property name="test_data5">TestData, TestData (testt)</Property>
<Property name="test_data6">TestData, TestData (twett)</Property>
<Property name="test_data7">TestData, TestData (ttett)</Property>
<Property name="test_data8"></Property>
<Property name="test_data9">TestData/TestData</Property>
<Property name="test_data10"><![CDATA[<p>testingtest test test</p>
<p>Make sure <em><strong><span style="background-color:rgb(64, 224, 208)">spell check </span></strong></em>works</p>]]></Property>
<Property name="test_data57"><![CDATA[<p>test test <span style="color:rgb(128, 0, 0)">make sure </span>spell check works, bolding et. -- ty</p>
<p> </p>
<p> </p>
<p> </p>
<h2 style="font-style: italic;">test again for <u><strong>allot of words </strong></u>and test</h2>]]></Property>
<Property name="test_data11">TestData TestData</Property>
<Property name="test_data12"></Property>
<Property name="test_data13"></Property>
<Property name="test_data14"></Property>
<Property name="test_data15">TestData</Property>
<Property name="test_data16">test.data@testing.com</Property>
<Property name="test_data17">Test A. Testing</Property>
<Property name="test_data18">Test test, test, test</Property>
<Property name="test_data19"></Property>
<Property name="test_data20">Test Data</Property>
<Property name="test_data21">gsdgge</Property>
<Property name="test_data22"></Property>
<Property name="test_data23">TestData</Property>
<Property name="test_data24"></Property>
<Property name="test_data25">test, test, test, test</Property>
<Property name="test_data26">test, test, test, test</Property>
<Property name="test_data27"></Property>
<Property name="test_data28">2003</Property>
<Property name="test_data29">2005</Property>
<Property name="test_data30"></Property>
<Property name="test_data31"></Property>
<Property name="test_data32">TestData, TestData (testt)</Property>
<Property name="test_data33"></Property>
<Property name="test_data34"></Property>
<Property name="test_data35"></Property>
<Property name="test_data36">TestData, TestData (testt)</Property>
<Property name="test_data37"></Property>
<Property name="test_data38">TEstDATa200</Property>
<Property name="test_data39">test what happens if make this very long does it wrapt appropriately - spel chek does not work:test what happens if make this very long does it wrapt appropriately - spel chek does not work:test what happens if make this very long does it wrapt appropriately - spel chek does not work: test what happens if make this very long does it wrapt appropriately - spel chek does not work:test what happens if make this very long does it wrapt appropriately - spel chek does not work:test what happens if make this very long does it wrapt appropriately - spel chek does not work</Property>
<Property name="test_data40"></Property>
<Property name="test_data41"></Property>
<Property name="test_data42"></Property>
<Property name="test_data43"></Property>
<Property name="test_data44"></Property>
<Property name="test_data45"><![CDATA[25.25 - Test Data, 24.57 - Test Data, 24.57 - Test Data, 24.57 - Test Data, 25.25 - Test Data]]></Property>
<Property name="test_data46"></Property>
<Property name="test_data47"></Property>
<TestImpact test_data_48="23-Mar-2017 12:07">
<Program test_data49="TestData" test_data50="TestData" test_data51="1998" test_data52="1994" test_data53="Yes" test_data54="TDE" test_data55="Red" test_data56=""/>
</TestImpact>
</ChgTestRev>
</PlanRequests>
The above is an example of part of the XML file being ingested. The field names would be the "test_data" or "TestData" names, and the values would be either the quoted attribute values assigned to them or the text between the opening and closing tags.
A file input in tail mode (the default) will consume a file one line at a time. You could change the file input to read mode, making sure you understand the default file_completed_action (which is to delete the file once it has been read).
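For reference, a minimal sketch of what a read-mode file input could look like. This is not the poster's actual config: both paths are hypothetical placeholders, and choosing "log" as the completed action is an assumption made here so that the source file is kept.

```
input {
  file {
    # Hypothetical path; point this at the real XML export.
    path => "/path/to/export.xml"
    mode => "read"
    # In read mode the default file_completed_action is "delete".
    # "log" keeps the file and records its path in the log file below.
    file_completed_action => "log"
    file_completed_log_path => "/path/to/completed_files.log"
    # Do not remember read positions between runs (handy while testing).
    sincedb_path => "/dev/null"
  }
}
```

Whether read mode on its own gives the event boundaries you want is worth checking against your Logstash version before relying on it.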
Or else use a multiline codec with a pattern that never matches, so that every line of the file is appended to the same event.
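A sketch of the never-matching-pattern approach, assuming a hypothetical file path and a hypothetical target field called "parsed". The pattern text itself is arbitrary; it only has to be something that no line in the file starts with.

```
input {
  file {
    # Hypothetical path to the XML export.
    path => "/path/to/export.xml"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      # Placeholder pattern chosen so that no line ever matches it.
      pattern => "^ThisNeverMatches"
      # With negate => true, every non-matching line is appended
      # to the previous one, so the whole file becomes one event.
      negate => true
      what => "previous"
      # Flush the pending event after 2 seconds without new lines.
      auto_flush_interval => 2
    }
  }
}
filter {
  xml {
    source => "message"
    # "parsed" is an assumed field name for the parsed document.
    target => "parsed"
  }
}
```

For large files, the codec's max_lines and max_bytes limits may also need raising, since the whole file ends up in a single event.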
Sorry, I have a few questions. The codec options seemed to work. However, everything is being pulled in as one whole document, whereas it should be pulled in as multiple objects (not for this file, but for one with more objects like the one above), if that makes sense. Is there a way to split or separate them?
Also, is it possible to not include the message field, since it seems to pretty much repeat the whole doc over again? It would just be wasted space, since the data is already being pulled into those fields. The store_xml option didn't work.
The Properties also come in as an array of name (field) and content (value) pairs. Is it possible to have these pulled in not as an array of pairs, but with each name as a field of the doc holding its corresponding content (value)?
Does the pattern control which doc or part of the XML file is included or not? In other words, is there a way to filter out objects in the XML document that one does not want included, based on a field?
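One possible way to approach these questions, sketched under assumptions. It assumes each event holds a single ChgTestRev element parsed into a hypothetical "parsed" field, so that the Property array sits at [parsed][Property] and the element's attributes (such as status) sit directly under [parsed]. The status value used in the conditional is only an illustration taken from the sample.

```
filter {
  xml {
    source => "message"
    target => "parsed"
    # Applied only when the XML parses successfully.
    remove_field => ["message"]
  }
  ruby {
    code => '
      props = event.get("[parsed][Property]")
      if props.is_a?(Array)
        props.each do |prop|
          next unless prop.is_a?(Hash) && prop["name"]
          # Empty elements carry no "content" key; skip them.
          next if prop["content"].nil?
          # Promote each name/content pair to a top-level field.
          event.set(prop["name"], prop["content"])
        end
        event.remove("[parsed][Property]")
      end
    '
  }
  # Drop objects that should not be indexed, based on a field value.
  if [parsed][status] == "TEst" {
    drop { }
  }
}
```

If the whole file is still one event, the field paths will be deeper than shown here and would need adjusting.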
I mean like splitting these up: having (1) as one object, (2) as another, and so on. They have the same fields but are separate objects. I'm trying to get both, or rather all of them, as their own docs within an index.
It's working and pulling data way better than it was before. Thank you. There is only one more issue: the beginning <?xml version...> and an extra line or two that are part of the XML but not of the separate objects create errors and pull in an extra object with the tag "xmlParseFailure". When these lines are removed, the errors don't occur and all data pulled is valid. Having to remove these lines manually would be tedious, since it would need to be done often. Is there any way to remove these lines or ignore them when ingesting this or any XML file?
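A sketch of one way to get one event per object: let the multiline codec start a new event at every line that opens an object. It assumes, as in the sample, that each object begins on its own line with `<ChgTestRev `; the path is a hypothetical placeholder.

```
input {
  file {
    # Hypothetical path to the XML export.
    path => "/path/to/export.xml"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      # A line starting with <ChgTestRev begins a new event;
      # every other line is appended to the event before it.
      pattern => "^<ChgTestRev "
      negate => true
      what => "previous"
      auto_flush_interval => 2
    }
  }
}
```

With this pattern, any lines before the first `<ChgTestRev ` form an event of their own, and a closing wrapper tag at the end of the file is appended to the last object.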
The '<?xml version...' does not cause a parsing problem in the example I gave. If you have extra text that does cause a problem then you could use mutate+gsub to remove it.
The lines include an opening tag with a different name at the beginning of the XML file and a closing tag at the end, plus an extra header for it that is not part of any of the objects. The mutate+gsub takes field names, right? So I wouldn't be able to remove those 4 lines with it, would I?
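Since the text of each event lives in the message field, gsub can be pointed at that field. A sketch, assuming the wrapper lines look like the ones in the sample (the XML declaration, the PlanRequests open and close tags, and the self-closing Header element) and that the mutate runs before the xml filter:

```
filter {
  mutate {
    # Each triple is: field name, regex, replacement.
    gsub => [
      "message", "<[?]xml[^>]*[?]>", "",
      "message", "</?PlanRequests>", "",
      "message", "<Header[^>]*/>", ""
    ]
  }
  # An event that held only wrapper lines is now blank; discard it.
  if [message] !~ /\S/ {
    drop { }
  }
  xml {
    source => "message"
    # "parsed" is an assumed field name for the parsed object.
    target => "parsed"
  }
}
```

The character classes around the question marks are used only to avoid backslash escapes inside the config strings.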