Hi,
I am consuming large (~600 MB) XML files in which all of the data sits on a single line. I can (mostly) ingest the data with Logstash, using xpath to extract fields from the XML, but I am stuck on an issue I can't get around.
Every time I ingest data, some records are missing in Elasticsearch. I have created a smaller version of the XML file (purely to make the data easier to handle) and I can reproduce the missing records (10 in this case) every time. When I set the xml filter option parse_options => "strict", I found the reason in the logs and in a separate index I created to hold bad data. Two XML records are being consumed as one line item by Logstash, and Logstash throws an XML validation error (adding the _xmlparsefailure tag to the record in the index).
The error thrown is below (for ease of reading I have removed most of the XML content; only the delimiter and one other tag remain):
[2020-09-09T16:20:28,200][WARN ][logstash.filters.xml ][xml_delta]
[d5db39a22f47f58100f7ef6053298bbdb6f09208d47b5732fe2c12913d551919]
Error parsing xml {:source=>"message", :value=>
"<FinInstrm><ModfdRcrd>...</ModfdRcrd></FinInstrm><FinInstrm><ModfdRcrd>...</ModfdRcrd></FinInstrm>",
:exception=>#<Nokogiri::XML::SyntaxError: The markup in the document following the root element must be well-formed.>,
If you look at the raw XML data, there appears to be no difference between these records and the surrounding ones. To make sure the XML I am consuming (from a third-party provider) is good XML, I ran the file through the Linux command-line tool xmlstarlet, which doesn't complain. There are no duplicate records in the data; I have checked many, many times. Given the delimiter I am using in the Logstash file input, the records should not have been merged together, but the message field of the record in the "bad" index contains two of these delimiters.
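For anyone who wants to cross-check the file outside Logstash, here is a minimal Python sketch of how I would expect delimiter splitting over fixed-size reads to behave, including a delimiter that straddles two reads. It is not Logstash's implementation; the function name split_records, the chunk_size parameter and its default value are made up for illustration.

```python
import io


def split_records(stream, delimiter="</FinInstrm>", chunk_size=32768):
    """Yield the text before each delimiter, reading the stream in fixed-size blocks.

    Hypothetical helper for illustration only; chunk_size is an arbitrary value.
    """
    buffer = ""
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        buffer += block
        # Everything before the last delimiter found is a complete record;
        # the remainder (possibly a partial delimiter) stays in the buffer.
        *records, buffer = buffer.split(delimiter)
        yield from records
    # Whatever is left has no closing delimiter (e.g. the document's trailing tags).
    if buffer:
        yield buffer


# Tiny in-memory example; a real run would pass an open file object instead.
sample = "<FinInstrm>A</FinInstrm><FinInstrm>B</FinInstrm></root>"
for record in split_records(io.StringIO(sample), chunk_size=5):
    print(repr(record))
```

With this kind of splitting, no yielded chunk can contain the delimiter itself, which is what I expected from the file input as well.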
My configuration is below. I am only showing the pieces I think are relevant, but I am happy to share more if needed. It's quite large, with many xpaths and a few more filters where I copy fields around; I don't think the issue lies there, though. And the thing is, this configuration works for 499,845 records out of 500,000!
input {
  file {
    path => "/data/XMLS/DELTA/DLTINS_*.xml"
    start_position => "beginning"
    mode => "read"
    delimiter => "</FinInstrm>"
    type => "dltins"
    enable_metric => true
  }
}
filter {
  mutate {
    replace => { "message" => "%{message}</FinInstrm>" }
  }
  xml {
    target => "doc"
    source => "message"
    store_xml => false
    parse_options => "strict"
    enable_metric => true
    xpath =>
    [
      ...
      ...
      ...
    ]
  }
}
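The mutate and strict xml steps in the configuration above can be approximated outside Logstash. This is a rough Python sketch under my own assumptions, not what Logstash does internally: find_bad_records is a hypothetical helper, and it uses Python's xml.etree.ElementTree rather than Nokogiri, so the error text will differ.

```python
import xml.etree.ElementTree as ET


def find_bad_records(messages, delimiter="</FinInstrm>"):
    """Return (index, error) pairs for messages that are not one well-formed element.

    Hypothetical helper for illustration only.
    """
    failures = []
    for index, message in enumerate(messages):
        # Same idea as the mutate filter: put back the closing tag
        # that the delimiter split removed.
        candidate = message + delimiter
        try:
            ET.fromstring(candidate)
        except ET.ParseError as error:
            failures.append((index, str(error)))
    return failures


# Made-up messages: the second one mimics two records merged into one event.
messages = [
    "<FinInstrm><ModfdRcrd>1</ModfdRcrd>",
    "<FinInstrm><ModfdRcrd>2</ModfdRcrd></FinInstrm><FinInstrm><ModfdRcrd>3</ModfdRcrd>",
]
for index, error in find_bad_records(messages):
    print(index, error)
```

On a real file, the first and last chunks would presumably fail this check too, since they carry the document's wrapper tags rather than a single complete record.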
Any ideas or thoughts?
Thanks,
Steve.