Logstash missing XML records

Hi,

I am consuming large (~600 MB) XML files where all of the data sits on a single line. I can (mostly) consume the data with Logstash, using xpath to extract fields from the XML, but I am stuck on an issue I can't get around.

Every time I ingest the data, records are missing from Elasticsearch. I created a smaller version of the XML file (purely to make the data easier to handle) and I can reproduce the missing records (10 in this case) every time. When I used the xml filter option parse_options => "strict", I found the reason in the logs and in a separate index I created to hold bad data: two XML records are being consumed as a single event by Logstash, and an XML validation error is thrown (adding the _xmlparsefailure tag to the record in the index).
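For reference, the routing to the bad-data index is nothing special - just a conditional on the failure tag in the output section, along these lines (the hosts and index names here are placeholders, not my real ones):

    output {
        if "_xmlparsefailure" in [tags] {
            elasticsearch {
                hosts => ["localhost:9200"]   # placeholder
                index => "dltins_bad"         # placeholder bad-data index
            }
        } else {
            elasticsearch {
                hosts => ["localhost:9200"]   # placeholder
                index => "dltins"             # placeholder main index
            }
        }
    }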

The error thrown is below (for ease of reading I have removed most of the XML content; the delimiter and one other tag remain):

    [2020-09-09T16:20:28,200][WARN ][logstash.filters.xml     ][xml_delta]
    [d5db39a22f47f58100f7ef6053298bbdb6f09208d47b5732fe2c12913d551919] 
    Error parsing xml {:source=>"message", :value=>
    "<FinInstrm><ModfdRcrd>...</ModfdRcrd></FinInstrm><FinInstrm><ModfdRcrd>...</ModfdRcrd></FinInstrm>", 
    :exception=>#<Nokogiri::XML::SyntaxError: The markup in the document following the root element must be well-formed.>,

Looking at the raw XML data, there appears to be no difference between these and the surrounding XML records. To make sure the XML I am consuming (from a third-party provider) is well-formed, I ran the file through the Linux command line tool xmlstarlet, which doesn't complain. There are no duplicate records in the data; I have checked many, many times. Given the delimiter I am using in the Logstash file input, the records should not have been merged together, yet the message field of the failed event in the "bad" index contains two of these delimiters.
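For what it's worth, the merged events can be tagged as they come in: the file input strips the delimiter, so a correctly split event should not contain "</FinInstrm>" at all until the mutate in my config re-appends it. A sketch of the check, placed ahead of that mutate (the tag name is just one I made up):

    filter {
        # the file input strips the delimiter, so a correctly split
        # event contains no "</FinInstrm>"; if one is present, two
        # records were merged into a single event
        if "</FinInstrm>" in [message] {
            mutate { add_tag => ["merged_records"] }   # made-up tag name
        }
    }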

My configuration is below. I am only showing the pieces I think are relevant, but I am happy to share more if needed - it's quite large, with many xpaths and a few more filters where I copy fields around - I don't think the issue lies there though. And the thing is, this configuration works for 499,845 records out of 500,000!

    input {
        file {
            path => "/data/XMLS/DELTA/DLTINS_*.xml"
            start_position => "beginning"
            mode => "read"
            # each "</FinInstrm>" ends an event; the file input strips
            # the delimiter itself from the event
            delimiter => "</FinInstrm>"
            type => "dltins"
            enable_metric => true
        }
    }

    filter {
        # re-append the closing tag that the file input strips
        # when splitting on the delimiter
        mutate {
            replace => { "message" => "%{message}</FinInstrm>" }
        }
        xml {
            target => "doc"
            source => "message"
            store_xml => false
            parse_options => "strict"
            enable_metric => true
            xpath =>
            [
                ...
                ...
                ...
            ]
        }
    }
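In case it helps, one band-aid I am considering (not a root-cause fix) is to split merged events back apart between the mutate and the xml filter. A rough sketch - note the "\n" escape is only honoured if config.support_escapes is set to true in logstash.yml:

    filter {
        # workaround sketch: if two records were merged into one event,
        # break them apart again and emit one event per record
        if "</FinInstrm><FinInstrm>" in [message] {
            mutate {
                # insert a newline between the two records; the "\n"
                # escape needs config.support_escapes: true in logstash.yml
                gsub => [ "message", "</FinInstrm><FinInstrm>", "</FinInstrm>\n<FinInstrm>" ]
            }
            # the split filter's default terminator is a newline,
            # so each record becomes its own event
            split { field => "message" }
        }
    }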

Any ideas/thoughts?

Thanks,

Steve.
