I'm relatively new to Logstash and just trying to understand how the underlying framework operates.
I am using 5.5.0.
I am parsing an XML file containing 4000 unique records. This is the gist of my schema (there are many more fields, all of which are extracted correctly by the xml filter with no problem):
<Document>
  <Event>
    <name></name>
    <description></description>
    <timestamp></timestamp>
    <lla></lla>
  </Event>
  <Event>
    <name></name>
    <description></description>
    <timestamp></timestamp>
    <lla></lla>
  </Event>
</Document>
Initially I read in the whole document at once, but when I tried to split on my Event element, the object that came back was outrageously complex. So I modified my input to read line by line, detecting when a line belongs to an Event so that it can continue through the pipeline. This works great: data is aggregated and processed correctly. I can see that each line is processed one by one, with the appropriate regexes applied to determine how the line should be handled. This is a common pattern; I have written log listeners that operate in the same manner, where listeners are notified of various system events by monitoring the logs.
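For reference, the line-by-line input with event detection looks roughly like this (the path and pattern are placeholders, not my actual config; the multiline codec folds the lines of one Event into a single message):

```
input {
  file {
    path => "/data/events.xml"        # placeholder path
    start_position => "beginning"
    sincedb_path => "/dev/null"       # re-read the file on each test run
    codec => multiline {
      pattern => "<Event>"            # a new record starts here
      negate => true
      what => "previous"              # fold other lines into the current Event
    }
  }
}
```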
Here's my dilemma. Probably 40% of the time I get 4000 objects (the correct number) directed to stdout. The rest of the time I get 4001 to 4003. I am still working through the logs, but I can see that each line appears to be processed by the input stage only once (or at least I only see it once).
When I check the output sent to stdout in the cases where the record count does not match the file, I can see at least one duplicated record, typically back to back. I know this is wrong because there are no duplicates in the source, and also because of my understanding of how the pipeline works.
My input is standard: reading line by line from an XML file, matched against my defined patterns. I use a handful of filters: the xml filter with xpath to extract the data, a ruby filter to perform some validation and add derived fields, and mutate to remove unnecessary fields/tags. For now the output is written to stdout; I detected this problem while verifying the output, as my Kafka consumer received more than 4000 records.
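As a rough skeleton of the filter/output stages I'm describing (the xpath expressions, field names, and ruby body below are illustrative placeholders, not my real config):

```
filter {
  xml {
    source    => "message"
    store_xml => false
    xpath     => [
      "/Event/name/text()",      "name",
      "/Event/timestamp/text()", "timestamp"
    ]
  }
  ruby {
    # validation and derived fields go here (omitted)
    code => "# ..."
  }
  mutate {
    remove_field => [ "message" ]   # example of stripping unneeded fields
  }
}

output {
  stdout { codec => rubydebug }
}
```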
Obviously my functions are invoked by the framework through its interface, basically IoC, but I am genuinely stumped as to what in the configuration file could cause this to happen.
Has anyone seen this before?