Parsing XML files with Logstash in Windows

Hi,

I'm trying to parse XML files on Windows, under the C:\busesdata folder, using Logstash 2.3.3 and send them to Elasticsearch 1.7.1.
I kept getting errors and fixing them until Logstash started without errors (or so I think), but it still appears blocked.

The following is the XML structure I want to parse:

<HF_DOCUMENT>
	<H_Ticket>
		<IDH_Ticket>1</IDH_Ticket>
		<CodeBus>186</CodeBus>
		<CodeCh>5531</CodeCh>
		<CodeConv>5531</CodeConv>
		<Codeligne>12</Codeligne>
		<Date>20150903</Date>
		<Heur>1101</Heur>
		<NomFR1>SOUK AHAD</NomFR1>
		<NomFR2>SOVIVA </NomFR2>
		<Prix>0.66</Prix>
		<IDTicket>1</IDTicket>
		<CodeRoute>107</CodeRoute>
		<origine>01</origine>
		<Distination>07</Distination>
		<Num>3</Num>
		<Ligne>107</Ligne>
		<requisition> </requisition>
		<voyage>0</voyage>
		<faveur> </faveur>
	</H_Ticket>
</HF_DOCUMENT>

And here is my logstash.conf file:

input {
  file {
    path => "C:\busesdata\*.xml"
    start_position => "beginning"
    type => "ticket"
    codec => multiline {
      pattern => "^<\?H_Ticket .*\>"
      negate => true
      what => "previous"
    }
  }
}
filter {
  xml {
    source => "ticket"
    xpath => [
      "/ticket/IDH_Ticket/text()", "ticketId",
      "/ticket/CodeBus/text()", "codeBus",
      "/ticket/CodeCh/text()", "codeCh",
      "/ticket/CodeConv/text()", "codeConv",
      "/ticket/Codeligne/text()", "codeLigne",
      "/ticket/Date/text()", "date",
      "/ticket/Heur/text()", "heure",
      "/ticket/NomFR1/text()", "nomFR1",
      "/ticket/NomFR2/text()", "nomFR2",
      "/ticket/Prix/text()", "prix",
      "/ticket/IDTicket/text()", "idTicket",
      "/ticket/CodeRoute/text()", "codeRoute",
      "/ticket/origine/text()", "origine",
      "/ticket/Distination/text()", "destination",
      "/ticket/Num/text()", "num",
      "/ticket/Ligne/text()", "ligne",
      "/ticket/requisition/text()", "requisition",
      "/ticket/voyage/text()", "voyage",
      "/ticket/faveur/text()", "faveur"
    ]
    store_xml => true
    target => "doc"
  }
}

output {
  elasticsearch {
    hosts => "localhost"
    index => "buses"
    document_type => "ticket"
  }
  file {
    path => "C:\busesdata\logstash.log"
  }
}

I have no idea what is causing this, and I don't even know how to access the Logstash logs in a Windows environment.

Any help would be appreciated.

You'll probably want to set sincedb_path => "nul" for your file input so that it never tries to start where it left off. Also, make sure you adjust ignore_older if any of your input files might be older than 24 hours.
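As a sketch, the file input could look like this (assuming Logstash 2.x's file input options; `nul` is the Windows equivalent of `/dev/null`, and the `ignore_older` value is just an illustrative large number):

```conf
input {
  file {
    path => "C:/busesdata/*.xml"   # forward slashes tend to be safer for globs on Windows
    start_position => "beginning"
    type => "ticket"
    sincedb_path => "nul"          # never persist the read position
    ignore_older => 31536000       # one year, so files older than the 24 h default aren't skipped
  }
}
```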

The problem could also be that Logstash is waiting for the next line that would complete the current event, but that never happens because of how the multiline codec has been configured.
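For instance, the pattern in the configuration above (`^<\?H_Ticket .*\>`) expects a literal `<?H_Ticket ...` opener, which never appears in the sample data, so no line ever matches and no event is flushed. Here is a sketch of a codec that starts a new event at each opening `<H_Ticket>` tag instead (the pattern is an assumption based on the sample XML):

```conf
codec => multiline {
  # any line that does NOT start a new <H_Ticket> belongs to the previous event
  pattern => "^\s*<H_Ticket>"
  negate => true
  what => "previous"
}
```

With events grouped this way, the raw XML ends up in the `message` field, so the xml filter would likely need `source => "message"`, and the XPath expressions would be rooted at `/H_Ticket/...` rather than `/ticket/...`.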

In my use case, I need to catch every modification to these XML files, not start from the beginning every time.
How might it be configured to do that?

Oh, so the files will contain multiple XML documents?

<HF_DOCUMENT>
  ...
</HF_DOCUMENT>
<HF_DOCUMENT>
  ...
</HF_DOCUMENT>
...

That should work fine, but keep in mind that Logstash will have problems emitting the last event of the file since you're using the start of the next element as the signal to emit the currently buffered event. There's probably a GitHub issue tracking this. I don't recall the details.
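If the bundled multiline codec is recent enough to support it, `auto_flush_interval` can work around this by emitting the buffered event after a period with no new lines, instead of waiting for a next match that never comes (a sketch; the pattern is assumed from the sample XML):

```conf
codec => multiline {
  pattern => "^\s*<H_Ticket>"
  negate => true
  what => "previous"
  auto_flush_interval => 2   # flush the buffered event after 2 seconds of inactivity
}
```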

The files contain something like this:

<HF_DOCUMENT>
	<H_Ticket>
		<IDH_Ticket>1</IDH_Ticket>
		<CodeBus>186</CodeBus>
		<CodeCh>5531</CodeCh>
		<CodeConv>5531</CodeConv>
		<Codeligne>12</Codeligne>
		<Date>20150903</Date>
		<Heur>1101</Heur>
		<NomFR1>SOUK AHAD</NomFR1>
		<NomFR2>SOVIVA </NomFR2>
		<Prix>0.66</Prix>
		<IDTicket>1</IDTicket>
		<CodeRoute>107</CodeRoute>
		<origine>01</origine>
		<Distination>07</Distination>
		<Num>3</Num>
		<Ligne>107</Ligne>
		<requisition> </requisition>
		<voyage>0</voyage>
		<faveur> </faveur>
	</H_Ticket>
	<H_Ticket>
		<IDH_Ticket>1</IDH_Ticket>
		<CodeBus>186</CodeBus>
		<CodeCh>5531</CodeCh>
		<CodeConv>5531</CodeConv>
		<Codeligne>12</Codeligne>
		<Date>20150903</Date>
		<Heur>1101</Heur>
		<NomFR1>SOUK AHAD</NomFR1>
		<NomFR2>SOVIVA </NomFR2>
		<Prix>0.66</Prix>
		<IDTicket>1</IDTicket>
		<CodeRoute>107</CodeRoute>
		<origine>01</origine>
		<Distination>07</Distination>
		<Num>3</Num>
		<Ligne>107</Ligne>
		<requisition> </requisition>
		<voyage>0</voyage>
		<faveur> </faveur>
	</H_Ticket>
	<H_Ticket>
		<IDH_Ticket>1</IDH_Ticket>
		<CodeBus>186</CodeBus>
		<CodeCh>5531</CodeCh>
		<CodeConv>5531</CodeConv>
		<Codeligne>12</Codeligne>
		<Date>20150903</Date>
		<Heur>1101</Heur>
		<NomFR1>SOUK AHAD</NomFR1>
		<NomFR2>SOVIVA </NomFR2>
		<Prix>0.66</Prix>
		<IDTicket>1</IDTicket>
		<CodeRoute>107</CodeRoute>
		<origine>01</origine>
		<Distination>07</Distination>
		<Num>3</Num>
		<Ligne>107</Ligne>
		<requisition> </requisition>
		<voyage>0</voyage>
		<faveur> </faveur>
	</H_Ticket>
</HF_DOCUMENT>

If I want to keep track of my files, how can I do that using sincedb (i.e., the contents of the sincedb_path file)?

Actually, I did manage to read the XML files using Filebeat and successfully send them to Logstash, and from there on to Elasticsearch, but this time the filter block failed to separate the individual XML fields.

I asked about this on Stack Overflow, if you want the details: http://stackoverflow.com/questions/38150042/parsing-xml-data-from-filebeat-using-logstash

It works by using Filebeat as a middleman that reads the XML entries first and ships them to Logstash for further parsing before forwarding to Elasticsearch.
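A minimal Filebeat-side sketch of that setup, assuming a Filebeat 1.x YAML layout and the same `<H_Ticket>` grouping as above (the port and field values are assumptions, not taken from the original post):

```yaml
filebeat:
  prospectors:
    -
      paths:
        - C:\busesdata\*.xml
      input_type: log
      multiline:
        pattern: '^\s*<H_Ticket>'   # a new event starts at each opening tag
        negate: true
        match: after                # equivalent of the codec's what => "previous"
      document_type: ticket
output:
  logstash:
    hosts: ["localhost:5044"]
```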