Context: I need to put several million XML invoices into Elasticsearch and query and aggregate over their content.
Proposed solution: use the Logstash xml filter plugin to map the XML to JSON.
First question: For my context, would this be the best solution?
If so, second question: can I create the filter definition automatically from an XSD describing the invoices? The XSD contains about 500 possible fields, and I will have to create about 10 different indices.
Well, if you need to use xpath, I suppose in theory you could use XSLT to transform an XSD into a set of xpath expressions. But I would just let the xml filter parse the whole document, something like the sketch below.
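A minimal sketch of that idea, assuming the whole invoice arrives in the message field (see further down) and using an illustrative invoice target field; source, target, store_xml, and force_array are all standard xml filter options:

```
filter {
  xml {
    # Parse the XML text held in [message].
    source => "message"
    # Store the parsed document as a nested object under [invoice];
    # all ~500 possible fields are mapped automatically, so nothing
    # has to be generated from the XSD.
    target => "invoice"
    store_xml => true
    # By default every element is wrapped in an array, which makes
    # querying and aggregating awkward.
    force_array => false
  }
}
```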
The file input will read any files that match the path, so I would expect you to use

path => "C:/elastic/XML4Logstash/*.xml"

Note that the file input requires forward slashes in the path, even on Windows.
If you run Logstash once, it will read the files and record that fact in the sincedb. If you restart Logstash, it will know it has already read the files and start tailing them from the point it had read to, so that anything appended can be processed. That is probably not useful in your case; however, it will read any new files.
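When testing, you often want the same files re-read on every run. A sketch of the usual way to do that, assuming Windows ("NUL" is the Windows equivalent of /dev/null); start_position and sincedb_path are standard file input options:

```
input {
  file {
    path => "C:/elastic/XML4Logstash/*.xml"
    # Read existing files from the start instead of tailing them.
    start_position => "beginning"
    # Discard the read-position bookkeeping so the files are
    # re-read on every run.
    sincedb_path => "NUL"
  }
}
```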
If you want to consume an entire file as a single event then you will need a multiline codec on the file input; there is a sketch after the next paragraph. If you do not use one, each line of the file will be a separate event.
The event will contain the contents of the file (either a single line, or the output of the multiline codec) in a field called "message", so that should be the source for your xml filter.
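Putting that together, a sketch of the whole-file trick: the pattern is an arbitrary string chosen never to match a real line, so with negate => true every line gets appended to the same event; pattern, negate, what, auto_flush_interval, and max_lines are all standard multiline codec options:

```
input {
  file {
    path => "C:/elastic/XML4Logstash/*.xml"
    start_position => "beginning"
    sincedb_path => "NUL"
    codec => multiline {
      # Never matches a real line, so every line is appended to
      # the previous event and the whole file becomes one event.
      pattern => "^THIS_WILL_NEVER_MATCH"
      negate => true
      what => "previous"
      # Flush the accumulated event once no new lines arrive;
      # without this the single event would never be emitted.
      auto_flush_interval => 2
      # The default is 500 lines; raise it if invoices are longer.
      max_lines => 20000
    }
  }
}
```

The resulting message field then feeds the xml filter sketched above.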