XML filter: create filter definition based on XSD

Hi

Context: I need to put (several million) XML invoices into ES and query and aggregate over their content.

Proposed solution: use Logstash xml filter plugin to map the XML to JSON

First question: For my context, would this be the best solution?
If so, second question: can I create the filter definition automatically from an XSD describing the invoices? The XSD contains about 500 possible fields, and I will have to create about 10 different indices.

Well, if you need to use xpath, I suppose in theory you could use XSLT to transform an XSD into a set of xpath expressions. But I would just use something like

xml {
    source => "message"
    store_xml => true
    target => "theXML"
    force_array => false
}

It will parse the XML and create a JSON structure that reflects it, in a field called theXML.
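For example (the invoice element and its fields here are just illustrative), an event whose message field contains

<invoice><id>42</id><total>99.95</total></invoice>

comes out with the root element's children as keys under the target, and force_array => false keeps single values as scalars rather than one-element arrays:

"theXML" => {
    "id" => "42",
    "total" => "99.95"
}

Note that the values are strings; the filter does not infer types. And if you did go the XSD route, the generated xpath expressions would feed the filter's xpath option instead, with store_xml => false, along these lines (the destination field name is hypothetical):

xml {
  source => "message"
  store_xml => false
  xpath => [ "/invoice/id/text()", "invoice_id" ]
}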

Whaaaaaaaaaat? So simple?

That's awesome!

So I started this up and added this config file, without changing anything else in the Logstash configuration:

input {
  file {
    path => "C:\elastic\XML4Logstash"
    type => "xml"
  }
}
filter {
  xml {
    source => "message"
    store_xml => false
    target => "orderrsp"
    force_array => false
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}

I was hoping Logstash would now pick up any file in that folder?
I've been looking at the documentation. It says what exists, but not how it works.

You almost certainly want store_xml to be true. With store_xml => false the filter only evaluates xpath expressions, so your target is never populated.
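Applied to your filter, that single change would look like:

xml {
  source => "message"
  store_xml => true
  target => "orderrsp"
  force_array => false
}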

The file input will read any files that match the path. So I would expect you to use

path => "C:/elastic/XML4Logstash/*.xml"

If you run Logstash once, it will read the files and record that fact in the sincedb. If you restart Logstash, it will know it has already read the files and will start tailing them from the point it had read to, so that if anything is appended it can process it. That is probably not useful in your case. However, it will read any new files.
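While testing, if you want every run to re-read the files from the start, you can point the sincedb somewhere disposable and set start_position. Something like this (NUL is the usual trick on Windows; on Linux you would use /dev/null):

file {
  path => "C:/elastic/XML4Logstash/*.xml"
  # read new files from the top rather than the tail
  start_position => "beginning"
  # throw away the read-position bookkeeping between runs
  sincedb_path => "NUL"
}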

I got it doing stuff, but it's still hard to understand what is going on.
Please correct me if I'm wrong:

input {
  file {
    path => "C:\elastic\XML4Logstash"
    type => "xml"
  }
}
filter {
  xml {
    source => "Orderrsp"
    remove_namespaces => true
    store_xml => true
    target => "orderrsp"
    force_array => false
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash"
  }
}

Logstash reads the XML from the input folder.
Then it filters it:

  • it takes the source and puts it in the message field. That field is an internal Logstash field?
  • it does some stuff to it and stores the result in the orderrsp field (again a Logstash field).

Next it will try to output it to ES in the logstash index (this is a very early proof of concept, so no wildcards).

If you want to consume an entire file as a single event then you will need to use a multiline codec on the file input. There is an example of this below. If you do not use multiline then each line of the file will be a separate event.

The event will contain the contents of the file (either a single line, or the output of the multiline codec) in a field called "message", so that should be the source for your xml filter.
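A common sketch of that codec: the pattern is just an arbitrary string chosen so that no real line ever matches it, negate => true then makes every line join the previous one, and auto_flush_interval flushes the assembled event after a second of silence. The path and max_lines values are assumptions for your setup:

input {
  file {
    path => "C:/elastic/XML4Logstash/*.xml"
    start_position => "beginning"
    codec => multiline {
      # arbitrary pattern that never matches a real line
      pattern => "^ThisLineWillNeverMatch"
      # join every non-matching line (i.e. all of them) to the previous line
      negate => true
      what => "previous"
      # flush the buffered event after 1 second of no new lines
      auto_flush_interval => 1
      # default is 500 lines, which a large invoice file could exceed
      max_lines => 20000
    }
  }
}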

Hi Badger

Thanks to your help I got it working all the way.
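For reference, the pipeline that works for me now looks roughly like this (assembled from the pieces above; the index name is still the proof-of-concept one):

input {
  file {
    path => "C:/elastic/XML4Logstash/*.xml"
    start_position => "beginning"
    codec => multiline {
      pattern => "^ThisLineWillNeverMatch"
      negate => true
      what => "previous"
      auto_flush_interval => 1
      max_lines => 20000
    }
  }
}
filter {
  xml {
    source => "message"
    remove_namespaces => true
    store_xml => true
    target => "orderrsp"
    force_array => false
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash"
  }
}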

Only now I will need to define mappings for my documents, and each mapping is going to be pretty large (100+ fields apiece).

Is there any way I can translate an XSD into an ES mapping on the fly?

Thanks

J.

You might want to ask that in the elasticsearch forum.
