XML and Logstash aren't particularly well-matched, so if you have other options for the file format (such as ndjson), you might have better luck. The primary reason is that Logstash is an engine for processing streams of data (e.g., data being appended to files), while XML by definition cannot be appended to because a legal XML file already contains its closing element.
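For contrast, with a line-delimited format like ndjson each record is complete on its own line, so new records can be appended indefinitely and consumed as they arrive. A minimal sketch (the path is hypothetical; the file input and json codec are standard Logstash plugins):

input {
  file {
    path => "/some/absolute/path/test.ndjson"
    codec => json # one complete JSON document per line; each appended line becomes one event
  }
}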
The sample XML you have pasted is also not valid XML:
╭─{ yaauie@castrovel:~/src/elastic/discuss-scratch/124158-xml }
╰─○ xmllint example-input.xml
example-input.xml:2: parser error : error parsing attribute name
<ELEM=0>
^
example-input.xml:2: parser error : attributes construct error
<ELEM=0>
^
example-input.xml:2: parser error : Couldn't find end of Start Tag ELEM line 2
<ELEM=0>
^
example-input.xml:7: parser error : error parsing attribute name
<ELEM=1>
^
example-input.xml:7: parser error : attributes construct error
<ELEM=1>
^
example-input.xml:7: parser error : Couldn't find end of Start Tag ELEM line 7
<ELEM=1>
^
example-input.xml:12: parser error : error parsing attribute name
<ELEM=2>
^
example-input.xml:12: parser error : attributes construct error
<ELEM=2>
^
example-input.xml:12: parser error : Couldn't find end of Start Tag ELEM line 12
<ELEM=2>
^
[error: 1]
That said, if you had valid XML, the pipeline would likely have a shape something like the following:
input {
  # ...
}
filter {
  # replaces value at `message` with the data structure it represents
  xml {
    source => "message"
    target => "message"
  }
  # emits one event per element in data structure; operates on `message` field by default
  split { }
}
filter {
  # any additional filters to enrich/transform the individual elements
}
output {
  elasticsearch {
    # ...
  }
}
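While iterating on a pipeline like this, it can help to temporarily swap the elasticsearch output for a stdout output with the rubydebug codec, so you can inspect exactly what each per-element event looks like:

output {
  # pretty-prints each event; useful while developing, not for production
  stdout { codec => rubydebug }
}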
It is still not valid. <ELEM=0> is missing an attribute name, and the attribute value is unquoted; both will trip up the XML parser. <ELEM x="0"> is what it wants.
╭─{ yaauie@castrovel:~/src/elastic/discuss-scratch/124158-xml }
╰─○ xmllint sample.xml
sample.xml:2: parser error : error parsing attribute name
<ELEM=0>
^
sample.xml:2: parser error : attributes construct error
<ELEM=0>
^
sample.xml:2: parser error : Couldn't find end of Start Tag ELEM line 2
<ELEM=0>
^
sample.xml:6: parser error : Opening and ending tag mismatch: TEST_INTF line 1 and ELEM
</ELEM>
^
sample.xml:7: parser error : Extra content at the end of the document
<ELEM=1>
^
[error: 1]
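For reference, a minimal well-formed reshaping of the sample might look something like the following (a sketch; it assumes TEST_INTF is the intended root element, as in the xmllint output above, and that the bare numbers were meant to be attribute values):

<TEST_INTF>
  <ELEM x="0">
    <!-- ... -->
  </ELEM>
  <ELEM x="1">
    <!-- ... -->
  </ELEM>
</TEST_INTF>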
If you don't have an explicit reason to use XML, I would seriously suggest avoiding it; as stated earlier in this thread, while it is technically possible to parse XML when we need to, the format is not well matched to how Logstash works.
OK, there are two things to consider. The first is how to ingest the file and get one event for each outer XML element. There are a few different use cases here. If you have something like a J9 JVM garbage collection log, where the JVM is forever appending XML to it, a logstash file input is an excellent fit. However, if you have a single file that contains XML, will never change, and just needs to be ingested once, then a file input is a poor fit (not least because logstash does not exit when it reaches EOF; it waits and tails the file), and it is much easier to use a stdin input.
If you were going to use a file input, it would look something like this. You have to set auto_flush_interval because there is no second event to trigger emission of the first; I regard this as an ugly hack.
input {
  file {
    path => "/some/absolute/path/test.xml"
    sincedb_path => "/dev/null"
    start_position => "beginning"
    codec => multiline {
      what => "previous"
      pattern => "^" # every line has a beginning, so every line is appended to the previous event
      auto_flush_interval => 2
    }
  }
}
With a stdin input I would do this. Every line of the file fails to match the sentinel pattern, so (with negate) it gets appended to the previous event; when the echoed sentinel arrives it starts a new event, flushing the entire file as a single event:
(cat file.xml; echo "Monsieur Spalanzani n'aime pas la musique") | ./logstash -f ...
input {
  stdin {
    codec => multiline {
      pattern => "^Monsieur Spalanzani n'aime pas la musique"
      negate => "true"
      what => "previous"
    }
  }
}
Next up, parse the XML and split the ELEM array up.
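A sketch of that step, assuming the root element is TEST_INTF as in the xmllint output above (the theXML target name is arbitrary):

filter {
  # parse the reassembled XML document into a data structure under [theXML]
  xml {
    source => "message"
    target => "theXML"
  }
  # emit one event per entry in the ELEM array
  split {
    field => "[theXML][ELEM]"
  }
}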