Parsing XML using Logstash XPath


(Karan Shah) #1

Hi,

I am trying to parse an XML file in Logstash, using XPath to extract fields from the document. When I run my config file the data loads into Elasticsearch, but not in the way I want: each line of the XML document ends up as a separate event in Elasticsearch.

Structure of my XML file


What I want to achieve:

Create fields in Elasticsearch that store the following:
ID = 1
Name = "Finch"
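The file itself was posted as an image, so its exact contents aren't shown here; judging from the XPath expressions in the config below and the expected fields above, the structure is presumably something along these lines (a reconstruction, not the actual file):

```xml
<stations>
  <station>
    <id>1</id>
    <name>Finch</name>
  </station>
</stations>
```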

My Config file:
input {
  file {
    path => "C:\Users\186181152\Downloads\stations.xml"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    exclude => "*.gz"
    type => "xml"
  }
}

filter {
  xml {
    source => "message"
    store_xml => false
    target => "stations"
    xpath => [
      "/stations/station/id/text()", "station_id",
      "/stations/station/name/text()", "station_name"
    ]
  }
}

output {
  elasticsearch {
    codec => json
    hosts => "localhost"
    index => "xmlns"
  }
  stdout {
    codec => rubydebug
  }
}
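As an aside, the XPath expressions themselves can be sanity-checked outside Logstash. Here is a small sketch using Python's standard library; the sample document is an assumption reconstructed from the XPaths, and since ElementTree only supports a limited XPath subset, element `.text` stands in for `text()`:

```python
# Check what the xml filter's XPath expressions should extract,
# given the whole document as a single string.
import xml.etree.ElementTree as ET

sample = """<stations>
  <station>
    <id>1</id>
    <name>Finch</name>
  </station>
</stations>"""

root = ET.fromstring(sample)  # root is the <stations> element
station_ids = [e.text for e in root.findall('./station/id')]
station_names = [e.text for e in root.findall('./station/name')]
print(station_ids, station_names)  # expect ['1'] ['Finch']
```

If this prints the values you expect, the XPaths are fine and the problem lies in how the events reach the filter.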

Output in Logstash:
{
      "station_name" => "%{station_name}",
              "path" => "C:\Users\186181152\Downloads\stations.xml",
        "@timestamp" => 2018-02-09T04:03:12.908Z,
        "station_id" => "%{station_id}",
          "@version" => "1",
              "host" => "BW",
           "message" => "\t\r",
              "type" => "xml"
}


#2

Yes, because a file input creates one event for each line of the file. The fix depends on the exact problem you are trying to solve. Do you always want to capture the entire file as a single event? Are there ever two stations elements in a file (not two station elements, which I can see there are, but two stations elements)? Are you only dealing with stations as the outermost thingy?
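To see why per-line events break the xml filter, consider feeding one line of the file to an XML parser. This sketch uses Python's standard library; Logstash's xml filter uses a different parser under the hood, but the failure mode is the same:

```python
# A single line of the file is not a well-formed XML document,
# so parsing fails and the XPath expressions match nothing, which is
# why the output shows the unexpanded "%{station_id}" references.
import xml.etree.ElementTree as ET

try:
    ET.fromstring("<stations>")  # one line of the file: an unclosed element
except ET.ParseError as exc:
    print("parse error:", exc)
```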


(Karan Shah) #3

Yes, the stations element will be the outermost thing, and yes, I want to capture the file as a single event. Can you help me with how I can achieve it?


#4

Append some pattern that you are confident will not occur in the XML (easy in this case) then use a stdin input with a multiline codec. That should capture the XML as a single event, which you can start attacking with an xml filter.

(cat file.xml; echo "Monsieur Spalanzani n'aime pas la musique") | ./logstash -f ...
input{
  stdin {
    codec => multiline {
      pattern => "^Monsieur Spalanzani n'aime pas la musique"
      negate => "true"
      what => "previous"
    }
  }
}
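For what it's worth, here is a rough Python model of what that codec does with negate => true and what => previous: lines that do not match the pattern are appended to the event being built, and a line that does match flushes the buffered event and starts a new one, which is why the marker never ends up inside the XML event. This is a sketch of the semantics, not the actual codec:

```python
# Rough model of codec => multiline { negate => true, what => "previous" }.
import re

MARKER = re.compile(r"^Monsieur Spalanzani n'aime pas la musique")

def multiline_previous_negate(lines):
    events, buf = [], None
    for line in lines:
        if MARKER.match(line):
            if buf is not None:       # marker seen: flush the buffered event
                events.append("\n".join(buf))
            buf = [line]              # the marker begins its own event
        else:                         # non-matching line: append to previous
            if buf is None:
                buf = []
            buf.append(line)
    if buf is not None:
        events.append("\n".join(buf))
    return events

doc = [
    "<stations>",
    "  <station><id>1</id><name>Finch</name></station>",
    "</stations>",
    "Monsieur Spalanzani n'aime pas la musique",
]
print(multiline_previous_negate(doc)[0])  # the whole XML as one event
```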

(Karan Shah) #5

I tried the multiline codec, but now it does not even create the index.
Following is the snippet I inserted after type => "xml" in my config file.

codec => multiline {
  pattern => ""
  negate => "true"
  what => "previous"
}


#6

Did you try exactly what I suggested? There is no out-of-the-box codec that captures the entire contents of a file and inputs it as an event. You have to append a marker to the file and tell the multiline codec to look for that marker.


(Karan Shah) #7

My pattern was the stations XML tag.


#8

That is not going to work, because what is in the pattern is not part of the event, so you end up with malformed XML. Try exactly what I suggested.


(Karan Shah) #9

I guess I didn't understand the suggestion properly. What I got from your comment was to use a pattern that will not repeat, so that the file can be captured as a complete event. All the data I have is under the stations XML tag. Can you suggest which pattern I should use then?


#10

Do exactly what I wrote. Append the line

Monsieur Spalanzani n'aime pas la musique

and then use a multiline codec that searches for the pattern "^Monsieur Spalanzani n'aime pas la musique".


(Karan Shah) #12

Do you want me to add that sentence to the first line of the file?


(Karan Shah) #13

I added that quote to my file and ran the config file again with the mentioned changes. Still no index is being created.

codec => multiline {
  pattern => "^Monsieur Spalanzani n'aime pas la musique"
  negate => "true"
  what => "previous"
}


#14

You have

output { stdout { codec => rubydebug } }

What does that output look like? Don't worry about trying to get events into Elasticsearch until Logstash is producing events that look right.

(Karan Shah) #15


My Logstash dump


#16

Please post text rather than images. I am not going to go and OCR that in order to be able to read it.

I am not asking for the logstash log file. I am looking for what got written to stdout. It will be pretty-printed, like this (but with different data, obviously)

{
          "tags" => [
        [0] "multiline"
    ],
           "foo" => "Riconisci in questo amplesso",
    "@timestamp" => 2018-02-09T22:13:46.370Z,
          "host" => "localhost",
       "message" => "[The input]",
           "bar" => "Osservate, leggete con me",
      "@version" => "1"
}

(Karan Shah) #17

Hi Badger,

I got output similar to what you pasted above. One thing didn't happen as planned, though:
when I run the config file I do not get the pretty-printed response immediately. I only get the "Logstash is running" output for a few hours. But when my laptop goes to sleep, or I restart multiple times, I suddenly get the pretty-printed response with the message, the multiline tag, the host name, and so on.


(Walker) #18

If he is using multiline, why would you be targeting a sentence marker and not the record field itself? I would think his multiline would look like:

codec => multiline {
  pattern => "<station>"
  negate => "true"
  what => "previous"
}

(Karan Shah) #19

I used this as well and it works. But unfortunately the data ingestion still only happens after a reboot or sleep.


(Walker) #20

Change your sincedb_path to include a filename.

File Input Reference

sincedb_path

• Value type is string
• There is no default value for this setting.

Path of the sincedb database file (keeps track of the current position of monitored log files) that will be written to disk. The default will write sincedb files to <path.data>/plugins/inputs/file NOTE: it must be a file path and not a directory path
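A hedged sketch of what that might look like for this setup (the sincedb filename here is a placeholder; note also that the "/dev/null" trick from the original config is a Unix idiom, and on Windows "NUL" is the usual equivalent if you actually want to disable the sincedb):

```
input {
  file {
    path => "C:/Users/186181152/Downloads/stations.xml"
    start_position => "beginning"
    sincedb_path => "C:/Users/186181152/sincedb_stations"  # explicit file, or "NUL" to disable on Windows
    type => "xml"
  }
}
```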


(Karan Shah) #21

I have another config file with the same sincedb_path and that works perfectly fine.