Unable to get XPath values in XML filter

Hello,

I have started working with Logstash and Elasticsearch, and my first task is to get XML files indexed by Elasticsearch.

For that I have:

  • a file input plugin that reads whole XML files
  • an XML filter plugin that creates properties using XPath
  • an Elasticsearch output plugin that indexes the result

My configuration looks like this:

input {
  file {
    path => "/absolute/path/to/xmls/**/*.xml"
    start_position => "beginning"
    max_open_files => 10000
    mode => "read"
    close_older => "1 minute"
    codec => multiline {
      pattern => "\Z"
      what => "previous"
    }
  }
}

filter {
  xml {
    source => "message"
    store_xml => false
    force_array => false
    xpath => [
      '/html/head/meta_identity/identifier/text()', "meta_identity_identifier",
      '/html/head/meta_identity/sortkey/text()', "meta_identity_sortkey",
      '/html/head/meta_identity/database/text()', "meta_identity_database",
      '/html/head/meta_identity/language/text()', "meta_identity_language"
    ]
  }
}

output {
  elasticsearch {
    index => "xml-data"
    hosts => ["localhost:9200"]
    sniffing => false
  }

  stdout { codec => rubydebug }
}

To test this, take one of the XML files. Here is a snippet of it:

<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
 <meta_identity>
  <identifier>a00001</identifier>
  <sortkey>000.000</sortkey>
  <database>Guidelines</database>
  <language>en</language>
 </meta_identity>
 ...
</head>
</html>

As can be seen from my filter, I want to do a very simple thing and extract these four element values into properties for Elasticsearch.

My problem is that the XPath expressions do not match anything. I get everything else (@version, @timestamp, etc.), but none of the properties I defined show up in either Elasticsearch or stdout.

I tried adding a mutate filter to see if that might fix my issue:

filter {
  xml {
    ...
  }

  mutate {
    replace => [
      "meta_identity_identifier", "%{meta_identity_identifier}",
      "meta_identity_sortkey", "%{meta_identity_sortkey}",
      "meta_identity_database", "%{meta_identity_database}",
      "meta_identity_language", "%{meta_identity_language}"
    ]
  }
}

Now I can see the properties, but the values are not what they should be. They are shown literally as %{meta_identity_language}, etc.

Logstash doesn't give any insight when run with --verbose.

What am I missing?

Well, I finally managed to solve it.

It turns out that some files had an xmlns attribute defined on the html element, which caused the XPath expressions to match nothing.

Even when I referenced it explicitly in the XPath, as in /html[@xmlns="http://domain.tld/path/to"], it did not resolve.

The final solution was to set remove_namespaces to true in the XML filter.
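
For anyone landing here with the same problem, the fix can be sketched roughly as the filter below. It reuses the XPath expressions from the question and only adds remove_namespaces, which is an option of the XML filter plugin. Treat it as an untested sketch of the setup described above. The [@xmlns="..."] predicate most likely failed because XPath does not treat a namespace declaration as an ordinary attribute, and an unprefixed step such as /html only matches elements that are in no namespace.

```
filter {
  xml {
    source => "message"
    store_xml => false
    force_array => false
    # Strip namespaces before evaluating XPath, so that plain paths
    # such as /html/head/... match files that declare xmlns on <html>.
    remove_namespaces => true
    xpath => [
      '/html/head/meta_identity/identifier/text()', "meta_identity_identifier",
      '/html/head/meta_identity/sortkey/text()', "meta_identity_sortkey",
      '/html/head/meta_identity/database/text()', "meta_identity_database",
      '/html/head/meta_identity/language/text()', "meta_identity_language"
    ]
  }
}
```

With this in place the mutate block from the question should no longer be needed, since the fields are set by the XML filter itself.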
