Getting timestamp field from XML attribute

I'm using the xml filter plugin in Logstash to parse an XML document, and am having success with the exception of the @timestamp value, which is found as an attribute value. Here is a snippet of the XML file:

<cdf:Benchmark resolved="1" style="SCAP_1.2">
    <cdf:TestResult start-time="2022-01-17T11:15:01" end-time="2022-01-17T11:15:37">

and here is my full pipeline config (the last line of the filter is the problem):

input {
  file {
    path => [ "C:/temp/SCAP/*.xml" ]
    start_position => "beginning"
    codec => multiline {
      pattern => "^ZsExDrC" 
      what => "previous" 
      negate => true 
      auto_flush_interval => 2
      max_lines => 50000
    }
  }
}

filter {
  xml {
    source => "message"
    target => "doc"
    xpath => [ "/cdf:Benchmark/cdf:title/text()", "benchmark",
               "/cdf:Benchmark/cdf:plain-text[@id='release-info']/text()", "release-info",
               "/cdf:Benchmark/cdf:Value[1]/cdf:title/text()", "setting-title",
               "/cdf:Benchmark/cdf:TestResult/cdf:score[1]/text()", "vulnerability.score.base",
               "/cdf:Benchmark/cdf:TestResult/cdf:target/text()", "host.name",
               "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:os_name']/text()", "host.os.name",
               "/cdf:Benchmark/cdf:TestResult/cdf:target-address[normalize-space()][1]/text()", "host.ip",
               "/cdf:Benchmark/cdf:TestResult/@start-time", "@timestamp" ]
  }
}

output {
  elasticsearch {
  hosts => ["localhost:9200"]
  index => "scap-results-%{+YYYY.MM.dd}"
  }
}

The xpath is working correctly, but it doesn't identify that element attribute as a date field because I get the following output from the xpath: start-time=2022-01-17T11:15:01 instead of 2022-01-17T11:15:01. In other words, I don't seem to be able to select just the attribute value without the attribute name.

I tried adding /text() to the end like this:

/cdf:Benchmark/cdf:TestResult/@start-time/text()

but then the xpath fails because this isn't proper xpath syntax.

I do not understand why that would happen. This works for me:

input { generator { count => 1 lines => [ '
<cdf:Benchmark resolved="1" style="SCAP_1.2" xmlns:cdf="http://www.example.com/">
<cdf:TestResult start-time="2022-01-17T11:15:01" end-time="2022-01-17T11:15:37">
</cdf:TestResult>
</cdf:Benchmark>
' ] } }
filter {
    xml {
        namespaces => { "cdf" => "http://www.example.com/" }
        source => "message"
        store_xml => false
        xpath => { "/cdf:Benchmark/cdf:TestResult/@start-time" => "[@metadata][timestamp]" }
    }
    date { match => [ "[@metadata][timestamp][0]", "YYYY-MM-dd'T'HH:mm:ss" ] }
}

You cannot store directly into [@timestamp] because that will result in the error

XML Parse Error {:exception=>"wrong argument type String (expected LogStash::Timestamp)"
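
To confirm what the xpath option actually wrote, keep in mind that fields under [@metadata] are not sent to outputs by default, so they will not appear in Elasticsearch or in a plain rubydebug dump. A minimal sketch for inspecting them, assuming a temporary stdout output is acceptable while testing:

```
output {
  # Temporary debugging output: metadata => true makes the rubydebug codec
  # print [@metadata] fields such as [@metadata][timestamp].
  stdout { codec => rubydebug { metadata => true } }
}
```

The xpath option stores its matches as an array, which is why the date filter above reads [@metadata][timestamp][0].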

I changed my filter block to:

filter {
  xml {
    source => "message"
    target => "doc"
    xpath => [ "/cdf:Benchmark/cdf:title/text()", "benchmark",
               "/cdf:Benchmark/cdf:plain-text[@id='release-info']/text()", "release-info",
               "/cdf:Benchmark/cdf:Value[1]/cdf:title/text()", "setting-title",
               "/cdf:Benchmark/cdf:TestResult/cdf:score[1]/text()", "vulnerability.score.base",
               "/cdf:Benchmark/cdf:TestResult/cdf:target/text()", "host.name",
               "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:os_name']/text()", "host.os.name",
               "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:processor_mhz']/text()", "host.cpu.speed",
               "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:physical_memory']/text()", "host.memory",
               "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:processor']/text()", "host.cpu",
               "/cdf:Benchmark/cdf:TestResult/cdf:target-address[normalize-space()][1]/text()", "host.ip",
               "/cdf:Benchmark/cdf:TestResult/@start-time", "start-time" ]
  }
  date { match => [ "start-time", "YYYY-MM-dd'T'HH:mm:ss" ] }
}

But I am still getting no available timestamp field in Kibana.

When I create the index pattern anyway, I see that I'm not getting any of my fields, only the "underscore" fields.

I don't understand this because when I test my xpath in online xpath testers, it works.

I have no idea how that could possibly work without the namespace definitions.

I'm obviously a novice at parsing XML, and don't quite understand namespaces. I did look at Logstash documentation, but it didn't give much. Here is the full root element of the document:

<cdf:Benchmark resolved="1" style="SCAP_1.2" xsi:schemaLocation="http://checklists.nist.gov/xccdf/1.2 http://scap.nist.gov/schema/xccdf/1.2/xccdf_1.2.xsd http://cpe.mitre.org/dictionary/2.0 http://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://cpe.mitre.org/language/2.0 http://scap.nist.gov/schema/cpe/2.3/cpe-language_2.3.xsd" id="xccdf_mil.disa.stig_benchmark_Windows_10_STIG" xmlns:cdf="http://checklists.nist.gov/xccdf/1.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/">

Given that, what should my namespaces setting be? I tried the following:

namespaces => {
    "xccdf" => "http://scap.nist.gov/schema/xccdf/1.2/xccdf_1.2.xsd"
    "xsi" => "http://cpe.mitre.org/dictionary/2.0"
    "cpe" => "http://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd"
    "xml" => "http://www.w3.org/2001/XMLSchema-instance"
    "rdf" => "http://purl.org/dc/elements/1.1/"
}

...and it ingested the file in the message field, but none of my other fields were created. I just got an _xmlparsefailure tag, with the following in logstash-plain.log:

org.apache.xpath.domapi.XPathStylesheetDOM3Exception: Prefix must resolve to a namespace: cdf

It did give me a timestamp of the time the file was ingested, but not the date/time I wanted to pull from the document.

This might be a stupid question, but namespaces requires an internet connection, correct? So if I'm working on an air-gapped system, could I set up an internal page for it to reference instead of an internet page?

Not a stupid question at all. Logstash does not resolve the namespace URIs. You can literally use "http://www.example.com/" for all of them and it will quite happily parse the namespace name. It does not care about the namespace value.

The namespaces option you show does not include "cdf", which is the namespace used in your XML.

Given that your XML includes

xmlns:cdf="http://checklists.nist.gov/xccdf/1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:dc="http://purl.org/dc/elements/1.1/">

I would go with

namespaces => {
    "cdf" => "http://checklists.nist.gov/xccdf/1.2" 
    "xsi" => "http://www.w3.org/2001/XMLSchema-instance" 
    "dc" => "http://purl.org/dc/elements/1.1/"
}

Thanks for the explanation - very helpful! My config now looks like this:

input {
  file {
    path => [ "C:/temp/SCAP/*.xml" ]
    start_position => "beginning"
    codec => multiline {
      pattern => "^ZsExDrC" 
      what => "previous" 
      negate => true 
      auto_flush_interval => 2
      max_lines => 50000
    }
  }
}

filter {
  xml {
    namespaces => {
      "cdf" => "http://checklists.nist.gov/xccdf/1.2"
      "xsi" => "http://www.w3.org/2001/XMLSchema-instance"
      "dc" => "http://purl.org/dc/elements/1.1/"
    }
    source => "message"
    target => "doc"
    xpath => {
      "/cdf:Benchmark/cdf:title/text()" => "benchmark"
      "/cdf:Benchmark/cdf:plain-text[@id='release-info']/text()" => "release-info"
      "/cdf:Benchmark/cdf:TestResult/@start-time" => "[@metadata][timestamp]"
      "/cdf:Benchmark/cdf:TestResult/cdf:target/text()" => "host.name"
      "/cdf:Benchmark/cdf:TestResult/cdf:target-address[normalize-space()][1]/text()" => "host.ip"
      "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:os_name']/text()" => "host.os.name"
      "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:processor']/text()" => "host.cpu"
      "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:processor_mhz']/text()" => "host.cpu.speed"
      "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:physical_memory']/text()" => "host.memory"
      "/cdf:Benchmark/cdf:TestResult/cdf:score[1]/text()" => "vulnerability.score.base"
    }
  }
  date { match => [ "[@metadata][timestamp][0]", "YYYY-MM-dd'T'HH:mm:ss" ] }
}

output {
  elasticsearch {
  hosts => ["localhost:9200"]
  index => "scap-results-%{+YYYY.MM.dd}"
  }
}

However, I'm still not getting any timestamp field or any other parsed fields. It just ingests into an index with zero documents. I do get this warning in logstash-plain.log (I added line breaks to make it more legible):

[2022-01-18T13:17:56,173][WARN ][logstash.outputs.elasticsearch][scap-results]
[fad371fb3a2f1bced415913c622407598fadd3ce093c68958e81938693f4259c] 
Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil,
:_index=>"scap-results-2022.01.17", :routing=>nil}, {
"@timestamp"=>2022-01-17T16:15:01.000Z, "host.ip"=>["192.168.40.208"],
"message"=>"<?xml version=\"1.0\" ....<the full XML document>...............
"tags"=>["multiline"], "benchmark"=>["Windows 10 Security Technical Implementation Guide"], "host.cpu"=>["AMD A6-5400K APU with Radeon(tm) HD Graphics   "], "host"=>"Finlandia", "host.memory"=>["16384"]...........
"error"=>{"type"=>"illegal_argument_exception", 
"reason"=>"can't merge a non object mapping [doc.Value.value] with an object mapping"}}}}

From that, you can see that it is in fact parsing the fields (for instance, the host.ip, host.cpu, etc. as well as the @timestamp), but it is not able to index it. I found this related discussion, but am not clear on how to use that solution, since that post seems more to do with machine learning jobs. In another discussion, @xeraa said:

Either have a concrete value or a subdocument in a field, but don't mix them.

But I don't understand this. My fields have concrete values (from either the XML element or attribute).

Not exactly. The default for store_xml is true, so the complete XML document is parsed and stored in the target field (doc, in your case). [doc] is an object which has many objects nested inside it. So, based on an SCAP file I found online, it might contain a [doc][Benchmark][TestResult][rule-result][result][override][old-result] field that has a concrete value in it. But the 6 fields that parent it are all objects, not just fields.

For debugging purposes, set store_xml to false and verify the fields you are extracting with xpath look OK.

If this mapping [doc.Value.value] is not obfuscated then that is the field that is causing the problem. Check the mapping of the index. Turn store_xml back on, replace the elasticsearch output with output { stdout { codec => rubydebug } }, and see what the format of [doc][Value] is.
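
A minimal sketch of that swap, assuming the rest of the pipeline stays as posted. The commented-out block is the original output, kept only for reference:

```
output {
  # elasticsearch {
  #   hosts => ["localhost:9200"]
  #   index => "scap-results-%{+YYYY.MM.dd}"
  # }

  # Print every event, including the nested [doc] structure produced by
  # store_xml, so the shape of [doc][Value] can be read directly.
  stdout { codec => rubydebug }
}
```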

See this post for ideas on how to reformat the data once you have decided whether you want it to be a concrete value or an object.


I appreciate all your help. Here are the results of your suggestions:

  • Even after setting store_xml to false, I got the same results (xpath fields don't appear in ES/Kibana).
  • That doc.Value.value was not obfuscated, but when I run GET /scap-results-2022.01.17/_mapping, I just get:
{
  "scap-results-2022.01.17" : {
    "mappings" : { }
  }
}

So there don't appear to be any mappings (?) ...At least I know I haven't defined any for this index.

  • I then re-enabled store_xml and sent output to stdout, but it went by too fast to catch, so I sent it to a file instead and searched for [doc][Value] and doc.Value. I found nothing, although all the other xpath fields show up just fine, for example:
"settings.total":["211"],"host":"Finlandia","release-info":["Release: 2.3 Benchmark Date: 01 Nov 2021"],"vulnerability.score.base":["40.76"],"host.cpu.speed":["3593"],"settings.pass":["86"],"host.name":["FINLANDIA"]....<etc>
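
For anyone repeating this step, events can be written to a file with the file output plugin. A minimal sketch; the path below is a hypothetical example, not the one used in this thread:

```
output {
  # Hypothetical debug file; json_lines writes one JSON document per event,
  # which makes it easy to search for a field name afterwards.
  file {
    path => "C:/temp/logstash-debug.json"
    codec => json_lines
  }
}
```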

I was wondering, if using doc as the target is causing the conflict, can I just use a different target? Also, if I don't really need to store the xml (as long as I can extract my fields), then that would be fine too. But again, when I turned off store_xml, I'm not getting anything in ES.

If there are no mappings, that suggests there is no data in the index. Did you turn off dynamic mapping?

I do not run Elasticsearch myself, so I am not sure what else to try.

I found the problem field. For some reason host.cpu.speed was causing the problem, even when I turned off store_xml. That must have been the doc.Value.value the log was referring to. I just changed it to host.cpu_speed and it worked. I decided to also leave store_xml off, since I don't really need it. (I still need to do more learning about the difference between objects and concrete values, but will tackle that another day.) Thank you again @Badger for all your answers on this one thread!
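
For readers who hit the same error, one likely explanation (an assumption, not something confirmed in this thread) is that Elasticsearch treats dots in field names as object paths. With both host.cpu and host.cpu.speed present, host.cpu would have to be a concrete value and an object containing speed at the same time. A sketch of the renamed mapping described above, trimmed to the two fields involved:

```
filter {
  xml {
    namespaces => { "cdf" => "http://checklists.nist.gov/xccdf/1.2" }
    source => "message"
    store_xml => false
    xpath => {
      # host.cpu holds a concrete value ...
      "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:processor']/text()" => "host.cpu"
      # ... so the speed goes into a sibling name rather than host.cpu.speed.
      "/cdf:Benchmark/cdf:TestResult/cdf:target-facts/cdf:fact[@name='urn:scap:fact:asset:identifier:processor_mhz']/text()" => "host.cpu_speed"
    }
  }
}
```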
