Problem with xml filter

We're having issues setting a date pulled from an XML document as @timestamp using the date filter.

The input XML looks like this:

<Task>
  <TaskId>ServerTasks-5017</TaskId>
  <TaskState>Success</TaskState>
  <Created>2015-12-22T08:20:03</Created>
  <QueueTime>2015-12-22T08:20:03</QueueTime>
  <StartTime>2015-12-22T08:20:06</StartTime>
  <CompletedTime>2015-12-22T08:21:11</CompletedTime>
  <DurationSeconds>68</DurationSeconds>
</Task>

The filter looks like this:

filter {
  xml {
    source => "message"
    target => "xml_content"
    xpath => [ "//Created/text()", "created" ]
  }
  date {
    match => [ "created", "yyyy-MM-dd'T'HH:mm:ss" ]
    #match => [ "xml_content.Created", "yyyy-MM-dd'T'HH:mm:ss" ]
  }
}

As you can see, I've tried matching both the field extracted via XPath and the content pulled from the XML directly.

Both end up with a _dateparsefailure tag, and @timestamp is set to the Logstash processing time.

Any tips for a novice logstash user on how to resolve this?

I'm not XPath-fluent, but shouldn't the expression be //Task/Created/text()? What does the resulting event look like? Use a stdout { codec => rubydebug } output. Also, what's in the logs? When the date filter fails it tells you why in the log.

@timestamp and created look like this:

"@timestamp" => 2017-03-30T18:29:09.354Z,
"created" => [
[0] "2015-12-22T08:20:03"
],
"@version" => "1",

I might be looking in the wrong log, but I can't find anything relating to _dateparsefailure in it. I even tried setting the log level to debug.

The problem is that created is an array. Disable the xml filter's force_array option.

Thanks for your help! Much appreciated!

Even with force_array set to false, the element extracted with xpath (created) still comes out as an array. The rest of the content, which isn't extracted with xpath, is no longer in array format.

Any tips? Can I reference the first element of the created array somehow?

I can reproduce your behavior:

% bin/logstash -e 'filter { xml { source => "message" xpath => { "//Created/text()" => "foo" } store_xml => false } } '

{
  "@timestamp" => 2017-03-29T06:49:47.122Z,
  "foo"        => [
    [0] "2015-12-22T08:20:03"
  ],
...

I agree this behavior is not what I'd expect :wink:

There's an open issue on the logstash-filter-xml project that mentions this issue. Your feedback on this issue is welcome :slight_smile:

Yes.

Until the issue mentioned by @jordansissel is resolved, insert the following mutate filter after the xml filter:

mutate {
  replace => { "created" => "%{created[0]}" }
}

This replaces the array with the value of the first (and, in this case, only) element.
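Outside Logstash, the effect of that mutate can be sketched in plain Ruby (illustrative only, not Logstash internals):

```ruby
# Before the mutate filter: the XPath result is a one-element array.
event = { "created" => ["2015-12-22T08:20:03"] }

# The sprintf reference %{created[0]} resolves to the first element;
# replace then swaps the array for that scalar string.
event["created"] = event["created"][0]

event["created"]  # => "2015-12-22T08:20:03"
```

After this, the date filter sees a plain string it can parse.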

That is, instead of:

"created":["2015-12-22T08:20:03"]

the output contains:

"created":"2015-12-22T08:20:03"

Regarding the following match setting (commented-out in your example):

match => ["xml_content.Created", "yyyy-MM-dd'T'HH:mm:ss"]

From the Elastic docs topic “Accessing Event Data and Fields in the Configuration”:

The syntax to access a field is [fieldname]. If you are referring to a top-level field, you can omit the [] and simply use fieldname. To refer to a nested field, you specify the full path to that field: [top-level field][nested field].

Replace xml_content.Created with [xml_content][Created].

This works with your sample XML data in Logstash 5.2.1, with force_array => false, as recommended by @magnusbaeck.
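For reference, with the bracket syntax the commented-out match setting would look like this (a sketch against your config; untested):

date {
  match => [ "[xml_content][Created]", "yyyy-MM-dd'T'HH:mm:ss" ]
}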

The expression that you cite and the original expression will both work. However, /Task/Created/text() (note the single leading slash) is better.

From the XPath W3C recommendation:

// is short for /descendant-or-self::node()/.

A more specific path, such as /Task/Created, is typically more efficient (and less likely to select nodes that you didn’t intend to :slight_smile:).
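Both forms can be checked with Ruby's bundled REXML library (a standalone sketch, not part of the Logstash pipeline):

```ruby
require "rexml/document"

xml = "<Task><TaskId>ServerTasks-5017</TaskId>" \
      "<Created>2015-12-22T08:20:03</Created></Task>"
doc = REXML::Document.new(xml)

# //Created searches descendant-or-self from the root: it matches,
# but has to consider every node in the tree.
via_descendant = REXML::XPath.first(doc, "//Created/text()").to_s

# /Task/Created walks a rooted path straight to the element.
via_rooted = REXML::XPath.first(doc, "/Task/Created/text()").to_s

via_descendant == via_rooted  # both select "2015-12-22T08:20:03"
```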

Try this filter:

filter {
    xml {
        source => "message"
        target => "@metadata[xml_content]"
        force_array => false
    }
    # Copy XML content to first-level fields with all-lowercase names
    ruby {
        code => '
            event.get("@metadata[xml_content]").each do |key, value|
                event.set(key.downcase, value)
            end
        '
    }
    mutate {
         remove_field => ["message", "@metadata"]
         convert => {
           "durationseconds" => "integer"
         }
    }
    date {
        match => ["created", "ISO8601"]
    }
}

Notes:

  • @magnusbaeck: I thought @metadata wasn’t supposed to get passed through to the output, but it does get included in output to stdout and Elasticsearch. Hence its presence in remove_field. Did I miss a memo? (I’m using Logstash 5.2.1.)
  • That Ruby code is a workaround for the issue I describe in “Set target of xml filter to root?”
  • If you want to preserve the original case of the XML element names, remove .downcase from key.downcase
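The key-lowering step in that ruby filter boils down to a plain hash copy. In isolation, with a hand-built hash standing in for the parsed XML:

```ruby
# Stand-in for what the xml filter would store under @metadata.
xml_content = {
  "TaskId"  => "ServerTasks-5017",
  "Created" => "2015-12-22T08:20:03"
}

# Copy each entry to a lowercase key, as the ruby filter does with
# event.set(key.downcase, value).
flattened = {}
xml_content.each { |key, value| flattened[key.downcase] = value }

flattened  # => {"taskid"=>"ServerTasks-5017", "created"=>"2015-12-22T08:20:03"}
```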

Example Logstash output

In JSON format:

"@timestamp": "2015-12-22T08:20:03.000Z",
"completedtime": "2015-12-22T08:21:11",
"created": "2015-12-22T08:20:03",
"@version": "1",
"host": "58a3fe88f636",
"starttime": "2015-12-22T08:20:06",
"durationseconds": 68,
"taskid": "ServerTasks-5017",
"queuetime": "2015-12-22T08:20:03",
"taskstate": "Success"

More unsolicited tips

If you can (that is, if you are responsible for creating the original XML-format events), consider adding a zone designator to the time stamps. Otherwise, be sure that you understand the repercussions of specifying local times, and how those values might be interpreted.
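Ruby's own Time.parse illustrates why the designator matters (a sketch; the date filter's timezone option exists precisely to pin down zone-less strings like these):

```ruby
require "time"

# The same wall-clock string names different instants
# depending on the zone designator (or lack of one).
utc   = Time.parse("2015-12-22T08:20:03Z")       # explicit UTC
plus1 = Time.parse("2015-12-22T08:20:03+01:00")  # explicit UTC+1
local = Time.parse("2015-12-22T08:20:03")        # no designator: host's local zone

utc - plus1     # => 3600.0 (one hour apart)
utc.utc_offset  # => 0
```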

I have added my two cents to that issue.

This works perfectly! Thank you all for all tips!