Problem with xml filter

We're having issues setting a date pulled from an XML document as @timestamp with the date filter.

The input XML looks like this:
<Task>
  <TaskId>ServerTasks-5017</TaskId>
  <TaskState>Success</TaskState>
  <Created>2015-12-22T08:20:03</Created>
  <QueueTime>2015-12-22T08:20:03</QueueTime>
  <StartTime>2015-12-22T08:20:06</StartTime>
  <CompletedTime>2015-12-22T08:21:11</CompletedTime>
  <DurationSeconds>68</DurationSeconds>
</Task>

The filter looks like this:

filter {
  xml {
    source => "message"
    target => "xml_content"
    xpath  => [ "//Created/text()", "created" ]
  }
  date {
    match => [ "created", "yyyy-MM-dd'T'HH:mm:ss" ]
    #match => [ "xml_content.Created", "yyyy-MM-dd'T'HH:mm:ss" ]
  }
}

As you can see, I've tried matching both the field extracted with XPath and the content pulled from the XML directly.

Both attempts end up with _dateparsefailure, and @timestamp is set to the Logstash processing time.

Any tips for a novice Logstash user on how to resolve this?

I'm not XPath-fluent, but shouldn't the expression be //Task/Created/text()? What does the resulting event look like? Use a stdout { codec => rubydebug } output. Also, what's in the logs? When the date filter fails it tells you why in the log.
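
For reference, a minimal output section for dumping events to the console looks like this:

output {
  stdout { codec => rubydebug }
}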

@timestamp and "created" look like this:

"@timestamp" => 2017-03-30T18:29:09.354Z,
"created" => [
[0] "2015-12-22T08:20:03"
],
"@version" => "1",

I might be looking in the wrong log, but I can't find anything relating to _dateparsefailure. I even tried setting the log level to debug.

The problem is that created is an array. Disable the xml filter's force_array option.
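
For example, a sketch of your xml filter with that option disabled (force_array defaults to true):

xml {
  source      => "message"
  target      => "xml_content"
  xpath       => [ "//Created/text()", "created" ]
  force_array => false
}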

Thanks for your help! Much appreciated!

Even with force_array set to false, the XPath-extracted element 'created' still comes out as an array. The rest of the content, which isn't extracted with XPath, is no longer in array format, though.

Any tips? Can I reference the first element of the array 'created' somehow?

I can reproduce your behavior:

% bin/logstash -e 'filter { xml { source => "message" xpath => { "//Created/text()" => "foo" } store_xml => false } } '

{
  "@timestamp" => 2017-03-29T06:49:47.122Z,
  "foo"        => [
    [0] "2015-12-22T08:20:03"
  ],
...

I agree this behavior is not what I'd expect :wink:

There's an open issue on the logstash-filter-xml project that mentions this issue. Your feedback on this issue is welcome :slight_smile:

Yes.

Until the issue mentioned by @jordansissel is resolved, insert the following mutate filter after the xml filter:

mutate {
  replace => { "created" => "%{created[0]}" }
}

This replaces the array with the value of the first (and, in this case, only) element.

That is, instead of:

"created":["2015-12-22T08:20:03"]

the output contains:

"created":"2015-12-22T08:20:03"

Regarding the following match setting (commented-out in your example):

match => ["xml_content.Created", "yyyy-MM-dd'T'HH:mm:ss"]

From the Elastic docs topic “Accessing Event Data and Fields in the Configuration”:

The syntax to access a field is [fieldname]. If you are referring to a top-level field, you can omit the [] and simply use fieldname. To refer to a nested field, you specify the full path to that field: [top-level field][nested field].

Replace xml_content.Created with xml_content[Created].

This works with your sample XML data in Logstash 5.2.1, with force_array => false, as recommended by @magnusbaeck.
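
For example, a sketch of the date filter with the corrected field reference:

date {
  match => [ "xml_content[Created]", "yyyy-MM-dd'T'HH:mm:ss" ]
}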

The expression that you cite and the original expression will both work. However, /Task/Created/text() (note the single leading slash) is better.

From the XPath W3C recommendation:

// is short for /descendant-or-self::node()/.

A more specific path, such as /Task/Created, is typically more efficient (and less likely to select nodes that you didn’t intend to :slight_smile:).
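
For example, a sketch of the xml filter with the more specific path (keeping store_xml => false from the reproduction above):

xml {
  source    => "message"
  xpath     => [ "/Task/Created/text()", "created" ]
  store_xml => false
}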

Try this filter:

filter {
    xml {
        source => "message"
        target => "@metadata[xml_content]"
        force_array => false
    }
    # Copy XML content to first-level fields with all-lowercase names
    ruby {
        code => '
            event.get("@metadata[xml_content]").each do |key, value|
                event.set(key.downcase, value)
            end
        '
    }
    mutate {
         remove_field => ["message", "@metadata"]
         convert => {
           "durationseconds" => "integer"
         }
    }
    date {
        match => ["created", "ISO8601"]
    }
}

Notes:

  • @magnusbaeck: I thought @metadata wasn’t supposed to get passed through to the output, but it does get included in output to stdout and Elasticsearch. Hence its presence in remove_field. Did I miss a memo? (I’m using Logstash 5.2.1.)
  • That Ruby code is a workaround for the issue I describe in “Set target of xml filter to root?”
  • If you want to preserve the original case of the XML element names, remove .downcase from key.downcase.

Example Logstash output

In JSON format:

"@timestamp": "2015-12-22T08:20:03.000Z",
"completedtime": "2015-12-22T08:21:11",
"created": "2015-12-22T08:20:03",
"@version": "1",
"host": "58a3fe88f636",
"starttime": "2015-12-22T08:20:06",
"durationseconds": 68,
"taskid": "ServerTasks-5017",
"queuetime": "2015-12-22T08:20:03",
"taskstate": "Success"

More unsolicited tips

If you can (that is, if you are responsible for creating the original XML-format events), consider adding a zone designator to the timestamps. Otherwise, be sure that you understand the repercussions of specifying local times, and how those values might be interpreted.
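
For zone-less timestamps, the date filter's timezone option controls how they are interpreted; a sketch, assuming the events originate in Europe/Stockholm (substitute the actual zone):

date {
  match    => [ "created", "ISO8601" ]
  # Assumption: the zone-less timestamps are local to this zone
  timezone => "Europe/Stockholm"
}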


I have added my two cents to that issue.

This works perfectly! Thank you all for all the tips!
