Logstash xml input configuration for multiple documents

Hi Team

I'm using http_poller to poll an endpoint which returns XML data as the response. I'm trying to send this XML data to Elasticsearch.

But when I run Logstash, indexing fails. Please have a look at the config below.

A few lines of my XML data:

>     <?xml version="1.0" encoding="UTF-8"?>
>     <feed xmlns="http://www.w3.org/2005/Atom">
>        <generator version="1.0">Alfresco (1.0)</generator>
>        <link rel="self" href="links" />
>        <id>random_id</id>
>        <title>Activities Site</title>
>        <updated>2020-07-29T12:53:16.000-07:00</updated>
>        <entry xmlns='http://www.w3.org/2005/Atom'>
>           <title type="html"><overview></title>
>           <link rel="alternate" type="text/html" href="random link" />
>           <id>249,535,933</id>
>           <updated>2020-07-29T12:53:16.000-07:00</updated>
>           <summary type="html">
>              <![DATA[<a href="random link</a> downloaded document <a href="random link">Overview</a>]]>
>           </summary>
>           <author>
>              <name>name</name>
>              <uri>random</uri>
>           </author>
>        </entry>
>        <entry xmlns='http://www.w3.org/2005/Atom'>
>           <title type="html"><random></title>
>           <link rel="alternate" type="text/html" href="randomuri" />
>           <id>249,535,867</id>
>           <updated>2020-07-29T12:53:10.000-07:00</updated>
>           <summary type="html">
>              <![CDATA[<a href="random">Name</a> download <a href="random">intro</a>]]>
>           </summary>
>           <author>
>              <name>Name</name>
>              <uri>random</uri>
>           </author>
>        </entry>

Logstash.conf:

input {
	http_poller {
		urls => {
			test1 => {
				url => "randomhost"
				method => get
				user => "*********"
				password => "*******"
				headers => {
					"Content-Type" => "text/xml; charset=UTF-8"
				}
			}
		}
		request_timeout => 60
		schedule => { cron => "* * * * * UTC" }
	}
}
filter {
	xml {
		source => "message"
		target => "theXML"
	}
}

#output { stdout { codec => rubydebug } }

output {

    elasticsearch {
      index => "logstash-xmldata"
      hosts => "http://elasticsearchhost:80"
      user => "****"
      password => "******"
    }
  }

output:

> [2020-07-29T20:00:02,148][WARN ][logstash.outputs.elasticsearch][main][41e2884444551e51d0256ad578d1476c2186e932e0995e3ce551bbd4c4286a6a] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-xmldata", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x11bd3dd9>], :response=>{"index"=>{"_index"=>"logstash-xmldata", "_type"=>"_doc", "_id"=>"r1MpnHMBz8okLa_0Chk8", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"object mapping for [theXML.title] tried to parse field [null] as object, but found a concrete value"}}}}

Please suggest.

The problem appears to be with [theXML][title]

<title>Activities Site</title>

that is a string (a "concrete value") but the mapping in elasticsearch expects it to be an object.

Check the mapping in elasticsearch.
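For example, in Kibana Dev Tools (the index name is taken from your output config):

```
GET logstash-xmldata/_mapping
```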

Read this post and then this post.


Hi @Badger

Got it. I'm able to ingest the data and see it in Kibana by following the details in the suggested posts.

But in Kibana I see all the data in one xml field. Any suggestions on this?

Logstash will have placed all of the parsed XML inside the top-level theXML field. If you want the object to be moved to the top level you can use a ruby filter, like this.
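The idea boils down to copying each key out of theXML to the top level and then deleting theXML itself. A plain-Ruby sketch of the same transformation on a bare Hash (field values here are made up for illustration):

```ruby
# Stand-in for the Logstash event: a Hash with the parsed XML nested
# under "theXML", the way the xml filter's target option produces it.
event = {
  "message" => "<feed>...</feed>",
  "theXML"  => { "id" => "random_id", "title" => "Activities Site" }
}

# Promote every key under theXML to the top level, then drop theXML itself.
event["theXML"].each { |k, v| event[k] = v }
event.delete("theXML")
```

Inside a real ruby filter the same loop uses event.get and event.set instead of plain Hash access.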

Hi @Badger

This is the Kibana output without the ruby filter; all the data is under the theXML.entry field.

This is the Kibana output with the ruby filter below:

ruby {
	code => '
		event.get("theXML").each { |k, v|
			event.set(k, v)
		}
		event.remove("theXML")
	'
}

Now the data is under the entry field.

Do I need to make any changes to the ruby filter? Something like event.get("theXML.entry")?
Also, how about entry.updated? Will that be parsed as well?

Please suggest

Thank you

Well, the sample data you posted is not valid XML, and I suspect the structure is different as you get through more entries.

You might want to use the 'force_array => false' option on the xml filter.

Since there are multiple <entry> elements that is always going to be an array. You might want to use a split filter to break those up into separate events. Maybe not, depends on your use case.
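A minimal sketch of that, assuming the entries end up in a top-level entry field once they are moved out of theXML:

```
split {
  field => "entry"
}
```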

How you end up with an entry.updated array I cannot guess.

I posted only the first few lines of the data, so it looks invalid. I tried converting it to JSON (using external editors) and that worked.

Let me try these options and see. However, I'm still wondering about entry.updated and other similar fields.

Will let you know If i find anything interesting.

Thanks
Rahul

Hi @Badger

The following filter config using split worked well.

filter {
	xml {
		source => "message"
		target => "theXML"
		force_array => false
	}
	ruby {
		code => '
			event.get("theXML").each { |k, v|
				event.set(k, v)
			}
			event.remove("theXML")
		'
	}
	split {
		field => "entry"
		remove_field => "message"
	}
}

The data in Kibana is good, but I see a _jsonparsefailure tag. Is there any way I can understand what is failing?

I think that comes from your input. You didn't set the codec parameter, so it tried to use its default (json).


Oh yeah, got it. Makes sense. @Jenni

Is there a way to specify xml? I didn't see it in the documentation.

I didn't see anything either. I think you can just use plain for this, as you already have an xml filter anyway.
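For example, inside the http_poller input block:

```
codec => plain
```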


sure @Jenni

Thank you :slight_smile:

Hi @Badger @Jenni

The above conf worked, but the message field is being added to every document (in ES) with all the raw data. Any inputs on how to avoid this?

(Edit: There was a wrong test and an assumption that split doesn't call remove_field if there was only one entry. But Badger proved me wrong below. This is long and unnecessary, so I am getting rid of it. Have a look at the edit history of this post if you are interested in my idiocy :slight_smile: )

If you move the remove_field option to a separate mutate filter, it should work.


I would add the remove_field => [ "message" ] to the xml filter, so that it is only removed if it is successfully parsed.
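So the xml filter would end up looking something like this:

```
xml {
  source => "message"
  target => "theXML"
  force_array => false
  remove_field => [ "message" ]
}
```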

@Jenni, the split filter will not decorate the event (i.e. filter_matched is not called) if the field is a string that does not contain the terminator.


Ah. Sorry. Thanks. I had wrongly assumed that add_field would keep my array as an array.

(But a feed with only one entry could still cause problems with split because it would be a hash instead of an array, wouldn't it?)

Adding the remove_field as a separate filter worked as well. However, adding it in the xml filter makes more sense.

Thank you both :slight_smile:

If the field were a hash you would get

logger.warn("Only String and Array types are splittable. field:#{@field} is of type = #{original_value.class}")
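One way to guard against that single-entry case (a sketch, not from the thread): normalize the field to an Array before split runs. Note that Kernel#Array turns a Hash into an array of key/value pairs, so an explicit class check is safer:

```ruby
# A single <entry> parses to a Hash; multiple entries parse to an Array.
entry_single = { "title" => "only entry" }
entry_multi  = [{ "title" => "first" }, { "title" => "second" }]

# Array() would turn the Hash into [["title", "only entry"]], so wrap explicitly.
normalize = ->(v) { v.is_a?(Array) ? v : [v] }

normalize.call(entry_single)  # => [{ "title" => "only entry" }]
normalize.call(entry_multi)   # => unchanged
```

In the pipeline this could live in the existing ruby filter (wrapping the entry value in an array when it is a Hash) so that the split filter always sees an Array.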