Logstash pipeline to filter RSS documents

I've just created a Logstash pipeline that retrieves documents from an RSS feed. I do receive the documents, but the fields I want to add all end up on the same document. Here's an example of the feed:

<rss>
  <item>
    <title>...</title>
    <desc>...</desc>
    <link>...</link>
    <pubdate>...</pubdate>
  </item>
  ...
  <item>
    <title>...</title>
    <desc>...</desc>
    <link>...</link>
    <pubdate>...</pubdate>
  </item>
  ...
</rss>

There are more than 50 <item> elements.

Firstly, I wanted to know whether it is possible to separate each item into its own document.

Then I "just" need to add new field from the , , and to enrich the new document. I already have the regex for each value but I'm totally lost where I add to start on the filter configuration.

To my mind it should use a split filter, and then a grok filter on each new document resulting from the split, to add the new fields from the regexes. Something like the sketch below is what I have in mind.

If anyone can give me a hint on this, it would be appreciated! Thank you.
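
Here is roughly what I mean (just a sketch: the "category" field name and MY_REGEX are placeholders for my real field names and regexes):

filter {
    grok {
        # hypothetical: pull a new "category" field out of the Title value
        # with an inline named capture; MY_REGEX stands in for my actual regex
        match => { "Title" => "(?<category>MY_REGEX)" }
    }
}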

Parsing the XML is working fine with the XML filter:

filter {
    xml {
        remove_namespaces => true
        source            => "message"
        store_xml         => false
        target            => "xmldata"
        xpath => [
            "//title/text()", "Title",
            "//link/text()", "Link",
            "//description/text()", "Description",
            "//pubdate/date()", "PubDate"
        ]
    }
    mutate {
        remove_tag => [
            "_jsonparsefailure",
            "_xmlparsefailure"
        ]
        remove_field => [ "message" ]
    }
}

But the date doesn't work in this case, and the fields in Kibana aggregate all the data from the XML.
Example: my new Title field contains all 50 titles, whereas I want 50 documents, each with a single title.
I'm totally lost; if someone has a clue to help me, please share!

You could use a split filter to get 50 documents, one document for each Title, but then you are going to have exactly the same issue for Link. You could use a second split filter, but then you are going to have 2500 documents, one for every combination of Title and Link. That is not going to be helpful.
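
For illustration, that approach would look like this, using the Title and Link fields that your xpath configuration produces:

    # first split: one event per title
    split { field => "Title" }
    # second split: one event per title/link combination -- 2500 events
    split { field => "Link" }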

I think what you need is one document for each <item> element. You could try

    xml {
        source          => "message"
        store_xml       => true
        target          => "[@metadata][theXML]"
        force_array     => false
        remove_field    => [ "message" ]
    }
    # one event per entry of the [item] array
    split { field => "[@metadata][theXML][item]" }
    ruby {
        # move the item's child fields (title, desc, link, pubdate)
        # up to the top level of the event
        code => '
            event.remove("[@metadata][theXML][item]").each { |k, v|
                event.set(k, v)
            }
        '
    }

which for

<rss>
 <item> <title>A</title> <desc>B</desc> <link>C</link> <pubdate>D</pubdate> </item>
 <item> <title>G</title> <desc>H</desc> <link>I</link> <pubdate>J</pubdate> </item>
</rss>

will produce

{
"@timestamp" => 2023-12-22T17:21:49.955409196Z,
      "link" => "C",
   "pubdate" => "D",
      "desc" => "B",
  "@version" => "1",
     "title" => "A"
}
{
"@timestamp" => 2023-12-22T17:21:49.955409196Z,
      "link" => "I",
   "pubdate" => "J",
      "desc" => "H",
  "@version" => "1",
     "title" => "G"
}
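
As a side note, if you want the pubdate to end up in @timestamp, you could append a date filter after the ruby filter. This is only a sketch: the patterns below assume the feed uses the usual RFC 822 format for RSS dates (e.g. "Fri, 22 Dec 2023 17:21:49 GMT") and may need adjusting to whatever your feed actually sends:

    date {
        # assumes RFC 822 style dates; adjust the patterns to match the feed
        match => [ "pubdate", "EEE, dd MMM yyyy HH:mm:ss zzz",
                              "EEE, dd MMM yyyy HH:mm:ss Z" ]
    }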

Hello,

Thank you @Badger, it works perfectly, as expected!
