How to correctly parse an RSS feed/XML file with Logstash

Hello everyone,
I am currently learning Elasticsearch and Logstash, and I have a job to do.
I want to parse a Google News RSS feed (e.g. google news feed) and index the data from every item.

The RSS feed looks like this:

<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title></title>
      <link></link>
      <guid isPermaLink="false"></guid>
      <pubDate></pubDate>
      <description></description>
      <source></source>
    </item>
    <item>...</item>
  </channel>
</rss>

The thing is, I tried to use the RSS input plugin, but some tags (like <guid>, <source>) weren't saved in the documents.

This was the config I used:

input {
  rss {
    url => "https://news.google.com/rss/search?&scoring=n&num=10&q=intitle:Effondrement%22&hl=fr&gl=FR&ceid=FR:fr"
    interval => 600
    tags => ["Effondrement"]
  }

  rss {
    url => "https://news.google.com/rss/search?&scoring=n&num=10&q=intitle:Foudre%22&hl=fr&gl=FR&ceid=FR:fr"
    interval => 600
    tags => ["Foudre"]
  }

 ...
}

filter {
  fingerprint {
    source => "title"
    method => "MURMUR3"
    target => "fingerprint"
  }

  mutate {
    gsub => [
      "message", "<a[^>]*>(.*?)<\/a>", "\1",
      "message", "&nbsp;", ""
    ]
    copy => { "message" => "description" }
    remove_field => ["message", "event"]
  }
}

output {
...
}

I then tried to use http_poller, but it didn't seem to work properly: I got only two documents, each with the whole file in one attribute.

This was the config I used:

input {
  http_poller {
    urls => {
      effondrement => "https://news.google.com/rss/search?&scoring=n&num=10&q=intitle:Effondrement%22&hl=fr&gl=FR&ceid=FR:fr"
      explosion => "https://news.google.com/rss/search?&scoring=n&num=10&q=intitle:Explosion%22&hl=fr&gl=FR&ceid=FR:fr"
    }
    request_timeout => 60
    schedule => { "every" => "1h" }
    codec => multiline {
      pattern => "<item>"
      negate => "true"
      what => "previous"
    }
  }
}

filter {
  xml {
    source => "message"
    store_xml => false
    xpath => [
      "item/title/text()", "title"
    ]
  }
}

output {
  ...
}

I am probably doing it wrong, so if anyone can show me how to do this properly, I would greatly appreciate it.

Thanks!

PS: I am French, so you may find some mistakes in my message :wink:

Bonjour Thomas and welcome! :wink:

Indeed, the best option in my opinion is to use the Rss input plugin (Logstash Reference [8.13]).

Could you share the output you get when using:

input {
  rss {
    url => "https://news.google.com/rss/search?&scoring=n&num=10&q=intitle:Effondrement%22&hl=fr&gl=FR&ceid=FR:fr"
    interval => 600
    tags => ["Effondrement"]
  }

  rss {
    url => "https://news.google.com/rss/search?&scoring=n&num=10&q=intitle:Foudre%22&hl=fr&gl=FR&ceid=FR:fr"
    interval => 600
    tags => ["Foudre"]
  }
}

filter { }

output {
  stdout {}
}

Thank you for your quick answer.

This is what I get when I use the config you gave me:

[2024-04-02T14:39:23,331][INFO ][logstash.inputs.rss      ][main][80d0374a018e87f16d7ed833870da1df0148aa330bd0411a1af0e81ac9e0d3d5] Command completed {:command=>nil, :duration=>2.246263}
{
      "@version" => "1",
     "published" => 2024-04-01T04:31:00.000Z,
       "message" => "<a href=\"https://news.google.com/rss/articles/CBMitQFodHRwczovL3d3dy5sYWRlcGVjaGUuZnIvMjAyNC8wNC8wMS9pbmZvLWxhLWRlcGVjaGUtbWVuYWNlLWRlZmZvbmRyZW1lbnQtcHJlcy1kZS10b3Vsb3VzZS11bi1yZXN0YXVyYW50LWEtdC1pbC1ldGUtcmVjb25zdHJ1aXQtaWxsZWdhbGVtZW50LXN1ci1kZXMtY2hhcnBlbnRlcy1jYWxjaW5lZXMtMTE4NTA4NTAucGhw0gEA?oc=5\" target=\"_blank\">INFO LA DEPECHE. Menace d'effondrement près de Toulouse : un restaurant a-t-il été reconstruit illégalement sur des ...</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">LaDepeche.fr</font>",
          "Feed" => "https://news.google.com/rss/search?&scoring=n&num=10&q=intitle:Effondrement%22&hl=fr&gl=FR&ceid=FR:fr",
         "event" => {
        "original" => "<a href=\"https://news.google.com/rss/articles/CBMitQFodHRwczovL3d3dy5sYWRlcGVjaGUuZnIvMjAyNC8wNC8wMS9pbmZvLWxhLWRlcGVjaGUtbWVuYWNlLWRlZmZvbmRyZW1lbnQtcHJlcy1kZS10b3Vsb3VzZS11bi1yZXN0YXVyYW50LWEtdC1pbC1ldGUtcmVjb25zdHJ1aXQtaWxsZWdhbGVtZW50LXN1ci1kZXMtY2hhcnBlbnRlcy1jYWxjaW5lZXMtMTE4NTA4NTAucGhw0gEA?oc=5\" target=\"_blank\">INFO LA DEPECHE. Menace d'effondrement près de Toulouse : un restaurant a-t-il été reconstruit illégalement sur des ...</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">LaDepeche.fr</font>"
    },
         "title" => "INFO LA DEPECHE. Menace d'effondrement près de Toulouse : un restaurant a-t-il été reconstruit illégalement sur des ... - LaDepeche.fr",
          "link" => "https://news.google.com/rss/articles/CBMitQFodHRwczovL3d3dy5sYWRlcGVjaGUuZnIvMjAyNC8wNC8wMS9pbmZvLWxhLWRlcGVjaGUtbWVuYWNlLWRlZmZvbmRyZW1lbnQtcHJlcy1kZS10b3Vsb3VzZS11bi1yZXN0YXVyYW50LWEtdC1pbC1ldGUtcmVjb25zdHJ1aXQtaWxsZWdhbGVtZW50LXN1ci1kZXMtY2hhcnBlbnRlcy1jYWxjaW5lZXMtMTE4NTA4NTAucGhw0gEA?oc=5",
          "tags" => [
        [0] "Effondrement"
    ],
    "@timestamp" => 2024-04-02T14:39:23.272492543Z
}
{
      "@version" => "1",
     "published" => 2024-03-11T07:00:00.000Z,
       "message" => "<a href=\"https://news.google.com/rss/articles/CBMilAFodHRwczovL3d3dy5saW5kZXBlbmRhbnQuZnIvMjAyNC8wMy8xMS9wZXJvdS1sYS1mb3VkcmUtdHVlLXVuLWd1aWRlLWV0LWJsZXNzZS1zaXgtdG91cmlzdGVzLWZyYW5jYWlzLWRhbnMtbGEtbW9udGFnbmUtYXV4LXNlcHQtY291bGV1cnMtMTE4MTkyOTgucGhw0gEA?oc=5\" target=\"_blank\">Pérou : la foudre tue un guide et blesse six touristes français dans la montagne aux sept couleurs</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">L'Indépendant</font>",
          "Feed" => "https://news.google.com/rss/search?&scoring=n&num=10&q=intitle:Foudre%22&hl=fr&gl=FR&ceid=FR:fr",
         "event" => {
        "original" => "<a href=\"https://news.google.com/rss/articles/CBMilAFodHRwczovL3d3dy5saW5kZXBlbmRhbnQuZnIvMjAyNC8wMy8xMS9wZXJvdS1sYS1mb3VkcmUtdHVlLXVuLWd1aWRlLWV0LWJsZXNzZS1zaXgtdG91cmlzdGVzLWZyYW5jYWlzLWRhbnMtbGEtbW9udGFnbmUtYXV4LXNlcHQtY291bGV1cnMtMTE4MTkyOTgucGhw0gEA?oc=5\" target=\"_blank\">Pérou : la foudre tue un guide et blesse six touristes français dans la montagne aux sept couleurs</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">L'Indépendant</font>"
    },
         "title" => "Pérou : la foudre tue un guide et blesse six touristes français dans la montagne aux sept couleurs - L'Indépendant",
          "link" => "https://news.google.com/rss/articles/CBMilAFodHRwczovL3d3dy5saW5kZXBlbmRhbnQuZnIvMjAyNC8wMy8xMS9wZXJvdS1sYS1mb3VkcmUtdHVlLXVuLWd1aWRlLWV0LWJsZXNzZS1zaXgtdG91cmlzdGVzLWZyYW5jYWlzLWRhbnMtbGEtbW9udGFnbmUtYXV4LXNlcHQtY291bGV1cnMtMTE4MTkyOTgucGhw0gEA?oc=5",
          "tags" => [
        [0] "Foudre"
    ],
    "@timestamp" => 2024-04-02T14:39:23.272089571Z
}

But as we can see, the guid, source, ... tags are missing.

Have a look at "Add support for additional item elements" (Issue #11, logstash-plugins/logstash-input-rss on GitHub).

It's definitely missing.

I'd suggest forking the code and trying to modify it, and/or commenting on the issue.

Yeah, I saw this page and I tried to modify the code to add the fields I wanted.

It worked, but I did not really like doing it that way. Is there another way to do it?

That is also why I tried to do it without the rss plugin and found http_poller, but I could not make it work (with the config I gave).
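
One possible way to make the http_poller route work, as a minimal sketch rather than a tested configuration: the multiline codec only sees line breaks, so if the feed arrives as a single line (an assumption about the Google News response), the whole body stays in one event per URL. The sketch below instead reads the body as plain text, parses it with the xml filter, and uses the split filter to emit one event per item. The `rss` target name and the final field names are illustrative choices, and the field path assumes the xml filter leaves the root `<rss>` element out of the parsed tree, so printing to stdout first is advisable.

```
input {
  http_poller {
    urls => {
      effondrement => "https://news.google.com/rss/search?&scoring=n&num=10&q=intitle:Effondrement%22&hl=fr&gl=FR&ceid=FR:fr"
    }
    request_timeout => 60
    schedule => { "every" => "1h" }
    # Keep the whole response body as a single event in "message".
    codec => "plain"
  }
}

filter {
  # Parse the full document into a nested structure under "rss"
  # ("rss" is a hypothetical field name chosen for this sketch).
  xml {
    source => "message"
    target => "rss"
    store_xml => true
    # Avoid wrapping every single-valued element in an array.
    force_array => false
  }

  # One event per <item>; this assumes the feed has at least two items,
  # so that [rss][channel][item] is an array.
  split {
    field => "[rss][channel][item]"
  }

  # Promote the item fields to the top level, then drop the raw data.
  # Fields that are absent from an item are simply not renamed.
  mutate {
    rename => {
      "[rss][channel][item][title]"       => "title"
      "[rss][channel][item][link]"        => "link"
      "[rss][channel][item][guid]"        => "guid"
      "[rss][channel][item][pubDate]"     => "pubDate"
      "[rss][channel][item][description]" => "description"
      "[rss][channel][item][source]"      => "source"
    }
    remove_field => ["message", "rss"]
  }
}

output {
  # Inspect the events before sending them to Elasticsearch.
  stdout { codec => rubydebug }
}
```

Two things to expect with this sketch: elements that carry attributes (such as guid with isPermaLink) will probably come out as an object holding the attribute and the text rather than as a plain string, and a feed that returns a single item would not produce an array for the split filter to work on.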

What about sending a PR with your changes?

The problem with that is that in my example I use the Google News RSS feed, but I also want to add other feeds, and they don't always have the same format.

So I think a PR is not the ideal solution.

But it's an RSS specification field, no? So it should apply to any feed, I think.

Otherwise, I honestly don't know how to do that.

Yes, there are basic attributes that are always there, but there can be additional attributes that are specific to certain feeds.
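
For feeds with differing elements, one option is to avoid listing fields at all and keep whatever each item contains. The fragment below is only a sketch under stated assumptions: each event is assumed to already hold one parsed item under `[rss][channel][item]` (a hypothetical field path, for example the result of the xml filter with `store_xml => true` followed by the split filter), and `item` is an illustrative name.

```
filter {
  mutate {
    # Move the whole parsed item, including any feed-specific
    # elements, under a single "item" field.
    rename => { "[rss][channel][item]" => "item" }
    # Drop the raw body and the rest of the parsed document.
    remove_field => ["message", "rss"]
  }
}
```

The trade-off is that the same element may be a string in one feed and an object in another, which can lead to mapping conflicts in Elasticsearch if all feeds go to the same index.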

Thanks for the help anyway. I built a little Rust script to do what I wanted, which is a pity because I wanted to use Logstash.

I am keeping the topic open until the end of the month just in case anyone else has a solution.
