I am working on an application that pulls in news articles from an API using the http_poller
plugin for Logstash. I want the data to be searchable in Elasticsearch. The basic setup that I have so far is working properly:
input {
  http_poller {
    urls => {
      url => "https://newsapi.org/v2/top-headlines?country=us&apiKey=..."
    }
    request_timeout => 60
    schedule => { every => "1m" }
    codec => "json"
    metadata_target => "http_poller_metadata"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "newsapi"
  }
  stdout {
    codec => rubydebug
  }
}
When I query the newsapi index in Elasticsearch, what I get back looks like this:
{
  "took" : 425,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 98,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "newsapi",
        "_type" : "_doc",
        "_id" : "laF1kGwBzrlxhAK4IjnR",
        "_score" : 1.0,
        "_source" : {
          "http_poller_metadata" : {
            "response_headers" : {
              "pragma" : "no-cache",
              "access-control-allow-methods" : "GET",
              "access-control-allow-headers" : "x-api-key, authorization",
              "cache-control" : "no-cache",
              "expires" : "-1",
              "x-cached-result" : "true",
              "x-cache-remaining" : "97",
              "content-length" : "18137",
              "content-type" : "application/json; charset=utf-8",
              "date" : "Wed, 14 Aug 2019 14:08:04 GMT",
              "access-control-allow-origin" : "*",
              "x-cache-expires" : "Wed, 14 Aug 2019 14:10:04 GMT"
            },
            "code" : 200,
            "response_message" : "OK",
            "request" : {
              "method" : "get",
              "url" : "https://newsapi.org/v2/top-headlines?country=us&apiKey=..."
            },
            "host" : "cfb772d86a2d",
            "name" : "url",
            "runtime_seconds" : 0.128964,
            "times_retried" : 0
          },
          "articles" : [
            {
              "url" : "https://www.destructoid.com/here-s-what-the-sega-genesis-mini-looks-like-stacked-up-to-all-the-other-major-minis-563040.phtml",
              "urlToImage" : "https://destructoid.com/ul/563040-SegaGenesisMiniPreview1.jpg",
              "description" : "Here's what the Sega Genesis Mini looks like stacked up to all the other major minis",
              "publishedAt" : "2019-08-14T13:00:00Z",
              "source" : {
                "id" : null,
                "name" : "Destructoid.com"
              },
              "title" : "Here's what the Sega Genesis Mini looks like stacked up to all the other major minis - Destructoid",
              "content" : """
Medium-mini
We got our hands on a Sega Genesis Mini early, and have the go-ahead to showcase its complete contents.
Naturally, the first thing I did before playing it is see how it compares to all the other minis so far. I have a little mini museum!
There … [+1559 chars]
""",
              "author" : "Chris Carter"
            },
            {
              "url" : "https://www.cnn.com/2019/08/14/investing/dow-stock-market-today/index.html",
              "urlToImage" : "https://cdn.cnn.com/cnnnext/dam/assets/190801162523-03-nyse-0801-super-tease.jpg",
              "description" : "The Dow slid more than 400 points Wednesday after the bond market, for the first time in over a decade, flashed a warning signal that has an eerily accurate track record for predicting recessions.",
              "publishedAt" : "2019-08-14T12:56:00Z",
              "source" : {
                "id" : "cnn",
                "name" : "CNN"
              },
              "title" : "Dow set to tumble after bond market flashes a recession warning - CNN",
              "content" : """
New York (CNN Business)The Dow slid more than 400 points Wednesday after the bond market, for the first time in over a decade, flashed a warning signal that has an eerily accurate track record for predicting recessions.
Here's what happened: The 10-year Trea… [+2126 chars]
""",
              "author" : "David Goldman, CNN Business"
            },
            {
              "url" : "https://www.cheatsheet.com/entertainment/will-harry-styles-play-prince-eric-in-the-little-mermaid-why-turned-down-the-role.html/",
              ...
              ...
              ...
However, with each call from the http_poller I add duplicates to my index. I would like to add a filter to my Logstash configuration so that I can check for duplicate articles and omit them.

I have extended the configuration so that Logstash first splits the articles array, creating a separate event for each article returned by the API. Next, I want to create a fingerprint for each entry / article based on a unique field, or a combination of fields, for example articles.url. However, I am not sure whether I am taking the right approach to detect duplicate values before I insert them into Elasticsearch. How can I properly filter on source fields such as articles.url, articles.title or articles.source.name, and detect duplicates coming from the API? Any advice is highly appreciated!
What I've come up with so far looks like this:
input {
  http_poller {
    urls => {
      url => "https://newsapi.org/v2/top-headlines?country=us&apiKey=..."
    }
    request_timeout => 60
    schedule => { every => "1m" }
    codec => "json"
    metadata_target => "http_poller_metadata"
  }
}

filter {
  split {
    # turn the articles array into one event per article
    field => "articles"
  }
  fingerprint {
    # after the split, the nested field has to be referenced with bracket syntax
    source => "[articles][url]"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "test"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "newsapi"
    # use the fingerprint as the Elasticsearch document id
    document_id => "%{[@metadata][fingerprint]}"
  }
  stdout {
    codec => rubydebug
  }
}
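
For the "combination of fields" idea, this is roughly the alternative filter block I had in mind (just a sketch on my part; I am assuming the fingerprint filter's concatenate_sources option is the right way to hash several source fields together):

filter {
  split {
    field => "articles"
  }
  fingerprint {
    # combine several article fields into a single fingerprint
    source => ["[articles][url]", "[articles][title]", "[articles][source][name]"]
    concatenate_sources => true
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "test"
  }
}

Would that, together with document_id => "%{[@metadata][fingerprint]}" in the elasticsearch output, be the proper way to keep re-polled articles from showing up twice in the index?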