Retrieve data from API, split response array and handle duplicates in Logstash

I am working on an application that pulls in news articles from an API using the http_poller plugin for Logstash. I want the data to be searchable in Elasticsearch. The basic setup that I have so far is working properly:

input {
  http_poller {
    urls => {
      url => "https://newsapi.org/v2/top-headlines?country=us&apiKey=..."
    }
    request_timeout => 60
    schedule => { every => "1m"}
    codec => "json"
    metadata_target => "http_poller_metadata"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "newsapi"
  }

  stdout {
    codec => rubydebug
  }
}

When I query the newsapi index in Elasticsearch, the response looks like this:

{
  "took" : 425,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 98,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "newsapi",
        "_type" : "_doc",
        "_id" : "laF1kGwBzrlxhAK4IjnR",
        "_score" : 1.0,
        "_source" : {
          "http_poller_metadata" : {
            "response_headers" : {
              "pragma" : "no-cache",
              "access-control-allow-methods" : "GET",
              "access-control-allow-headers" : "x-api-key, authorization",
              "cache-control" : "no-cache",
              "expires" : "-1",
              "x-cached-result" : "true",
              "x-cache-remaining" : "97",
              "content-length" : "18137",
              "content-type" : "application/json; charset=utf-8",
              "date" : "Wed, 14 Aug 2019 14:08:04 GMT",
              "access-control-allow-origin" : "*",
              "x-cache-expires" : "Wed, 14 Aug 2019 14:10:04 GMT"
            },
            "code" : 200,
            "response_message" : "OK",
            "request" : {
              "method" : "get",
              "url" : "https://newsapi.org/v2/top-headlines?country=us&apiKey=..."
            },
            "host" : "cfb772d86a2d",
            "name" : "url",
            "runtime_seconds" : 0.128964,
            "times_retried" : 0
          },
          "articles" : [
            {
              "url" : "https://www.destructoid.com/here-s-what-the-sega-genesis-mini-looks-like-stacked-up-to-all-the-other-major-minis-563040.phtml",
              "urlToImage" : "https://destructoid.com/ul/563040-SegaGenesisMiniPreview1.jpg",
              "description" : "Here's what the Sega Genesis Mini looks like stacked up to all the other major minis",
              "publishedAt" : "2019-08-14T13:00:00Z",
              "source" : {
                "id" : null,
                "name" : "Destructoid.com"
              },
              "title" : "Here's what the Sega Genesis Mini looks like stacked up to all the other major minis - Destructoid",
              "content" : """
Medium-mini
We got our hands on a Sega Genesis Mini early, and have the go-ahead to showcase its complete contents.
Naturally, the first thing I did before playing it is see how it compares to all the other minis so far. I have a little mini museum!
There … [+1559 chars]
""",
              "author" : "Chris Carter"
            },
            {
              "url" : "https://www.cnn.com/2019/08/14/investing/dow-stock-market-today/index.html",
              "urlToImage" : "https://cdn.cnn.com/cnnnext/dam/assets/190801162523-03-nyse-0801-super-tease.jpg",
              "description" : "The Dow slid more than 400 points Wednesday after the bond market, for the first time in over a decade, flashed a warning signal that has an eerily accurate track record for predicting recessions.",
              "publishedAt" : "2019-08-14T12:56:00Z",
              "source" : {
                "id" : "cnn",
                "name" : "CNN"
              },
              "title" : "Dow set to tumble after bond market flashes a recession warning - CNN",
              "content" : """
New York (CNN Business)The Dow slid more than 400 points Wednesday after the bond market, for the first time in over a decade, flashed a warning signal that has an eerily accurate track record for predicting recessions.
Here's what happened: The 10-year Trea… [+2126 chars]
""",
              "author" : "David Goldman, CNN Business"
            },
            {
              "url" : "https://www.cheatsheet.com/entertainment/will-harry-styles-play-prince-eric-in-the-little-mermaid-why-turned-down-the-role.html/",
...
...
...

However, with each call from the http_poller I add some duplicates to my index. I would like to add a filter to my Logstash configuration file so I can check for duplicate articles and omit them.

I have extended the configuration file so that Logstash first splits the articles array and creates a separate event for each article that I get from the API.
Next, I want to create a fingerprint for each article based on a unique field, or a combination of fields, for example articles.url. However, I am not sure this is the right approach to detecting duplicates before they are inserted into Elasticsearch. How can I properly reference nested source fields such as articles.url, articles.title or articles.source.name in a filter, and detect duplicates from the API? Any advice is highly appreciated!

What I've come up with so far looks like this:

input {
  http_poller {
    urls => {
      url => "https://newsapi.org/v2/top-headlines?country=us&apiKey=..."
    }
    request_timeout => 60
    schedule => { every => "1m"}
    codec => "json"
    metadata_target => "http_poller_metadata"
  }
}

filter {
  split {
    field => "articles"
  }

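  fingerprint {
    # Nested fields need bracket syntax; "articles.url" would look for a top-level field with that literal name.
    source => "[articles][url]"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "test"
  }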
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "newsapi"
    document_id => "%{[@metadata][fingerprint]}"
  }

  stdout {
    codec => rubydebug
  }
}
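
For the combination-of-fields case, here is a minimal sketch of the filter section. It assumes each split event keeps the article under the nested articles field, and that URL plus title is unique enough to identify an article (that uniqueness is my assumption, not something the API guarantees). It uses Logstash's bracket field references and the fingerprint filter's concatenate_sources option; the key value is only a placeholder.

```
filter {
  # Emit one event per element of the articles array.
  split {
    field => "articles"
  }

  # Hash URL and title together so the same article always yields the same ID.
  fingerprint {
    source => ["[articles][url]", "[articles][title]"]
    concatenate_sources => true
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    # Placeholder key; with a key set, the hash is computed as an HMAC.
    key => "test"
  }
}
```

With document_id set to this fingerprint in the elasticsearch output, a re-polled article should overwrite its existing document instead of creating a new one.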
