I am working on an application that pulls in news articles from an API using the http_poller
plugin for Logstash. I want the data to be searchable in Elasticsearch. The basic setup that I have so far is working properly:
input {
  http_poller {
    urls => {
      url => "https://newsapi.org/v2/top-headlines?country=us&apiKey=..."
    }
    request_timeout => 60
    schedule => { every => "1m" }
    codec => "json"
    metadata_target => "http_poller_metadata"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "newsapi"
  }
  stdout {
    codec => rubydebug
  }
}
When I query the newsapi index in Elasticsearch, what I get back looks like this:
{
  "took" : 425,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 98,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "newsapi",
        "_type" : "_doc",
        "_id" : "laF1kGwBzrlxhAK4IjnR",
        "_score" : 1.0,
        "_source" : {
          "http_poller_metadata" : {
            "response_headers" : {
              "pragma" : "no-cache",
              "access-control-allow-methods" : "GET",
              "access-control-allow-headers" : "x-api-key, authorization",
              "cache-control" : "no-cache",
              "expires" : "-1",
              "x-cached-result" : "true",
              "x-cache-remaining" : "97",
              "content-length" : "18137",
              "content-type" : "application/json; charset=utf-8",
              "date" : "Wed, 14 Aug 2019 14:08:04 GMT",
              "access-control-allow-origin" : "*",
              "x-cache-expires" : "Wed, 14 Aug 2019 14:10:04 GMT"
            },
            "code" : 200,
            "response_message" : "OK",
            "request" : {
              "method" : "get",
              "url" : "https://newsapi.org/v2/top-headlines?country=us&apiKey=..."
            },
            "host" : "cfb772d86a2d",
            "name" : "url",
            "runtime_seconds" : 0.128964,
            "times_retried" : 0
          },
          "articles" : [
            {
              "url" : "https://www.destructoid.com/here-s-what-the-sega-genesis-mini-looks-like-stacked-up-to-all-the-other-major-minis-563040.phtml",
              "urlToImage" : "https://destructoid.com/ul/563040-SegaGenesisMiniPreview1.jpg",
              "description" : "Here's what the Sega Genesis Mini looks like stacked up to all the other major minis",
              "publishedAt" : "2019-08-14T13:00:00Z",
              "source" : {
                "id" : null,
                "name" : "Destructoid.com"
              },
              "title" : "Here's what the Sega Genesis Mini looks like stacked up to all the other major minis - Destructoid",
              "content" : """
Medium-mini
We got our hands on a Sega Genesis Mini early, and have the go-ahead to showcase its complete contents.
Naturally, the first thing I did before playing it is see how it compares to all the other minis so far. I have a little mini museum!
There … [+1559 chars]
""",
              "author" : "Chris Carter"
            },
            {
              "url" : "https://www.cnn.com/2019/08/14/investing/dow-stock-market-today/index.html",
              "urlToImage" : "https://cdn.cnn.com/cnnnext/dam/assets/190801162523-03-nyse-0801-super-tease.jpg",
              "description" : "The Dow slid more than 400 points Wednesday after the bond market, for the first time in over a decade, flashed a warning signal that has an eerily accurate track record for predicting recessions.",
              "publishedAt" : "2019-08-14T12:56:00Z",
              "source" : {
                "id" : "cnn",
                "name" : "CNN"
              },
              "title" : "Dow set to tumble after bond market flashes a recession warning - CNN",
              "content" : """
New York (CNN Business)The Dow slid more than 400 points Wednesday after the bond market, for the first time in over a decade, flashed a warning signal that has an eerily accurate track record for predicting recessions.
Here's what happened: The 10-year Trea… [+2126 chars]
""",
              "author" : "David Goldman, CNN Business"
            },
            {
              "url" : "https://www.cheatsheet.com/entertainment/will-harry-styles-play-prince-eric-in-the-little-mermaid-why-turned-down-the-role.html/",
              ...
              ...
              ...
However, with each call from the http_poller I add duplicates to my index. I would like to add a filter to my Logstash configuration so that I can check for duplicate articles and omit them.

I have extended the configuration so that Logstash first splits the articles array, creating a separate event for each article returned by the API. Next, I want to create a fingerprint for each entry / article based on a unique field, or a combination of fields, for example articles.url. However, I am not sure whether I am taking the right approach to detect duplicate values before I insert them into Elasticsearch. How can I properly filter on source fields such as articles.url, articles.title or articles.source.name, and detect duplicates coming from the API? Any advice is highly appreciated!
What I've come up with so far looks like this:
input {
  http_poller {
    urls => {
      url => "https://newsapi.org/v2/top-headlines?country=us&apiKey=..."
    }
    request_timeout => 60
    schedule => { every => "1m" }
    codec => "json"
    metadata_target => "http_poller_metadata"
  }
}

filter {
  split {
    # turn the articles array into one event per article
    field => "articles"
  }
  fingerprint {
    # after the split, the nested field has to be referenced with bracket syntax
    source => "[articles][url]"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "test"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "newsapi"
    # use the fingerprint as the Elasticsearch document id
    document_id => "%{[@metadata][fingerprint]}"
  }
  stdout {
    codec => rubydebug
  }
}
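
For the "combination of fields" idea, this is roughly the alternative filter block I had in mind (just a sketch on my part; I am assuming the fingerprint filter's concatenate_sources option is the right way to hash several source fields together):

filter {
  split {
    field => "articles"
  }
  fingerprint {
    # combine several article fields into a single fingerprint
    source => ["[articles][url]", "[articles][title]", "[articles][source][name]"]
    concatenate_sources => true
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "test"
  }
}

Would that, together with document_id => "%{[@metadata][fingerprint]}" in the elasticsearch output, be the proper way to keep re-polled articles from showing up twice in the index?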