I've set up Logstash with the following pipeline.conf:
input {
  # stdin {}
  # beats {
  #   port => "5044"
  # }
  rss {
    url => "http://feeds.washingtonpost.com/rss/world"
    interval => 3600
    tags => ["rss"]
  }
}
# filter {
#}
output {
  elasticsearch {
    action => "index"
    hosts => [ "localhost:9200" ]
    index => "rss"
    workers => 1
  }
  stdout {}
}
This works fine with my setup. Every time the pipeline runs, it indexes whatever it receives from the source (the RSS feed at the URL).
My problem is that each run adds everything in the RSS feed to my index, even items that have already been added. When I search the index, I then get the same result multiple times.
For example, if a document has been indexed five times, a search for that document returns five hits that are identical except for their timestamp (and their auto-generated _id).
For instance, when I run "GET rss/_search?q=border" in the Kibana console, I get the following results:
{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 3.7995694,
    "hits": [
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "T1f8kWMBWYqQmres0sbi",
        "_score": 3.7995694,
        "_source": {
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "message": "Something very, very different is going on at the U.S.-Mexico border.",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@version": "1",
          "@timestamp": "2018-05-24T11:50:54.567Z",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "author": null,
          "tags": [
            "rss"
          ]
        }
      },
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "-Ff3kWMBWYqQmresm8WX",
        "_score": 3.6400082,
        "_source": {
          "@version": "1",
          "author": null,
          "tags": [
            "rss"
          ],
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@timestamp": "2018-05-24T11:45:13.014Z",
          "message": "Something very, very different is going on at the U.S.-Mexico border."
        }
      },
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "AFf4kWMBWYqQmresfMbt",
        "_score": 3.4221885,
        "_source": {
          "@version": "1",
          "author": null,
          "tags": [
            "rss"
          ],
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@timestamp": "2018-05-24T11:46:11.134Z",
          "message": "Something very, very different is going on at the U.S.-Mexico border."
        }
      },
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "LFf5kWMBWYqQmresZsb6",
        "_score": 3.4221885,
        "_source": {
          "@version": "1",
          "author": null,
          "tags": [
            "rss"
          ],
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@timestamp": "2018-05-24T11:47:11.048Z",
          "message": "Something very, very different is going on at the U.S.-Mexico border."
        }
      }
    ]
  }
}
What I want is a single version of each document: when Logstash receives a document identical to one that is already indexed, it should replace (or delete) the previous one. Is this possible, and if so, what do I need to change to make it work?
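To make the goal concrete, here is a minimal sketch of one possible approach. It is untested against this setup and rests on assumptions: that the link field uniquely identifies an article, that the fingerprint filter plugin is available, and that the key value shown is an arbitrary placeholder. The idea is to derive the Elasticsearch document ID from the link, so that a re-fetched item is written to the same ID rather than to a new auto-generated one:

```
filter {
  # Hash the article link into a metadata field; @metadata is not sent to Elasticsearch.
  fingerprint {
    source => "link"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    # Hypothetical placeholder key, not part of the original config.
    key => "rss-dedup"
  }
}
output {
  elasticsearch {
    action => "index"
    hosts => [ "localhost:9200" ]
    index => "rss"
    # Same link gives the same _id, so indexing again replaces the earlier copy.
    document_id => "%{[@metadata][fingerprint]}"
  }
  stdout {}
}
```

With a fixed ID, the "index" action should overwrite the existing document instead of adding another one. Documents that were already indexed under auto-generated IDs would remain and would need to be cleaned up separately.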