Replacing identical documents in my index


#1

I've set up Logstash with the following pipeline.conf:

input {
#  stdin {}
#  beats {
#    port => "5044"
#  }
  rss {
    url => "http://feeds.washingtonpost.com/rss/world"
    interval => 3600
    tags => ["rss"]
  }
}
# filter {
# }
output {
  elasticsearch {
    action => "index"
    hosts => ["localhost:9200"]
    index => "rss"
    workers => 1
  }
  stdout {}
}

This works fine with my setup: every time the pipeline runs, it indexes whatever it receives from the source (the RSS feed at that URL).

My problem is that this adds everything from the RSS feed to my index, even if that data has already been added. When I search through my index, this gives me multiple copies of the same result.

For example, if a document has been indexed five times, then searching for it returns five results that are identical except for their timestamps.

For instance, when I run "GET rss/_search?q=border" in the Kibana console, I get the following results:

{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 3.7995694,
    "hits": [
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "T1f8kWMBWYqQmres0sbi",
        "_score": 3.7995694,
        "_source": {
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "message": "Something very, very different is going on at the U.S.-Mexico border.",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@version": "1",
          "@timestamp": "2018-05-24T11:50:54.567Z",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "author": null,
          "tags": [
            "rss"
          ]
        }
      },
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "-Ff3kWMBWYqQmresm8WX",
        "_score": 3.6400082,
        "_source": {
          "@version": "1",
          "author": null,
          "tags": [
            "rss"
          ],
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@timestamp": "2018-05-24T11:45:13.014Z",
          "message": "Something very, very different is going on at the U.S.-Mexico border."
        }
      },
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "AFf4kWMBWYqQmresfMbt",
        "_score": 3.4221885,
        "_source": {
          "@version": "1",
          "author": null,
          "tags": [
            "rss"
          ],
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@timestamp": "2018-05-24T11:46:11.134Z",
          "message": "Something very, very different is going on at the U.S.-Mexico border."
        }
      },
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "LFf5kWMBWYqQmresZsb6",
        "_score": 3.4221885,
        "_source": {
          "@version": "1",
          "author": null,
          "tags": [
            "rss"
          ],
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@timestamp": "2018-05-24T11:47:11.048Z",
          "message": "Something very, very different is going on at the U.S.-Mexico border."
        }
      }
    ]
  }
}

What I want is to keep only a single version of each document: when Logstash receives a document identical to one already in the index, it should delete or replace the previous one. Is this possible, and if so, what would I need to do to make it happen?


(Magnus Bäck) #2

Use the elasticsearch output's document_id option to explicitly set the document id of the created documents. If each post has a unique id, you can reference that field. Otherwise, the fingerprint filter can be used to generate a hash from one or more fields.

The second time you fetch the same post, it will get the same document id and overwrite the earlier copy in the index.
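A sketch of what that could look like, assuming the feed's link field uniquely identifies each post (the choice of source field and the [@metadata][fingerprint] target name here are illustrative, not required):

filter {
  fingerprint {
    # Hash the link field; storing the result under @metadata
    # keeps it out of the indexed document itself.
    source => ["link"]
    target => "[@metadata][fingerprint]"
    method => "SHA1"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "rss"
    # Re-fetched posts produce the same id, so they overwrite
    # the existing document instead of creating a duplicate.
    document_id => "%{[@metadata][fingerprint]}"
  }
}

If the link alone is a reliable unique key, you could also skip the filter and set document_id => "%{link}" directly, though hashing keeps the ids at a fixed length.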


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.