Replacing identical documents in my index


#1

I've set up Logstash with the following pipeline.conf:

input {
#  stdin {}
#  beats {
#    port => "5044"
#  }
  rss {
    url => "http://feeds.washingtonpost.com/rss/world"
    interval => 3600
    tags => ["rss"]
  }
}
# filter {
# }
output {
  elasticsearch {
    action => "index"
    hosts => ["localhost:9200"]
    index => "rss"
    workers => 1
  }
  stdout {}
}

This works fine with my setup: every time the pipeline runs, it indexes whatever it receives from the source (the RSS feed at that URL).

My problem is that this adds everything from the RSS feed to my index, even if that data has already been added. When I search through my index, this gives me multiple copies of the same result.

For example, if a document has been indexed five times, then searching for it returns five results that are identical except for their timestamps.

For instance, when I run "GET rss/_search?q=border" in the Kibana console, I get the following results:

{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 3.7995694,
    "hits": [
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "T1f8kWMBWYqQmres0sbi",
        "_score": 3.7995694,
        "_source": {
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "message": "Something very, very different is going on at the U.S.-Mexico border.",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@version": "1",
          "@timestamp": "2018-05-24T11:50:54.567Z",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "author": null,
          "tags": [
            "rss"
          ]
        }
      },
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "-Ff3kWMBWYqQmresm8WX",
        "_score": 3.6400082,
        "_source": {
          "@version": "1",
          "author": null,
          "tags": [
            "rss"
          ],
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@timestamp": "2018-05-24T11:45:13.014Z",
          "message": "Something very, very different is going on at the U.S.-Mexico border."
        }
      },
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "AFf4kWMBWYqQmresfMbt",
        "_score": 3.4221885,
        "_source": {
          "@version": "1",
          "author": null,
          "tags": [
            "rss"
          ],
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@timestamp": "2018-05-24T11:46:11.134Z",
          "message": "Something very, very different is going on at the U.S.-Mexico border."
        }
      },
      {
        "_index": "rss",
        "_type": "doc",
        "_id": "LFf5kWMBWYqQmresZsb6",
        "_score": 3.4221885,
        "_source": {
          "@version": "1",
          "author": null,
          "tags": [
            "rss"
          ],
          "title": "The Border Patrol tries to win over Hispanic communities — by singing love songs in Spanish",
          "link": "https://www.washingtonpost.com/world/the_americas/the-border-patrol-tries-to-win-over-hispanic-communities--by-singing-love-songs-in-spanish/2018/05/22/4c65b930-53cc-11e8-a6d4-ca1d035642ce_story.html",
          "published": null,
          "Feed": "http://feeds.washingtonpost.com/rss/world",
          "@timestamp": "2018-05-24T11:47:11.048Z",
          "message": "Something very, very different is going on at the U.S.-Mexico border."
        }
      }
    ]
  }
}

What I want is to keep only a single version of each document: when Logstash receives a document identical to one already in the index, it should delete or replace the previous one. Is this possible, and if so, what would I need to do to make it happen?


(Magnus Bäck) #2

Use the elasticsearch output's document_id option to explicitly set the document id of the created documents. If each post has a unique id, you can reference that field. Otherwise, the fingerprint filter can be used to generate a hash from one or more fields.

The second time you fetch the same post, it will get the same document id and overwrite the earlier copy in the index.
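A sketch of what that could look like, assuming the feed's link field uniquely identifies each post (the choice of source field and the [@metadata][fingerprint] target name here are illustrative, not required):

filter {
  fingerprint {
    # Hash the link field; storing the result under @metadata
    # keeps it out of the indexed document itself.
    source => ["link"]
    target => "[@metadata][fingerprint]"
    method => "SHA1"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "rss"
    # Re-fetched posts produce the same id, so they overwrite
    # the existing document instead of creating a duplicate.
    document_id => "%{[@metadata][fingerprint]}"
  }
}

If the link alone is a reliable unique key, you could also skip the filter and set document_id => "%{link}" directly, though hashing keeps the ids at a fixed length.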


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.