Avoid reloading (duplicate) CSV records into the same index

Hi All,
I have a working config for loading CSV records from Logstash into Elasticsearch. However, when I restart the Logstash service, the records from the same CSV files are reloaded into the same index, creating duplicate entries. I want to avoid this. Can someone point out what is wrong with this config, or what is missing?

input {
  file {
    path => "/etc/logstash/1.csv"
    start_position => "beginning"
    sincedb_path => "/etc/logstash/sincedb_sample1csv"
  }
  file {
    path => "/etc/logstash/2.csv"
    start_position => "beginning"
    sincedb_path => "/etc/logstash/sincedb_sample2csv"
  }
}

filter {
  if [path] == "/etc/logstash/1.csv" {
    csv {
      separator => ","
      columns => ["column1","column2"]
    }
  }
  if [path] == "/etc/logstash/2.csv" {
    csv {
      separator => ","
      columns => ["column1","column2"]
    }
  }
}
output {
  if [path] == "/etc/logstash/1.csv" {
    elasticsearch {
      action => "index"
      hosts => ["http://192.168.1.1:9200"]
      index => "sample1"
    }
  }
  if [path] == "/etc/logstash/2.csv" {
    elasticsearch {
      action => "index"
      hosts => ["http://192.168.1.1:9200"]
      index => "sample2"
    }
  }
}

Hi there,

Here you are not setting a document_id, so Elasticsearch assigns a random one to each document. That means even if all the fields of two documents are identical, their _id values differ, so Elasticsearch treats each one as a new document.

Check out the fingerprint filter https://www.elastic.co/guide/en/logstash/current/plugins-filters-fingerprint.html :wink:
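
As a rough sketch of what that could look like in your pipeline: hash the fields that uniquely identify a row with the fingerprint filter, then reuse that hash as the document_id in the elasticsearch output. The choice of column1/column2 as source fields and SHA256 as the method are just assumptions here; use whatever combination is actually unique per row in your CSVs.

filter {
  # build a stable hash from the fields that identify a row (assumed: column1 + column2)
  fingerprint {
    source => ["column1", "column2"]
    concatenate_sources => true
    method => "SHA256"
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    action => "index"
    hosts => ["http://192.168.1.1:9200"]
    index => "sample1"
    # same row => same _id, so re-ingesting the file updates the existing
    # document instead of creating a duplicate
    document_id => "%{[@metadata][fingerprint]}"
  }
}

With a deterministic _id like this, replaying the same CSV after a restart overwrites the existing documents rather than adding new copies.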
