How to stop duplicate entries using elasticsearch plugin

Morning all.
I'm trying to use the elasticsearch input plugin to pull data from an existing Elasticsearch instance. Both clusters are running older versions of Elasticsearch, so I'm running an older version of Logstash (1.5). I'm having two issues so far.

  1. The Logstash instance runs to a point, and then shuts down (this isn't a major problem if problem 2 can be solved).

  2. When I start up the Logstash instance again, it copies over data which has already been copied over, creating duplicate entries in my new Elasticsearch instance.

Input:

    input {
      elasticsearch {
        hosts => ["kibana.host.name"]
        query => '{ "query": { "match": { "message": "filterMessageHere" } } }'
        docinfo => true
        scroll => '2m'
      }
    }

Output:

    output {
      elasticsearch {
        host => "localhost"
      }
    }

I tried to add document_id => "%{_id}" (since that's the id defined in the source Elasticsearch instance), but had no success.
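For clarity, this is roughly what the output looked like with that attempt (the document_id line being the only addition to the output above):

    output {
      elasticsearch {
        host => "localhost"
        # the raw %{_id} sprintf, which never resolved to the source document's id
        document_id => "%{_id}"
      }
    }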

Any help would be much appreciated. Thanks all.

If the id, type, and destination index of the documents are the same, by default Elasticsearch should not create another instance of the same document, but rather just bump the version number of the already indexed document.

You can always try changing the default action of the output plugin to create instead of the default index (as per https://www.elastic.co/guide/en/logstash/1.5/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-action), so the insert will fail for already existing documents.
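A minimal sketch of that change, assuming the document id is carried over from the source (which requires docinfo => true on the input, as discussed below); the host value is just a placeholder:

    output {
      elasticsearch {
        host => "localhost"
        # "create" fails instead of overwriting when a document with this id already exists
        action => "create"
        document_id => "%{[@metadata][_id]}"
      }
    }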

Keep in mind though that you will potentially be flooded with 40x error responses from ES. Nothing to worry about since it's intended, but they may take up space quickly depending on how many of them there are.

If you look at the example given under the docinfo section of the documentation, it shows how to assign the document id from the metadata fields, which is the default location for this information.
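A sketch of the shape of that example (not a verbatim copy of the docs; the host and index names here are placeholders):

    input {
      elasticsearch {
        hosts => ["localhost"]
        # store the _index, _type and _id of each source document under [@metadata]
        docinfo => true
      }
    }
    output {
      elasticsearch {
        index => "copy-of-%{[@metadata][_index]}"
        document_type => "%{[@metadata][_type]}"
        document_id => "%{[@metadata][_id]}"
      }
    }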

I had set docinfo => true, but didn't see any improvement (unless I need to dump the indexes first and start copying again).

@paz
I tried setting the default action to create_unless_exists earlier, but no joy. I'll try 'create' now. You're right about the logs though, they're flooding with 40x error responses. Am I right in thinking that Logstash is trying to copy over data that it already has, and is erroring out because it already exists? This would explain why my document counts aren't increasing (yet). Eventually I'd expect Logstash to find an index that wasn't copied over and start increasing my document count.

Did you look in the documentation I linked to? The example shows how you set the document id: document_id => "%{[@metadata][_id]}"

That is correct, Elasticsearch refuses to create the document since it already exists, and the error propagates back to Logstash.
When the scroll goes beyond the documents already indexed, you should see those errors stopping and the document count increasing.

@Christian_Dahlqvist
I sure did Christian.

I updated my input file to include docinfo => true and my output file as follows:

    output {
      elasticsearch {
        host => "localhost"
        action => "create"
        index => "logstash-%{YYYY.MM.dd}"
        document_type => "%{[@metadata][_type]}"
        document_id => "%{[@metadata][_id]}"
      }
    }

I'm running an older version of Logstash, so I'm looking at this documentation.

@paz & @Christian_Dahlqvist

I deleted all my indexes and restarted Logstash with the above output config, but I'm getting a constant stream of warnings in the logs and no documents being indexed:

    :message=>"failed action with response of 400, dropping action

Can you post a sample from the Elasticsearch log? There should be more information there on why it returns 400.

I removed the index setting from my output configuration, and now the logs are quiet and documents are being indexed.

Ah, I wonder if it was the index naming that was breaking it:

    {:timestamp=>"2017-06-01T13:35:22.377000+0000", :message=>"failed action with response of 400, dropping action: [\"index\", {:_id=>\"AVxUpH8GqqYqcknYGhjV\", :_index=>\"logstash-%{YYYY.MM.dd}\",\

logstash-%{YYYY.MM.dd} wasn't getting translated. I'll let it index away for now, and circle back here when I restart Logstash. Hopefully it won't duplicate the data this time.
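If I understand the sprintf syntax right, date formats need a leading + to be interpolated from the event's @timestamp, so the index line should presumably have been something like:

    # +YYYY.MM.dd formats the event's @timestamp into a daily index name
    index => "logstash-%{+YYYY.MM.dd}"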

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.