Duplicate Issue - document_id, how to prevent overwriting of entries

Hi All,

Some background information:

I have duplicate entries in my Elasticsearch indexes.

I have used document_id, which prevents duplicates from appearing. The problem is that the duplicate overwrites (updates) the existing document, effectively removing the older copy, which is the 'correct' one.

Can anyone point out whether there is any way to prevent the overwriting/updating, and instead tell Elasticsearch to ignore a duplicate that was detected via document_id?

Hi,

Elasticsearch has a parameter called op_type; see the Index API page of the Elasticsearch Reference [7.11].

Set this parameter to create so that a document is indexed only if its ID does not already exist.

Best regards
Wolfram

Hi Wolfram,

Appreciate your response.

If you don't mind, could you point out how I can apply your recommendation?

For example, am I supposed to edit a config file somewhere, or am I supposed to do something in the same logstash.conf file?

Regards
Sam

Hello Sam,

This depends on how you are ingesting the data.
When ingesting directly into Elasticsearch, op_type is a URL parameter:

PUT my-index-000001/_doc/1?op_type=create

When using Logstash, you can configure it in the elasticsearch output:

    output {
      elasticsearch {
        action => "create"
      }
    }
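
For context, here is a slightly fuller sketch of the same output with the ID setting included, since "create" only helps when every event carries a deterministic document_id. The host, the index name and the field my_unique_field are assumptions for illustration, not values from your setup:

```
output {
  elasticsearch {
    # Assumed values: adjust hosts and index to your own cluster.
    hosts => ["http://localhost:9200"]
    index => "my-index-000001"
    # Hypothetical field that holds the unique key of each event.
    document_id => "%{my_unique_field}"
    # "create" fails for an ID that already exists instead of overwriting it.
    action => "create"
  }
}
```

With this in place, Elasticsearch should reject a second event with the same ID with a version conflict (HTTP 409) and leave the stored document untouched. As far as I know, Logstash then logs a warning for that event and does not retry it.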

Best regards
Wolfram

Hi Wolfram,

Apologies, I forgot to mention how I am ingesting the data.

Yes, I am ingesting it via Logstash.

From the example you showed me, all I need to do is add the action => "create" line to my conf file.

After that, Logstash will no longer overwrite documents whose ID already exists. If I simulate a duplicate this time round, nothing will happen (unlike before, when documents were overwritten).

Am I interpreting it right?

Regards
Sam

Hello Sam,

Yes, adding action => "create" should solve your issue.
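
If you want to verify the behaviour before changing your real pipeline, a throwaway pipeline along these lines might help. It is only a sketch: the host and the dedup-test index are assumptions, and the fingerprint filter is used here just to give both test events the same ID.

```
input {
  # Emit the same line twice to simulate a duplicate.
  generator {
    lines => ["duplicate test event"]
    count => 2
  }
}

filter {
  # Derive a stable ID from the message so both events get the same ID.
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "MURMUR3"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]            # assumed local cluster
    index => "dedup-test"                         # hypothetical test index
    document_id => "%{[@metadata][fingerprint]}"
    action => "create"
  }
}
```

After running it, the test index should contain a single document, and the second event should show up in the Logstash log as a failed action caused by a version conflict rather than as an overwrite.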

Best regards
Wolfram