How to prevent duplicates in ElasticSearch 2.X

acabrol · February 17, 2016, 8:17pm

Dear all,

I've just upgraded to ES 2.2 from 1.x and discovered that custom settings on "_id" are now deprecated, more specifically the "path" field
(see https://www.elastic.co/guide/en/elasticsearch/reference/1.7/mapping-id-field.html).

I've found this blog, which proposes workarounds to prevent duplicates when inserting docs:

From my point of view, there are several risks with those solutions:

1. The duplicate check is not stored in the index settings, so if several clients write to the index and you do not control all of them, you can still get duplicates.
2. Preprocessing increases the load during insertion, whereas pointing "path" at a unique field solved the duplication issue easily.
3. Duplicate prevention becomes much more complex, so the risk of getting the implementation wrong is significant.

Could you help me find an easy, settings-side way to prevent duplicate inserts in Elasticsearch 2.0 or 2.2?

Regards,
Alexandre.

nik9000 · February 19, 2016, 5:29am

There isn't a settings-side way. The only way to prevent duplication is to manage the _id in your ingesting application. Your point number 1 is valid, but once you have multiple clients you can't trust, you have bigger problems than _id management. I don't buy point number 2 or 3, because any application that can build the document in the first place has the data readily at hand to build the _id.
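
To illustrate managing the _id in the ingesting application, here is a minimal Python sketch that builds a bulk "index" action whose _id is copied from a field of the document. The function name `bulk_action` and the index, type, and field names (`products`, `product`, `sku`) are made up for the example; only the two-line bulk action format is taken from Elasticsearch.

```python
import json


def bulk_action(doc, index, doc_type, id_field):
    """Return the two NDJSON lines of a bulk "index" action for one document.

    The _id is taken from doc[id_field], so sending the same document twice
    targets the same _id instead of creating a second copy.
    """
    # Hypothetical helper: the caller decides which field is unique.
    doc_id = str(doc[id_field])
    action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
    # The bulk API expects one JSON object per line, each ending in a newline.
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"


if __name__ == "__main__":
    doc = {"sku": "A-1", "name": "example product"}
    print(bulk_action(doc, "products", "product", "sku"), end="")
```

The resulting text is what would go in the body of a `_bulk` request; every client that writes to the index has to follow the same rule for this to hold.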

Mark_Harwood · February 21, 2016, 3:37pm

I raised some additional serious issues with that article in their comments section. Unfortunately they are no longer there.

acabrol · February 25, 2016, 7:45pm

Thank you for your answers.

I'll take note of that.

thn · February 26, 2016, 12:48pm

You can use an MD5 hash (or something equivalent) of the document as the _id; this will prevent the duplicates. The way it works is that when ES sees an _id that already exists in the index, I think it replaces the current document with the incoming one (much like an update to an existing document). Solr works the same way.

Note: some may argue that MD5 has potential collisions, but the probability is low. If you are not comfortable with the hash, then look at your document: if it contains something you can use as a unique value, assign that value to _id instead.
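
A rough Python sketch of the hashing idea, assuming the document is a plain JSON-serializable dict. The function name `content_id` is made up for the example. The keys are sorted before hashing so that two documents with the same content but a different key order get the same _id.

```python
import hashlib
import json


def content_id(doc):
    """Return a hex MD5 digest of the document, usable as its _id."""
    # Serialize canonically: sorted keys and fixed separators, so that
    # key order and whitespace do not change the hash.
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = {"user": "alice", "msg": "hello"}
    b = {"msg": "hello", "user": "alice"}
    # Same content, different key order: the ids match.
    print(content_id(a) == content_id(b))
```

Keep in mind that the hash covers the whole document, so any field that differs between two copies (an ingest timestamp, for example) gives them different ids. If the collision risk is a concern, `hashlib.sha256` can be swapped in the same way.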