Dealing with duplicate documents

Hi All,

I'm trying to develop a database solution for some event data coming off an embedded industrial system. Unfortunately, because of the way it is designed, it generates a lot of duplicate events. Some are genuine exact duplicates, and others are updates to previous events (i.e. filling in the end time of an event after it has finished).

Of course, I could use the Update API's upsert functionality for the updates. However, I won't know the document ID of the event I'm trying to update when a new event arrives, and from what I can tell (admittedly I'm a complete Elasticsearch newbie, so feel free to correct me!) you need to know the document ID of the document you're upserting to use the Update API.

In traditional SQL land I'd have to do a SELECT and then an UPDATE/INSERT as two independent operations. Do I have to follow the same programming model in Elasticsearch, or is there a better, more efficient way? I don't mind if the process happens asynchronously in the background with eventual consistency, as long as the system doesn't grind to a halt (which is what's currently happening to the SQL database!).



If you know which fields in the event uniquely identify it, which I assume is the case since you state you would be able to search for it, you can build a key from those fields and use it instead of letting Elasticsearch assign a document ID automatically. You can create the key by simply concatenating the appropriate fields, or by calculating a sufficiently large hash of those fields to use as the key.
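To make that concrete, here is a minimal sketch in Python of the hashing approach. The field names (machine, code, start_time) are purely illustrative assumptions about what might uniquely identify one of your events; substitute whatever fields actually do.

```python
import hashlib

def event_id(event, key_fields):
    """Build a deterministic document ID by hashing the fields that
    uniquely identify an event. (event_id and key_fields are
    illustrative names, not part of any Elasticsearch API.)"""
    # Join the identifying values with a separator that won't appear
    # in the values themselves, then hash the result.
    raw = "\x1f".join(str(event[f]) for f in key_fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Two records describing the same event (the second fills in end_time)
# map to the same ID, so the second write targets the same document.
start = {"machine": "press-4", "code": "E17",
         "start_time": "2015-06-01T10:00:00"}
finished = {"machine": "press-4", "code": "E17",
            "start_time": "2015-06-01T10:00:00",
            "end_time": "2015-06-01T10:05:00"}

key_fields = ["machine", "code", "start_time"]
assert event_id(start, key_fields) == event_id(finished, key_fields)
```

If you use this value as the document `_id` when indexing, exact duplicates simply overwrite each other, and the "update" events can go through the Update API with `doc_as_upsert` set to true, so the second write merges in the new fields rather than requiring a separate search first.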

Ah, I see. I had assumed that the key had to be numeric (another holdover from old-school SQL, I suppose). If it can be text, as you say, then I see how it can be done.


Chris, I believe the _id is always stored/represented as unindexed text, even if you think you are giving it an integer from your DB. It's concatenated with the document _type to get the _uid string.