Dealing with duplicate documents

Hi All,

I'm trying to develop a database solution for some event data coming off an embedded industrial system. Unfortunately, because of the way it is designed, it generates a lot of duplicate events. Some are genuine exact duplicates, and others are updates to previous events (i.e. filling in the end time of an event after it has finished).

Of course, I could use the Update API's upsert functionality for the updates. However, I won't know the document ID of the event I'm trying to update when a new event arrives, and from what I can tell (admittedly I'm a complete Elasticsearch newbie, so feel free to correct me!) you need to know the document ID of the document you're upserting to use the Update API.

In traditional SQL land I'd have to do a SELECT and then an UPDATE/INSERT as two independent operations. Do I have to follow the same programming model in Elasticsearch, or is there a better, more efficient way? I don't mind if the process happens asynchronously in the background with eventual consistency, as long as the system doesn't grind to a halt (which is what's currently happening to the SQL database!).



If you know which fields in the event uniquely identify it, which I assume is the case since you state you would be able to search for it, you can build a key from those fields and use it instead of letting Elasticsearch assign a document ID automatically. You can create the key by simply concatenating the appropriate fields, or by calculating a sufficiently large hash of those fields to use as the key.
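To make that concrete, here is a minimal sketch in Python of the hashing approach. The field names (machine, code, start_time) are purely illustrative assumptions about what might uniquely identify one of your events; substitute whatever fields actually do.

```python
import hashlib

def event_id(event, key_fields):
    """Build a deterministic document ID by hashing the fields that
    uniquely identify an event. (event_id and key_fields are
    illustrative names, not part of any Elasticsearch API.)"""
    # Join the identifying values with a separator that won't appear
    # in the values themselves, then hash the result.
    raw = "\x1f".join(str(event[f]) for f in key_fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Two records describing the same event (the second fills in end_time)
# map to the same ID, so the second write targets the same document.
start = {"machine": "press-4", "code": "E17",
         "start_time": "2015-06-01T10:00:00"}
finished = {"machine": "press-4", "code": "E17",
            "start_time": "2015-06-01T10:00:00",
            "end_time": "2015-06-01T10:05:00"}

key_fields = ["machine", "code", "start_time"]
assert event_id(start, key_fields) == event_id(finished, key_fields)
```

If you use this value as the document `_id` when indexing, exact duplicates simply overwrite each other, and the "update" events can go through the Update API with `doc_as_upsert` set to true, so the second write merges in the new fields rather than requiring a separate search first.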

Ah, I see. I had assumed that the key had to be numeric (another holdover from old-school SQL, I suppose). If it can be text, as you say, then I see how it can be done.


Chris, I believe the _id is always stored/represented as unindexed text, even if you think you are giving it an integer from your DB. It's concatenated with the document _type to get the _uid string.