Question on using my own value as _id

Hi. As per this post, it is not advisable to bring your own _id to Elasticsearch.

I have inherited a shared Windows drive (horror of all horrors) as a source of log files, and I already see occasional re-ingestion when using Filebeat. However, each event has a GUID as one of its fields.

I was thinking of using it as the _id so that an accidentally re-ingested event overwrites the existing one rather than ending up as a duplicate.

The key thing is that I am not going to do a lookup. It will be a blind overwrite, at least from my point of view. Is there any way I can instruct Elasticsearch not to look up that document behind the scenes and just fire a write command?

No, that is not possible. If you provide your own ID, Elasticsearch needs to check whether the write is an update to an existing document. This check is what prevents duplicates, which is the behaviour you are looking for.
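For reference, here is a minimal sketch of what supplying the GUID as the _id could look like when building a `_bulk` request body by hand. The field name `guid`, the index name `logs-example`, and the helper `bulk_index_lines` are assumptions made up for illustration; only the action/source line layout comes from the bulk API itself.

```python
import json


def bulk_index_lines(events, index_name, id_field="guid"):
    """Build an NDJSON body for the _bulk API, using each event's GUID as _id.

    `id_field` and `index_name` are hypothetical names for this example.
    """
    lines = []
    for event in events:
        # Action line: an "index" operation with an explicit _id replaces
        # any existing document that has the same _id.
        action = {"index": {"_index": index_name, "_id": event[id_field]}}
        lines.append(json.dumps(action))
        # Source line: the event itself.
        lines.append(json.dumps(event))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


events = [
    {"guid": "a1", "message": "first"},
    {"guid": "a1", "message": "first"},  # simulated accidental re-ingest
]
body = bulk_index_lines(events, "logs-example")
```

Both entries carry the same _id, so the second one would replace the first instead of creating a second document. The existence check described above still happens on the Elasticsearch side.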

OK, got it. I will have to live with that then.
Would it be worth considering as a feature?

Consider what as a feature? The current behaviour is required when you use your own IDs. Each shard in Elasticsearch is a Lucene instance, and Lucene uses immutable segments to store data. These segments have to be searched when you supply your own ID, as you could otherwise end up with duplicate documents with the same ID. What you are suggesting is therefore impossible.
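To make the reasoning concrete, here is a toy model of append-only segments. It is not Elasticsearch or Lucene code, and every name in it (`ToyShard`, `index_with_id`, `blind_append`) is invented for illustration. It only shows why a write with a caller-supplied ID has to search the existing segments first.

```python
class ToyShard:
    """Toy model only: segments are append-only tuples, never edited."""

    def __init__(self):
        self.segments = []     # each segment is a tuple of (doc_id, source)
        self.deleted = set()   # (segment_index, doc_index) marked as deleted
        self.lookups = 0       # how many times existing segments were searched

    def _append_segment(self, doc_id, source):
        # New data always lands in a new segment.
        self.segments.append(((doc_id, source),))

    def index_with_id(self, doc_id, source):
        # Search every existing segment for the ID and mark old copies deleted.
        self.lookups += 1
        for seg_idx, segment in enumerate(self.segments):
            for doc_idx, (existing_id, _) in enumerate(segment):
                if existing_id == doc_id:
                    self.deleted.add((seg_idx, doc_idx))
        self._append_segment(doc_id, source)

    def blind_append(self, doc_id, source):
        # The hypothetical "just write" path: no search, so nothing is replaced.
        self._append_segment(doc_id, source)

    def live_docs(self):
        return [
            doc
            for seg_idx, segment in enumerate(self.segments)
            for doc_idx, doc in enumerate(segment)
            if (seg_idx, doc_idx) not in self.deleted
        ]


checked = ToyShard()
checked.index_with_id("guid-1", {"message": "hello"})
checked.index_with_id("guid-1", {"message": "hello"})  # re-ingest

blind = ToyShard()
blind.blind_append("guid-1", {"message": "hello"})
blind.blind_append("guid-1", {"message": "hello"})  # re-ingest

print(len(checked.live_docs()), len(blind.live_docs()))
```

In this model the checked path ends with one live document, while the blind path ends with two documents sharing the same ID. That is the duplicate the lookup exists to prevent.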

Got it. Thanks.
