Prevent checking existence of document ID when specified

iamredlus · December 3, 2018, 12:40pm

Hey

It is well documented that when a document ID is specified at ingestion, Elasticsearch first checks whether that ID is already in use, and only then accepts the new document for indexing. This comes with a big performance price tag. We have given up setting our own IDs for performance reasons, but this makes our codebase much more complex, as documents need to be searched for instead of accessed directly.

Why is this check necessary?
We are very sure of the uniqueness of the IDs we are providing, and would expect a collision to be automatically interpreted as an update action.
I would like to suggest a flag that disables the ID check before indexing.

p.s.
If I compare this behavior with mapping, for example: Elasticsearch does not check in advance whether the mapping of the document I am about to index matches the one expected by the index. Instead, it tries to index the document, and in case of a mapping conflict the document is rejected and an exception is thrown.

Thanks
Lior

Mark_Harwood · December 3, 2018, 12:56pm

It would be a pretty catastrophic error if we trusted a client to provide unique IDs without elasticsearch checking them first and it turned out the client had provided us with duplicates. The behaviour of the system at that point would be "interesting".
Relinquishing that fundamental check would be like asking a database to not reject records that violated a primary key constraint. The performance benefits are not to be disputed but that's the sort of feature we'd call a foot gun.

You can use routing to target reads and writes to the appropriate shard.
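
For readers who have not used routing: a minimal sketch of what passing the same routing value on a write and on the matching read could look like at the REST level. The host, the index name `orders`, the document ID, and the routing value `customer-7` are all made up for illustration, and the `_doc` path segment assumes an index laid out that way. The snippet only builds URLs and sends nothing.

```python
from urllib.parse import quote, urlencode


def doc_url(base, index, doc_id, routing=None):
    """Build a single-document URL, optionally with a custom routing value."""
    url = f"{base}/{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}"
    if routing is not None:
        # The same routing value must be supplied on writes and on reads,
        # otherwise the read may be sent to a different shard than the write.
        url += "?" + urlencode({"routing": routing})
    return url


# Hypothetical values, for illustration only.
BASE = "http://localhost:9200"
write_url = doc_url(BASE, "orders", "order-42", routing="customer-7")  # used with PUT
read_url = doc_url(BASE, "orders", "order-42", routing="customer-7")   # used with GET

print(write_url)
```

Because both requests carry the same routing value, they resolve to the same shard.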

iamredlus · December 3, 2018, 1:03pm

@Mark_Harwood thanks for the quick answer.
Can't duplicates fall back to an update? This is what I would expect, as the API suggests (indexing is exposed as an upsert).
What would the behavior of the system be if such a fallback isn't possible and the aforementioned violation happens?

Mark_Harwood · December 3, 2018, 1:10pm

It already does. Elasticsearch attempts a read on the provided ID to know whether an update or insert is the appropriate action - which is where the noted performance cost comes in.
Autogenerated IDs are always assumed to be inserts and therefore skip the read.

It's like Ghostbusters and "crossing the streams" - we should just assume bad things will happen.
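
To make the difference between the two paths concrete, here is a toy in-memory model of the behaviour described above. It is not Elasticsearch's implementation: `ToyShard`, its fields, and the use of `uuid4` for generated IDs are stand-ins chosen for illustration. It only shows that a client-supplied ID costs one lookup per write (to decide between insert and update), while a generated ID goes straight to the insert path.

```python
import uuid


class ToyShard:
    """Toy stand-in for a shard; counts how often an ID lookup is needed."""

    def __init__(self):
        self.docs = {}        # doc_id -> {"_version": int, "_source": dict}
        self.id_lookups = 0   # number of read-before-write checks performed

    def index(self, source, doc_id=None):
        if doc_id is None:
            # Generated ID: assumed to be new, so no lookup is performed.
            doc_id = uuid.uuid4().hex
            self.docs[doc_id] = {"_version": 1, "_source": source}
            return doc_id, "created"

        # Client-supplied ID: read first to decide insert vs. update.
        self.id_lookups += 1
        existing = self.docs.get(doc_id)
        if existing is None:
            self.docs[doc_id] = {"_version": 1, "_source": source}
            return doc_id, "created"
        self.docs[doc_id] = {
            "_version": existing["_version"] + 1,
            "_source": source,
        }
        return doc_id, "updated"


shard = ToyShard()
auto_id, auto_result = shard.index({"msg": "hello"})             # no lookup
_, first_result = shard.index({"msg": "v1"}, doc_id="user-1")    # lookup, insert
_, second_result = shard.index({"msg": "v2"}, doc_id="user-1")   # lookup, update
print(auto_result, first_result, second_result, shard.id_lookups)
```

In this model, skipping the lookup for a supplied ID would mean a second write to `user-1` gets treated as a fresh insert, which is the duplicate situation warned about earlier in the thread.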

