Prevent checking existence of document ID when specified


(Lior Redlus) #1

Hey

It is well documented that when a document ID is specified at ingestion, Elasticsearch first checks whether that ID is already in use, and only then accepts the new document for indexing. This comes with a big performance price tag. We have given up setting our own IDs for performance reasons, but this makes our codebase much more complex, as documents need to be searched for instead of directly accessed.

Why is this check necessary?
We are very sure of the uniqueness of the IDs we are providing, and would expect a collision to be automatically interpreted as an update action.
I would like to suggest a flag that disables the ID check before indexing.

p.s.
If I compare this behavior with mapping, for example: Elasticsearch does not check up front whether the mapping of the document I am about to index actually matches the expected one in the index. Instead, it tries to index the document, and in case of a mapping conflict the document is rejected and an exception is thrown.

Thanks
Lior


(Mark Harwood) #2

It would be a pretty catastrophic error if we trusted a client to provide unique IDs without Elasticsearch checking them first, and it turned out the client had provided us with duplicates. The behaviour of the system at that point would be "interesting".
Relinquishing that fundamental check would be like asking a database not to reject records that violate a primary key constraint. The performance benefits are not to be disputed, but that's the sort of feature we'd call a foot gun.

You can use routing to target reads and writes to the appropriate shard.
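For anyone unfamiliar with routing, the idea can be sketched roughly as below. This is a toy model in Python, not Elasticsearch's actual code: `pick_shard` and the key `customer-42` are hypothetical names, and `zlib.crc32` is only a stand-in for whatever hash Elasticsearch really uses. The point is just that the same routing value always maps to the same shard, so a read can go straight to the shard that received the write.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key to a shard number (toy model, stand-in hash)."""
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    # Hash the routing key deterministically, then fold it onto the shards.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Writing and reading with the same routing key target the same shard.
shard_for_write = pick_shard("customer-42", 5)
shard_for_read = pick_shard("customer-42", 5)
```

In this sketch a search that supplies the routing key only has to visit one shard instead of all five.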


(Lior Redlus) #3

@Mark_Harwood thanks for the quick answer.
Can't the behavior fall back to an update for duplicates? This is what I would expect, as the API suggests (indexing is exposed as an upsert).
What would the behavior of the system be if such a fallback isn't possible and the aforementioned violation happens?


(Mark Harwood) #4

It already does. Elasticsearch attempts a read on the provided ID to know whether an update or insert is the appropriate action - which is where the noted performance cost comes in.
Auto-generated IDs are always assumed to be inserts and therefore skip the read.
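A rough way to picture the two paths, as a toy in-memory model in Python. Everything here (`ToyShard`, `id_lookups`, the use of `uuid`) is made up for illustration and says nothing about how Elasticsearch is implemented internally; it only mirrors the decision described above: a client-supplied ID costs a lookup to choose between insert and update, while a generated ID is treated as an insert with no lookup.

```python
import uuid


class ToyShard:
    """Toy model of a shard: a dict of id -> (version, source)."""

    def __init__(self):
        self.docs = {}
        self.id_lookups = 0  # counts the extra reads caused by supplied IDs

    def index(self, source, doc_id=None):
        if doc_id is None:
            # Generated ID: assumed to be new, so no read is needed.
            doc_id = uuid.uuid4().hex
            self.docs[doc_id] = (1, source)
            return doc_id, "created"

        # Supplied ID: read first to decide between insert and update.
        self.id_lookups += 1
        existing = self.docs.get(doc_id)
        if existing is None:
            self.docs[doc_id] = (1, source)
            return doc_id, "created"
        # Same ID seen before: overwrite and bump the version.
        self.docs[doc_id] = (existing[0] + 1, source)
        return doc_id, "updated"


shard = ToyShard()
_, first = shard.index({"a": 1})                # generated ID, no lookup
_, second = shard.index({"a": 1}, doc_id="x")   # lookup, then insert
_, third = shard.index({"a": 2}, doc_id="x")    # lookup, then update
```

In this model the duplicate ID already falls back to an update, and the lookup is exactly what makes that fallback possible.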

It's like Ghostbusters and "crossing the streams" - we should just assume bad things will happen.