Performance/Constraints on provided external document id

soltiz · July 29, 2015, 9:00am

Hi everyone
For deduplication purposes, we would like to provide our own id when inserting/updating documents.
We do not need versioning of documents. The sender always does "UPSERTS" (i.e. it doesn't know whether the document has already been indexed).
The question has several parts:

Is there a performance impact when providing an external ID (because ES will not know whether it is a create or an update)?
Will providing a target version (always "first version") of the document help (because ES will not have to work out a new version under which to store the document)?
Is there any advice on the form of the external ID (max size, sharding algorithm...) that affects performance or storage space?

Best regards

nik9000 · July 29, 2015, 12:07pm

That suggests that zero-padded sequential or mostly sequential ids do best. I've done just fine with sequential ids straight from the database.
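A minimal sketch of the zero-padding idea, assuming numeric database keys. The helper name `to_doc_id` and the width of 12 digits are illustrative assumptions, not anything Elasticsearch requires:

```python
def to_doc_id(db_key: int, width: int = 12) -> str:
    """Render a numeric database key as a fixed-width, zero-padded string.

    Fixed-width padding makes the string order match the numeric order,
    so ids generated in sequence also sort in sequence.
    """
    if db_key < 0:
        raise ValueError("db_key must be non-negative")
    doc_id = str(db_key).zfill(width)
    if len(doc_id) > width:
        # The key has more digits than the chosen width allows.
        raise ValueError("db_key does not fit in the chosen width")
    return doc_id
```

Pick a width large enough for the largest key you ever expect, because changing it later changes the ids of existing documents.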

Always providing the same version will break things, I think. Do not provide a version if you always want Elasticsearch to perform the update.

You can provide an external id with an index request rather than an update request. I believe it to be slightly faster to send index requests if you know that the document is new, but it's not tons and tons faster, and it's harder to get right in multithreaded upserting environments. I've done fine by always sending updates.
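To make the two options concrete, here is a rough sketch that only builds the requests and does not send them. It assumes the type-based URL layout (`/{index}/{type}/{id}`) used by Elasticsearch releases from around the time of this thread; newer releases use different paths, so check the documentation for your version. The function names and the index and type names in the usage example are hypothetical.

```python
import json
from urllib.parse import quote


def _doc_path(index: str, doc_type: str, doc_id: str) -> str:
    """Build the document path, escaping each segment for use in a URL."""
    return "/" + "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))


def index_request(index: str, doc_type: str, doc_id: str, doc: dict):
    """Index request with a caller-supplied id.

    Creates the document, or overwrites the whole document if the id exists.
    No version is sent, so Elasticsearch assigns one itself.
    """
    return "PUT", _doc_path(index, doc_type, doc_id), json.dumps(doc)


def upsert_request(index: str, doc_type: str, doc_id: str, doc: dict):
    """Update request that falls back to inserting `doc` if the id is missing."""
    body = {"doc": doc, "doc_as_upsert": True}
    return "POST", _doc_path(index, doc_type, doc_id) + "/_update", json.dumps(body)
```

For example, `upsert_request("docs", "item", "000000000042", {"title": "a"})` yields a `POST` to `/docs/item/000000000042/_update`. The sender does not need to know whether the document already exists in either case; the index form replaces the full document, while the update form merges the supplied fields into it.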