We are implementing a discovery service over email metadata for a large stream of emails, and we would like to use a self-generated email ID as the primary key for each document in Elasticsearch (i.e. overwrite the _id field).
The main reason is that we will need to update some documents after they are indexed into ES.
Is this possible with the latest ES versions (7.x)? If so, could you please show an example?
If not, is it possible to perform an update using a Painless script for an ID of our choice? How would the performance of such an update be affected? (Similar to an SQL statement like: UPDATE ... SET some_col = 0 WHERE self_gen_id = 123.)
While ingesting new documents you can set the ID in the request URL:
PUT /<index>/_doc/<self_generated_email_id>
PUT /<index>/_create/<self_generated_email_id>
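A minimal sketch of how those two requests differ, built as plain (method, path, body) triples rather than issued against a live cluster. The index name "emails" and the document fields are placeholders for this example; _doc overwrites an existing document with the same ID, while _create fails with a 409 conflict if the ID already exists.

```python
import json

def index_request(index, email_id, doc, create_only=False):
    """Build the (method, path, body) triple for indexing a document
    under a caller-supplied ID.

    create_only=True uses the _create endpoint, which rejects the request
    if a document with that ID already exists; the default _doc endpoint
    overwrites any existing document with that ID.
    """
    endpoint = "_create" if create_only else "_doc"
    path = f"/{index}/{endpoint}/{email_id}"
    return "PUT", path, json.dumps(doc)

# Placeholder index and fields for this sketch:
method, path, body = index_request(
    "emails", "msg-123@example.com",
    {"subject": "hello", "from": "a@example.com"},
)
```

Any HTTP client can then send the triple to the cluster; the point is only that the self-generated ID lives in the URL, not in the document body.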
Yes, thank you, I am aware of these options.
However, I left out a crucial part of our implementation: we use AWS Kinesis Firehose to stream the bulk of documents.
So in our case we don't have control over the bulk API usage, unfortunately.
Thus I was wondering whether we could somehow overwrite the _id field of a document from within the document body (trying to do so results in the error: "Field [_id] is a metadata field and cannot be added inside a document. Use the index API request parameters.").
Regarding my update-script question, I found that it is possible to do what we need using the _update_by_query API (querying for our generated ID field within the doc), which I assume would result in poor performance compared to the _update API when we have the document ID to update.
If you have any suggestion on how to improve that - please share.
(Note that our update rate is expected to be much lower than the insert rate)
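For reference, a sketch of the _update_by_query request body that mirrors the SQL example from the question. The field names self_gen_id and some_col are the hypothetical names from that example; the body is just built as a dict here, to be POSTed to /<index>/_update_by_query.

```python
import json

def update_by_query_body(id_field, id_value, target_field, new_value):
    """Build an _update_by_query request body that matches documents by a
    custom ID field and sets one field via a Painless script -- roughly:
    UPDATE ... SET target_field = new_value WHERE id_field = id_value.
    """
    return {
        # Exact match on the self-generated ID field (assumed keyword/numeric).
        "query": {"term": {id_field: id_value}},
        "script": {
            "lang": "painless",
            # Passing the new value via params avoids recompiling the
            # script for every distinct value.
            "source": f"ctx._source['{target_field}'] = params.v",
            "params": {"v": new_value},
        },
    }

body = update_by_query_body("self_gen_id", 123, "some_col", 0)
payload = json.dumps(body)
```

Note that options such as conflicts=proceed go on the URL query string rather than in this body.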
You can specify your ID in the bulk API too:
{ "index": { "_id": "<self_generated_email_id>", ... } }
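To make that concrete, here is a small sketch that assembles an NDJSON body for the _bulk API, putting each caller-supplied email ID into the action line so it becomes the document's _id. The index name "emails" and the document fields are placeholders.

```python
import json

def bulk_payload(index, docs_by_id):
    """Build an NDJSON _bulk request body: one action line per document,
    carrying the self-generated email ID in the action's _id, followed by
    the document source on the next line."""
    lines = []
    for email_id, doc in docs_by_id.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": email_id}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

# Placeholder data for this sketch:
payload = bulk_payload("emails", {"msg-1@example.com": {"subject": "hi"}})
```

This is the part Firehose controls in your setup, which is exactly why you cannot inject the _id from the document body alone.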
BTW, I did check whether AWS Firehose supports this and was disappointed to see that it doesn't. It seems like a great feature for AWS to add: they could pop the _id field out of the document (if it exists) and put it into the bulk API request. (Just sharing how I would do this if I could.)
I don't have first-hand experience with mass-updating documents, but take the conflicts parameter into account: if you choose to proceed on version conflicts, some documents may not get updated, and I don't think you can find out which ones from the API response. You can search using your own ID field, but if I am not mistaken it will not allow you to change the _id of the matched documents. Since it uses scroll, it will block segment merges while updates are in progress.
Using a user-supplied _id has an impact on ingestion speed. When the number of documents is small it works great, but as the index grows, ingestion throughput degrades because every insert requires a lookup to check whether that ID already exists. Inserts with autogenerated IDs do not require that lookup.