What would be the most performant way to maintain a creation timestamp and last update timestamp for each doc?
I am indexing documents which have a set ID. When there is a new ID, the creation timestamp is set. When fields other than the ID are changed in a given document, the last update timestamp is set.
I'm assuming this would need to be done in a script using the update API. So, do I need to manually compare all fields in the old and new documents (ignoring the timestamp fields), set the operation to noop if they are the same, and, if the fields differ, update the document and the last update timestamp while persisting the creation timestamp?
I see there is a detect_noop feature in the update API. Can it be configured to ignore specific fields (like the timestamps)? Can its result be accessed in the script so we can perform operations when no change is detected?
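Something like the following scripted upsert is roughly what I have in mind (a rough sketch only; the index, type, and field names are placeholders, and the new field values are passed as script params so the timestamp fields never enter the comparison):

POST myindex/mytype/1/_update
{
  "scripted_upsert": true,
  "script": {
    "lang": "painless",
    "inline": "boolean changed = false; for (def entry : params.doc.entrySet()) { if (ctx._source.get(entry.getKey()) != entry.getValue()) { changed = true; } } if (changed) { ctx._source.putAll(params.doc); ctx._source.last_updated = params.now; if (ctx._source.created == null) { ctx._source.created = params.now; } } else { ctx.op = 'none'; }",
    "params": {
      "doc": { "title": "new title", "body": "new body" },
      "now": "2017-06-01T12:00:00.000Z"
    }
  },
  "upsert": {}
}

With scripted_upsert set to true the script also runs on the first insert (against the empty upsert document), so created gets set exactly once, and ctx.op = 'none' skips the write when nothing but the timestamps would change.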
You could try creating an ingest pipeline to set the created and updated timestamps; in a pipeline you can both add new document fields and modify existing ones (see the Elasticsearch ingest node documentation).
You can create fairly complex scripts using the Painless scripting language inside such pipelines. Here's an example of a pipeline we use at my company to set the processing time on our documents:
{
  "description" : "Sets the document processingtime",
  "processors" : [
    {
      "script" : {
        "lang" : "painless",
        "inline" : "DateFormat df = new SimpleDateFormat(\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\");\ndf.setTimeZone(TimeZone.getTimeZone(\"UTC\"));\nDate date = new Date();\nctx.processingtime = df.format(date);"
      }
    }
  ]
}
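Registering and using such a pipeline would presumably look roughly like this (the pipeline id "processingtime" and the index and type names are just examples):

PUT _ingest/pipeline/processingtime
{ ...the pipeline definition above... }

PUT myindex/mytype/1?pipeline=processingtime
{
  "some_field" : "some value"
}

Every document indexed with the pipeline parameter then gets the processingtime field stamped on its way in.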
It's fairly efficient, too: we run millions of docs through this pipeline every day without any noticeable delay in our indexing.
I haven't used the update API, so I don't know how it compares to using pipelines, but for us the ingest pipeline didn't add any noticeable delay to the indexing phase. We even did a full re-feed of ~250 million docs from the database, which was just as fast (about 36 hours) as before we added the "processingtime" pipeline.
As we're not doing updates, just indexing new documents (which may be updates of old ones, in which case they simply overwrite them), I really don't know. My guess is that this should be handled in the application layer: bulk-read N docs by _id, check them for changes against the new batch, and then send only the ones that have changed for bulk indexing / updating.
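In request terms that flow might look roughly like this (index, type, ids, and field values are made up): fetch the current versions in one round trip, diff in the application (ignoring the timestamp fields), and bulk-index only the docs that actually changed:

GET myindex/mytype/_mget
{
  "ids" : ["1", "2"]
}

POST _bulk
{ "index" : { "_index" : "myindex", "_type" : "mytype", "_id" : "2" } }
{ "title" : "changed title", "created" : "2017-01-01T00:00:00.000Z", "last_updated" : "2017-06-01T12:00:00.000Z" }

Here the application copies created forward from the fetched doc and stamps a fresh last_updated before the bulk call.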