Creation and update timestamps for each doc?

twigbranch · July 17, 2017, 9:08pm

What would be the most performant way to maintain a creation timestamp and last update timestamp for each doc?

I am indexing documents that each have a set ID. When a new ID arrives, the creation timestamp is set. When any field other than the ID changes in a given document, the last update timestamp is set.

I'm assuming this would need to be done in a script using the update API. If so, do I need to manually compare all fields in the old and new documents, ignoring the timestamp fields? That is, set the operation to noop if they are the same, and otherwise update the document, set the last update timestamp, and preserve the creation timestamp?

I see there is a detect noop feature in the update API. Can it be configured to ignore specific fields (like the timestamps)? Can its value be accessed in the script so we can perform operations when no change is detected?
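
To make the comparison described above concrete, here is a minimal application-side sketch in Python. It is not an update API script, just an illustration of the logic. The field names created_at and updated_at and the function names are assumptions; the thread does not name them.

```python
from datetime import datetime, timezone

# Hypothetical timestamp field names (not taken from the thread).
TIMESTAMP_FIELDS = ("created_at", "updated_at")


def strip_timestamps(doc):
    """Return a copy of doc without the timestamp fields."""
    return {k: v for k, v in doc.items() if k not in TIMESTAMP_FIELDS}


def merge_with_timestamps(old_doc, new_doc, now=None):
    """Return (op, doc), where op is 'create', 'update' or 'noop'.

    old_doc is the stored document (None if the ID is new),
    new_doc is the incoming document.
    """
    now = now or datetime.now(timezone.utc).isoformat()
    body = strip_timestamps(new_doc)

    if old_doc is None:
        # New ID: both timestamps start at the same instant.
        return "create", {**body, "created_at": now, "updated_at": now}

    if strip_timestamps(old_doc) == body:
        # Nothing but timestamps could differ: leave the stored doc alone.
        return "noop", old_doc

    # Real change: keep the original creation time, bump the update time.
    created = old_doc.get("created_at", now)
    return "update", {**body, "created_at": created, "updated_at": now}
```

The comparison is a plain dict equality check, so nested objects and lists are compared by value. Whether that is strict enough depends on how the documents are serialized.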

Bernt_Rostad · July 18, 2017, 9:39am

You could try creating an ingest pipeline to set the created and updated timestamps. In a pipeline you can both add new document fields and modify existing ones. Ref:

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/put-pipeline-api.html

You can create fairly complex scripts using the Painless scripting language inside such pipelines. Here's an example of a pipeline we use at my company to set the processing time on our documents:

{
  "description" : "Sets the document processingtime",
  "processors" : [
    {
      "script" : {
        "lang" : "painless",
        "inline" : "DateFormat df = new SimpleDateFormat(\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\");\ndf.setTimeZone(TimeZone.getTimeZone(\"UTC\"));\nDate date = new Date();\nctx.processingtime = df.format(date);"
      }
    }
  ]
}

It's fairly efficient too: we run millions of docs through this pipeline every day without any noticeable delay in our indexing.

Good luck!

twigbranch · July 18, 2017, 4:17pm

A couple of questions:

Is the pipeline script approach faster and/or less resource intensive than the update API script approach?
What would be the fastest way to detect/find the changes in the old and new documents for a given id?

Bernt_Rostad · July 19, 2017, 7:02am

I haven't used the update API, so I don't know how it compares to using pipelines. For us, the ingest pipeline didn't add any noticeable delay to our indexing phase: we even did a full re-feed from the database (~250 million docs), which was just as fast (about 36 hours) as before we added the "processingtime" pipeline.
We're not doing updates, just indexing new documents (which may be updates to old ones, in which case they simply overwrite them), so I really don't know the fastest way to detect changes. My guess is that this should be handled in the application layer: bulk-read N docs by _id, check them for changes against the new batch of docs, and keep only those that have changed for bulk indexing/updating.
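
A rough Python sketch of that application-layer idea follows. Here existing_by_id stands in for whatever a bulk read by _id returns (for example a multi-get request), already keyed by ID. The function name, the "id" key and the ignored field names are all assumptions for illustration.

```python
def split_batch(existing_by_id, incoming, ignore=("created_at", "updated_at")):
    """Partition an incoming batch into new, changed and unchanged docs.

    existing_by_id: dict mapping ID -> currently stored document.
    incoming: list of new documents, each carrying its ID under "id".
    ignore: fields excluded from the comparison (hypothetical names).
    """
    def body(doc):
        # Compare everything except the ignored timestamp fields.
        return {k: v for k, v in doc.items() if k not in ignore}

    result = {"new": [], "changed": [], "unchanged": []}
    for doc in incoming:
        old = existing_by_id.get(doc["id"])
        if old is None:
            result["new"].append(doc)
        elif body(old) != body(doc):
            result["changed"].append(doc)
        else:
            result["unchanged"].append(doc)
    return result
```

Only the "new" and "changed" lists would then go into the bulk request, so unchanged documents never get a fresh update timestamp.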
