Create a 'Write Once' Date field that is populated on insert and never updated


(Markmcgookin) #1

As the title says, I want to create a field in my index for 'Created'. It is to be a datetime field recording when the row is inserted.

However, because I am using the bulk insert API, and given the nature of the data, I will essentially be doing an 'upsert' on all the rows on a second pass of the data set, with a few new rows appearing each time.

If I rely on the (deprecated) _timestamp field, it is updated every time I update the document.

Is there any way to add a field that I don't map in my model, and have ES populate it with a default value such as 'now()'? Or, if I have to add the field manually, can I 'lock' it so that a subsequent pass over the data with a different value would NOT update it?

Regards,

Mark


(Nik Everett) #2

Normally the recommendation is to do that in the application talking to Elasticsearch.

It is worth noting that Elasticsearch differs in a lot of ways from a relational database, so saying "row" is kind of a red flag. Not that "row" and "document" are all that different; it's just that if you expect Elasticsearch to do relational database things you'll be disappointed. It makes other architectural choices.

As far as how to do it with bulk, you'd have to use upsert and script and doc and not use doc_as_upsert. That is less than ideal, especially in 2.x because of the script sandboxing problem. 5.0 will be nicer, where you can safely specify a painless language script as part of the upsert. But still, you have to write a script, which is a bit of a pain if all you want to do is merge documents.
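A minimal sketch of one way to read the suggestion above, skipping the script entirely: each bulk `update` action carries a partial `doc` without the timestamp, plus a separate `upsert` document that includes it. The upsert document is only used when the document does not exist yet, so a second pass merges `doc` and leaves the timestamp alone. The field name `created`, the function name, and the index and type names are assumptions for illustration. The `_type` key reflects 2.x/5.x-era requests, so check the exact request shape against your version.

```python
import json
from datetime import datetime, timezone


def build_bulk_upsert(index, doc_type, docs, now=None):
    """Build an NDJSON bulk body of update actions.

    docs maps document id -> dict of fields (without a 'created' field).
    'created' is a hypothetical field name, set only in the upsert body.
    """
    created = (now or datetime.now(timezone.utc)).isoformat()
    lines = []
    for doc_id, fields in docs.items():
        action = {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}
        body = {
            # Merged into the document if it already exists.
            "doc": fields,
            # Indexed as-is only if the document does not exist yet.
            "upsert": dict(fields, created=created),
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(body))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_bulk_upsert("items", "item", {"1": {"title": "first"}}), end="")
```

The resulting string would be sent as the body of a `_bulk` request. If the merge needs more logic than a plain field overwrite, a script would replace `doc` in the second line of each pair.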

As far as "locking" a field, no, that isn't a thing Elasticsearch does. Constraints and validation are among those things relational databases do that Elasticsearch would prefer you handle in your application. Partially it is because we haven't implemented it, and partially it is because we figure you are better off enforcing constraints on the application side, because it is typically cheaper to scale application servers than it is to scale Elasticsearch servers.

For what it is worth we are talking about adding a _last_modified field over here. It might make sense to add a _created field as well, I don't know. I know what you are trying to do is fairly cumbersome with the tools that we have now and not super rare. But have a look at the debate on that issue. We go through a lot before we add a feature like this because we want to make it easy to use, and that requires that we have some idea of how it is going to be used which is difficult.


(Markmcgookin) #3

Nik9000 - Cheers man.

Yeah I understand that it's general practice to do it in the app. I'm using .Net Core and the ElasticsearchCRUD library... I want to do bulk upserts to keep it quick, without having to check every row.

The nature of the data I am processing is that I can see small amounts of metadata for everything, then I want to drill deeper into the new items.

I may end up adding all the small metadata to a temp index, cross reference that with a master, spot the new items (doing well not to say rows) then do manual inserts of those into the master, then destroy the temp index.

That will result in a fast bulk insert, a slow(ish) cross reference, and a fast in-depth data gathering exercise.

Maybe there's a quick way to do a diff on two indexes that I haven't heard of yet that could speed up the middle section... I keep thinking in RDB terms so I want to do a join!


(Nik Everett) #4

There isn't really a quick way to do diffs between two indexes. You can do two scroll requests with the same ordering and walk the two result sets side by side but, yeah, that isn't a thing that we do.
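For illustration, the "walk the two" part could look roughly like this, assuming each scroll has already been reduced to an iterable of document IDs sorted in the same ascending order. The function name and the sample IDs are hypothetical, and fetching the scroll pages is left out.

```python
def ids_missing_from_master(master_ids, temp_ids):
    """Yield IDs that appear in temp_ids but not in master_ids.

    Both inputs must be iterables of IDs sorted ascending, for example
    the IDs read from two scroll requests that use the same sort order.
    """
    master_iter = iter(master_ids)
    current = next(master_iter, None)
    for temp_id in temp_ids:
        # Advance the master side until it catches up with the temp side.
        while current is not None and current < temp_id:
            current = next(master_iter, None)
        if current is None or current != temp_id:
            yield temp_id


if __name__ == "__main__":
    master = ["a", "c", "d"]
    temp = ["a", "b", "c", "e"]
    print(list(ids_missing_from_master(master, temp)))  # ['b', 'e']
```

Each side is read only once, so memory use stays flat however large the indexes are, but every document in both indexes still has to be scrolled.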

You can figure out whether a document was newly created from the response for each entry in a bulk operation. The response is more powerful in 5.0, but I believe you can still get "created": true vs "created": false. That is something, and better than the temp index thing if you can get away with it. Update scripts are fairly powerful at modifying the document in useful ways, but they don't let you look at other documents. If you can use them instead of a temp index it'd be much faster.
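A rough sketch of picking the new items out of a bulk response, assuming it has been parsed into a dict with an `items` list where each entry is keyed by its action name. The `"created"` flag is the one mentioned above. The `"result": "created"` value and the 201 status check are assumptions about other response shapes that may or may not apply to your version, so verify them against a real response.

```python
def newly_created_ids(bulk_response):
    """Return the IDs of bulk items that report having been created."""
    created_ids = []
    for item in bulk_response.get("items", []):
        # Each item is a one-key dict such as {"update": {...}}.
        for result in item.values():
            was_created = (
                result.get("created") is True          # flag named in the thread
                or result.get("result") == "created"   # assumed newer shape
                or result.get("status") == 201         # assumed HTTP-style status
            )
            if was_created:
                created_ids.append(result["_id"])
    return created_ids


if __name__ == "__main__":
    sample = {
        "items": [
            {"update": {"_id": "1", "status": 200}},
            {"update": {"_id": "2", "status": 201}},
            {"index": {"_id": "3", "created": True}},
        ]
    }
    print(newly_created_ids(sample))  # ['2', '3']
```

The returned IDs are the ones worth drilling into on the deeper pass, with no temp index or cross reference.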

