Scripted update to select minimum value in list

I have series of documents that currently have a field that is a list of integers (date timestamps, but this doesn't really matter). Some have a single value, some have many (possibly hundreds). For example:

"date" : [1538508710]
or 
"date" : [1538405989, 1550215297, 1550236107]

I would like to create a script that would update all these documents and change it from being a list to being a single integer, such that the integer is the smallest value in the list. So the fields above would instead look like:

"date" : 1538508710
or
"date" : 1538405989

Is it possible to do this with a scripted update? Can I do something like:

"inline": "ctx._source.date = min([d for d in ctx._source.date]),",

Yes, this is possible using update-by-query.

It looks like your suggested script is trying to use python generators. These are not possible in painless, but instead java streams are supported. It would look something like this:

ctx._source.date = ctx._source.date.stream().min(Comparator.comparing(Integer::valueOf)).get();

Note however that it is best practice to make this operation idempotent, so that if some issues causes your query to fail you will be able to restart without handling any documents already processed.

A separate (temporary) boolean field would probably be easiest for this. You can then make a second update-by-query that removes the field (where your query searches for the field existing on a document). You could also instead make your script resilient to being re-run for the same document by checking the type of ctx._source.date (using instanceof List), but whether that is easier than the first option depends on your total data size and how costly it is to rerun your update-by-query on the entire dataset.

Do you mean tracking which documents have already had the script run against? Is there a way to do this easily? Or are you suggesting something like if you add a new field, use the match part of an update query to find the documents who are missing the new field?

Is there a good way to estimate this cost? We've got a 40-50 million rows to update.

Or are you suggesting something like if you add a new field, use the match part of an update query to find the documents who are missing the new field

Yes this is what I was suggesting.

Is there a good way to estimate this cost? We've got a 40-50 million rows to update.

The best way is just to time a sample. Then you can extrapolate how long the entire update would take.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.