Scripted update to select minimum value in list

neesha · July 24, 2019, 6:42pm

I have series of documents that currently have a field that is a list of integers (date timestamps, but this doesn't really matter). Some have a single value, some have many (possibly hundreds). For example:

"date" : [1538508710]
or 
"date" : [1538405989, 1550215297, 1550236107]

I would like to create a script that would update all these documents and change it from being a list to being a single integer, such that the integer is the smallest value in the list. So the fields above would instead look like:

"date" : 1538508710
or
"date" : 1538405989

Is it possible to do this with a scripted update? Can I do something like:

"inline": "ctx._source.date = min([d for d in ctx._source.date]),",

rjernst · July 26, 2019, 5:53pm

Yes, this is possible using update-by-query.

It looks like your suggested script is trying to use python generators. These are not possible in painless, but instead java streams are supported. It would look something like this:

ctx._source.date = ctx._source.date.stream().min(Comparator.comparing(Integer::valueOf)).get();

Note however that it is best practice to make this operation idempotent, so that if some issues causes your query to fail you will be able to restart without handling any documents already processed.

A separate (temporary) boolean field would probably be easiest for this. You can then make a second update-by-query that removes the field (where your query searches for the field existing on a document). You could also instead make your script resilient to being re-run for the same document by checking the type of ctx._source.date (using instanceof List), but whether that is easier than the first option depends on your total data size and how costly it is to rerun your update-by-query on the entire dataset.

neesha · July 29, 2019, 3:20pm

Do you mean tracking which documents have already had the script run against? Is there a way to do this easily? Or are you suggesting something like if you add a new field, use the match part of an update query to find the documents who are missing the new field?

Is there a good way to estimate this cost? We've got a 40-50 million rows to update.

rjernst · August 9, 2019, 10:21pm

Or are you suggesting something like if you add a new field, use the match part of an update query to find the documents who are missing the new field

Yes this is what I was suggesting.

Is there a good way to estimate this cost? We've got a 40-50 million rows to update.

The best way is just to time a sample. Then you can extrapolate how long the entire update would take.

system · September 6, 2019, 10:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Using _update_by_query and painless script with hashmap to update based on value Elasticsearch painless	2	646	May 21, 2019
Incrementing Datetime field by one day in Elasticsearch Production Cluster using painless script Elasticsearch painless	2	233	May 9, 2022
Painless Script to Update Array Elasticsearch painless	1	1381	June 16, 2020
Calculations with values of different documents in painless query Elasticsearch painless	11	2338	October 27, 2020
Update all fields containing specific string in field name by painless script Elasticsearch painless	1	424	July 13, 2021

Scripted update to select minimum value in list

Related topics