Uncertain if Refresh is needed

dude88 · July 2, 2021, 10:34pm

Hi, I need to run approximately 30,000 updateByQuery calls sequentially against my index, which has about 35 million decently large documents. I'm currently setting refresh: true, and the ETA for my script to complete all 30,000 queries is about 2 weeks from now... So I want to speed things up.

The Painless update script updates a field on _source that does not need to be queried against, but each update step will need the latest copy of _source if it pulls in the same document as a previous query.

Basically my situation is like so:

query index
for each document returned, if _source.nonQueriedField doesn't already have tag x, add tag x to _source.nonQueriedField
repeat for each query / tag combination
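
For reference, here is a rough sketch of what one iteration of that loop could look like as an update-by-query request body, built in Python. It assumes nonQueriedField holds a list of tags; the helper name build_update_by_query_body and the example query are made up for illustration and are not my actual script.

```python
# Painless script: append params.tag to the tag list only if it is missing.
# Assumes nonQueriedField is either absent or a list of strings.
PAINLESS_ADD_TAG = """
if (ctx._source.nonQueriedField == null) {
  ctx._source.nonQueriedField = [params.tag];
} else if (!ctx._source.nonQueriedField.contains(params.tag)) {
  ctx._source.nonQueriedField.add(params.tag);
} else {
  ctx.op = 'noop';
}
"""


def build_update_by_query_body(query, tag):
    """Return the JSON body for one query / tag combination (hypothetical helper)."""
    return {
        "query": query,
        "script": {
            "lang": "painless",
            "source": PAINLESS_ADD_TAG,
            "params": {"tag": tag},
        },
    }


# Example: one body per query / tag pair; the field name "keywords" is a placeholder.
body = build_update_by_query_body(
    {"terms": {"keywords": ["car", "automobile"]}}, "tag-vehicle"
)
```

Each body would be sent to the index's _update_by_query endpoint, with the refresh setting being the part I'm unsure about. Setting ctx.op to 'noop' skips the write when the tag is already present.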

Since my queries don't need the latest copy of _source at query time, only at update time, do I still need to set refresh: true? Or put another way, does an updated document ever have multiple copies of _source floating around in the index, between refreshes?

nik9000 · July 2, 2021, 10:52pm

Update by query runs a search to find what to update. It gets the source from the result of the search. So it needs a refresh to see the latest version of each document.

dude88 · July 2, 2021, 10:57pm

Dang... Thanks for the quick reply @nik9000. Can you recommend any alternative approaches to speed things up?

A more detailed exploration of my situation is like so:

I have ~30 million medium to large documents
I have about 30,000 tags
Each tag is basically a unique identifier for a set of n roughly synonymous keywords
For each tag, find all documents that have any of those synonyms in one of 10 keyword fields
For each document, add that unique tag, and a series of other related tags (all passed into the script as params) in a new field

This is useful because later I'll need to run aggregation queries on these tags across the entire index

Christian_Dahlqvist · July 3, 2021, 4:42am

It might be faster to read all documents using a scroll query, determine in code which tags should be added to each one, and then update directly by ID through bulk requests. As this approach will update each document exactly once, you can likely run many processes in parallel on different subsets of data, and no refresh is required.
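
A minimal sketch of the "determine tags in code, then update by ID" part could look like the following. It assumes the official Python client; the field names in KEYWORD_FIELDS, the tag field, and all function names are hypothetical placeholders, and the related tags mentioned earlier in the thread are left out for brevity. The action dictionaries follow the format accepted by elasticsearch.helpers.bulk.

```python
from typing import Dict, Iterable, Iterator, List, Set

# Hypothetical field names; the real index has about 10 keyword fields.
KEYWORD_FIELDS = ["keywords_a", "keywords_b"]
TAG_FIELD = "nonQueriedField"


def build_synonym_index(tag_synonyms: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert {tag: [synonyms]} into {synonym: {tags}} for fast lookups."""
    index: Dict[str, Set[str]] = {}
    for tag, synonyms in tag_synonyms.items():
        for synonym in synonyms:
            index.setdefault(synonym, set()).add(tag)
    return index


def tags_for_document(source: dict, synonym_index: Dict[str, Set[str]]) -> List[str]:
    """Collect every tag whose synonyms appear in the document's keyword fields."""
    found: Set[str] = set()
    for field in KEYWORD_FIELDS:
        value = source.get(field)
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        for keyword in values:
            found |= synonym_index.get(keyword, set())
    return sorted(found)


def bulk_update_actions(
    hits: Iterable[dict], synonym_index: Dict[str, Set[str]], index_name: str
) -> Iterator[dict]:
    """Yield one partial-document update per hit that matched at least one tag."""
    for hit in hits:
        tags = tags_for_document(hit["_source"], synonym_index)
        if tags:
            yield {
                "_op_type": "update",
                "_index": index_name,
                "_id": hit["_id"],
                "doc": {TAG_FIELD: tags},
            }
```

The hits could come from elasticsearch.helpers.scan, and the generated actions could be passed to elasticsearch.helpers.bulk. Because every document is written once with its full tag list, no update ever depends on seeing the result of an earlier one.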

dude88 · July 3, 2021, 5:32am

Oh, this is a really interesting idea, thank you! Could you elaborate on how scroll would enable me to parallelize this work? Is it simply that I keep the scroll active for x amount of time, and in that time, spin up threads locally to handle subsections of that scroll partition and post bulk updates independently of each other? If so, am I able to queue up bulk updates, or do I need to wait for one to finish before submitting the next?

Also, one downside of this approach would be that I wouldn't be able to leverage Elastic's query functionality. For the first use case I need to address, I can get by without it, but in future use cases, it would be immensely useful to be able to leverage Elastic's full-text search capability to determine which documents I need to tag in which manner.

system · July 31, 2021, 5:32am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
