Uncertain if Refresh is needed

Hi, I need to run approximately 30,000 update-by-query requests sequentially against my index, which has about 35 million decently large documents. I'm currently setting refresh: true, and the ETA for my script to complete all 30,000 queries is about 2 weeks from now, so I want to speed things up.

The Painless update script updates a field on _source that does not need to be queried against, but each update step will need the latest copy of _source if it pulls in the same document as a previous query.

Basically my situation is like so:

  1. query index
  2. for each document returned, if _source.nonQueriedField doesn't already have tag x, add tag x to _source.nonQueriedField
  3. repeat for each query / tag combination
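A minimal sketch of what one of these steps could look like as an update-by-query request body, built in Python. The field name nonQueriedField comes from the list above; the assumption that it holds an array of tags, the helper name build_update_by_query_body, and the example field someKeywordField are hypothetical and only there for illustration.

```python
# Painless script: add params.tag to the array only if it is missing.
# Assumption: nonQueriedField is an array of tag strings (or absent).
ADD_TAG_SCRIPT = """
if (ctx._source.nonQueriedField == null) {
  ctx._source.nonQueriedField = [];
}
if (ctx._source.nonQueriedField.contains(params.tag)) {
  ctx.op = 'noop';
} else {
  ctx._source.nonQueriedField.add(params.tag);
}
""".strip()


def build_update_by_query_body(query, tag):
    """Return a request body for POST /<index>/_update_by_query.

    `query` is any Elasticsearch query clause; `tag` is passed to the
    script as a parameter so the script source stays identical across
    all requests.
    """
    return {
        "query": query,
        "script": {
            "lang": "painless",
            "source": ADD_TAG_SCRIPT,
            "params": {"tag": tag},
        },
    }


# Hypothetical example: tag every document whose keyword field matches.
body = build_update_by_query_body({"term": {"someKeywordField": "example"}}, "x")
```

Setting ctx.op to 'noop' skips the write for documents that already carry the tag, which keeps repeated runs from rewriting unchanged documents.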

Since my queries don't need the latest copy of _source at query time, only at update time, do I still need to set refresh: true? Or put another way, does an updated document ever have multiple copies of _source floating around in the index, between refreshes?

Update by query runs a search to find what to update, and it gets the source from the result of that search. So it needs a refresh to see the latest version of a document.
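To make that concrete, here is a toy in-memory model in Python. It is not Elasticsearch and uses no Elasticsearch API; the class ToyIndex and everything in it are made up for illustration. The only assumptions it encodes are the ones described above: searches see the state as of the last refresh, and the update works from the source the search returned. It also counts a stale hit as a conflict instead of overwriting, on the assumption that this is roughly how a version check would behave.

```python
import copy


class ToyIndex:
    """Toy model only: searches see the last refreshed snapshot."""

    def __init__(self, docs):
        self.live = copy.deepcopy(docs)        # latest written state
        self.searchable = copy.deepcopy(docs)  # state visible to searches

    def refresh(self):
        # Make all writes so far visible to subsequent searches.
        self.searchable = copy.deepcopy(self.live)

    def update_by_query(self, matches, tag, refresh=False):
        """Add `tag` to matching docs, using the source found by the search.

        Returns the number of hits skipped because they were stale.
        """
        conflicts = 0
        for doc_id, hit in self.searchable.items():
            if not matches(hit["source"]):
                continue
            if hit["version"] != self.live[doc_id]["version"]:
                conflicts += 1  # the search result is older than the live doc
                continue
            if tag not in hit["source"]["tags"]:
                source = copy.deepcopy(hit["source"])
                source["tags"].append(tag)
                self.live[doc_id] = {"version": hit["version"] + 1, "source": source}
        if refresh:
            self.refresh()
        return conflicts


def run(refresh):
    """Run two overlapping updates and report conflicts and final tags."""
    index = ToyIndex({"1": {"version": 1, "source": {"title": "a b", "tags": []}}})
    index.update_by_query(lambda s: "a" in s["title"], "x", refresh=refresh)
    conflicts = index.update_by_query(lambda s: "b" in s["title"], "y", refresh=refresh)
    return conflicts, index.live["1"]["source"]["tags"]
```

In this model, run(False) leaves the second update working from a stale hit, so tag y is never added, while run(True) applies both tags.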

Dang... Thanks for the quick reply @nik9000. Can you recommend any alternative approaches to speed things up?

A more detailed exploration of my situation is like so:

  1. I have ~30 million medium to large documents
  2. I have about 30,000 tags
  3. Each tag is basically a unique identifier for a set of n roughly synonymous keywords
  4. For each tag, find all documents that have any of those synonyms in one of 10 keyword fields
  5. For each document, add that unique tag, and a series of other related tags (all passed into the script as params) in a new field

This is useful because later I'll need to run aggregation queries on these tags across the entire index
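Step 4 above could be expressed as a bool query with one terms clause per keyword field. The sketch below builds such a query in Python; the function name and the field names kw1 and kw2 are placeholders, since the real field names are not given here.

```python
def build_synonym_query(synonyms, keyword_fields):
    """Match documents where any keyword field contains any synonym.

    Uses one `terms` clause per field inside a `bool`/`should`, so a
    document matches as soon as a single field holds a single synonym.
    """
    terms = sorted(synonyms)  # sorted only to make the output deterministic
    return {
        "bool": {
            "should": [{"terms": {field: terms}} for field in keyword_fields],
            "minimum_should_match": 1,
        }
    }


# Placeholder synonyms and field names for one tag.
query = build_synonym_query({"car", "automobile"}, ["kw1", "kw2"])
```

Because terms on a keyword field is an exact match, this assumes the synonyms are stored in the same form (case, spacing) as the indexed values.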

It might be faster to read all documents using a scroll query and determine in code which tags should be added to each one, before you update directly by ID through bulk requests. As this approach will update each document exactly once, you can likely run many processes in parallel on different subsets of data, and no refresh is required.
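A rough sketch of that approach in Python, assuming the official elasticsearch client package and its helpers.scan and helpers.bulk functions. The function names, the tag_synonyms mapping (tag to set of synonyms), and the field names are hypothetical. The tagging logic is plain Python and needs no cluster; only retag_slice touches Elasticsearch, and it uses a sliced scroll so each worker process reads a disjoint subset of the index.

```python
def as_values(value):
    """Normalize a field value (missing, scalar, or list) to a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def tags_for_document(source, tag_synonyms, keyword_fields):
    """Return every tag with a synonym present in any keyword field."""
    values = set()
    for field in keyword_fields:
        values.update(as_values(source.get(field)))
    return [tag for tag, synonyms in tag_synonyms.items() if values & set(synonyms)]


def build_bulk_actions(hits, tag_synonyms, keyword_fields, tag_field):
    """Yield one partial-update action per hit that gains new tags."""
    for hit in hits:
        source = hit["_source"]
        existing = as_values(source.get(tag_field))
        matched = tags_for_document(source, tag_synonyms, keyword_fields)
        new_tags = [tag for tag in matched if tag not in existing]
        if not new_tags:
            continue  # nothing to change, so skip the write entirely
        yield {
            "_op_type": "update",
            "_index": hit["_index"],
            "_id": hit["_id"],
            "doc": {tag_field: existing + new_tags},
        }


def retag_slice(client, index, slice_id, max_slices, tag_synonyms,
                keyword_fields, tag_field):
    """Process one scroll slice; run one call per worker process.

    Assumption: `client` is an elasticsearch.Elasticsearch instance and
    max_slices is at least 2.
    """
    from elasticsearch import helpers  # imported here so the rest runs without it

    query = {"slice": {"id": slice_id, "max": max_slices},
             "query": {"match_all": {}}}
    hits = helpers.scan(client, index=index, query=query)
    actions = build_bulk_actions(hits, tag_synonyms, keyword_fields, tag_field)
    return helpers.bulk(client, actions)
```

Each worker would call retag_slice with its own slice_id from 0 to max_slices - 1. Since every document is written at most once and the tags are computed from the scrolled source, no refresh is needed between writes.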

Oh, this is a really interesting idea, thank you! Could you elaborate on how scroll would enable me to parallelize this work? Is it simply that I keep the scroll active for x amount of time, and in that time, spin up threads locally to handle subsections of that scroll partition and post bulk updates independently of each other? If so, am I able to queue up bulk updates, or do I need to wait for one to finish before submitting the next?

Also, one downside of this approach would be that I wouldn't be able to leverage Elastic's query functionality. For the first use case I need to address, I can get by without it, but in future use cases, it would be immensely useful to be able to leverage Elastic's full-text search capability to determine which documents I need to tag in which manner.
