Hi I need to run approximately 30,000 updateByQuery's against my index which has about 35 million decently large documents, sequentially. I'm currently setting refresh:true, and the eta for my script to complete all 30,000 queries is about 2 weeks from now.... So i want to speed things up.
The update painless script updates a field on _source that does not need to be queried against, but each update step will need the latest copy of _source when if it pulls in the same document as a previous query.
Basically my situation is like so:
query index
for each document returned, if _source.nonQueriedField doesn't already have tag x, add tag x to _source.nonQueriedField
repeat for each query / tag combination
Since my queries don't need the latest copy of _source at query time, only at update time, do I still need to set refresh: true? Or put another way, does an updated document ever have multiple copies of _source floating around in the index, between refreshes?
It might be faster to read all documents using a scroll query and determine which tags should be added in code to each one before you update directly by ID through bulk requests. As this approach will update each document exactly once you can likely run many processes in parallel on different subsets of data and no refresh is required.
Oh this is a really interesting idea, thank you! Could you elaborate on how scroll would enable me to parallelize this work? Is it simply that I I keep the scroll active for x amount of time, and in that time, spin up threads locally to handle subsections of that scroll partition and post bulk updates independently of each other? If so, am I able to queue up bulk updates, or do I need to wait for one to finish before submitting the next?
Also, one down side of this approach would be that I wouldn't be able to leverage elastic's query functionality. For the first use-case I need to address, I can get by without it, but in future use-cases, it would be immensely useful to be able to leverage Elastic's full text search capability to determine which documents I need to tag in which manner.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.