I am noticing that performing many "update_by_query" calls is very fast initially, handling several thousand per second. As time progresses, though, throughput crawls to only a few per second. Looking at memory/CPU usage, it doesn't appear to be hitting any hardware bottleneck, so this could be a settings issue.
Are there settings I can check that could cause write requests to be fast at first but then slow to a crawl? Is there some "keep alive" that is holding a queue full even though requests have completed? Anything like that?
How large are the docs and how many? What sort of things are you updating? Are you using scripts? What version of Elasticsearch? What is the node size?
All good questions that I totally should have included initially.
Total source index size is 1,201,247 Documents, with ~30 fields, mostly keyword.
Target index size is the same Doc count, but with only 4 fields, all keyword.
One keyword field, a hash, is used for the aggregation: for all Documents that share the same hash, I gather the unique values of another keyword field, a Custodian.
Now that I have this "Global Owner" value (an array of keywords), I am performing an "update by query" to update a "GlobalOwner" keyword field for all Docs that match that hash value.
The "update by query" uses an inline script with parameters.
ES version is 6.4.3
All settings are default
Single-node cluster, single shard, running on a laptop: 6th-gen i5 @ 2.3GHz, 16GB RAM, NVMe SSD
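For concreteness, here is a minimal sketch of that loop using the official Python client. The field names (hash, Custodian, GlobalOwner) are the ones above; the index name, the exact script body, and the agg page sizes are my assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

INDEX = "source-index"  # hypothetical name

# Page through every unique hash with a composite aggregation
# (available since ES 6.1; a plain terms agg can't return ~338k buckets).
after_key = None
while True:
    composite = {
        "size": 1000,
        "sources": [{"hash": {"terms": {"field": "hash"}}}],
    }
    if after_key:
        composite["after"] = after_key

    resp = es.search(index=INDEX, body={
        "size": 0,
        "aggs": {
            "by_hash": {
                "composite": composite,
                "aggs": {
                    "custodians": {"terms": {"field": "Custodian", "size": 1000}}
                },
            }
        },
    })

    agg = resp["aggregations"]["by_hash"]
    for bucket in agg["buckets"]:
        owners = [b["key"] for b in bucket["custodians"]["buckets"]]
        # One update_by_query per hash, writing the "Global Owner"
        # array back to every Document that shares the hash.
        es.update_by_query(index=INDEX, body={
            "query": {"term": {"hash": bucket["key"]["hash"]}},
            "script": {
                "lang": "painless",
                "source": "ctx._source.GlobalOwner = params.owners",
                "params": {"owners": owners},
            },
        }, conflicts="proceed")

    after_key = agg.get("after_key")
    if after_key is None:
        break
```

With ~338k unique hashes, that is ~338k separate update_by_query calls, and it is this loop whose throughput degrades over time.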
Repeatedly updating the same documents is quite inefficient, as it can trigger a lot of small refreshes (every time a document about to be updated is found in the transaction log, a refresh is triggered), which are expensive and slow things down considerably. If you have a reasonably small number of hashes, I would instead recommend a process that reads all documents through a scroll, builds the resulting documents in memory, and then writes them back to the index.
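Sketching that suggestion with the same client and assumed index name: one scroll pass over the index, collecting the unique Custodians and the doc IDs per hash, so each Document can later be written exactly once:

```python
from collections import defaultdict
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

INDEX = "source-index"  # hypothetical name

owners_by_hash = defaultdict(set)   # hash -> unique Custodian values
ids_by_hash = defaultdict(list)     # hash -> doc IDs to rewrite later

# helpers.scan wraps the scroll API; fetch only the two fields needed.
for hit in helpers.scan(es, index=INDEX, size=5000,
                        query={"_source": ["hash", "Custodian"]}):
    src = hit["_source"]
    owners_by_hash[src["hash"]].add(src["Custodian"])
    ids_by_hash[src["hash"]].append(hit["_id"])

# owners_by_hash now holds the finished "GlobalOwner" array for each
# hash; the writes can happen in one bulk pass instead of per-query.
```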
In my example above, the ~1.2m Documents have ~338k unique hashes. Each of those hashes will end up with an array of values, which are then written back to the index. Each Document would only get updated once, though.
I see your point about building the Documents in memory and then writing them. My issue is that while this example is fairly small, the ultimate goal is to scale to an index that would be hundreds of millions of Documents. Even when I can narrow the pool for aggregation, there will be an initial "create global aggregate" phase that will have to run over the entire index.
To that end, I'm looking for any and all contributions to efficiency. Elasticsearch is able to calculate those ~338k global aggregate values in seconds, so the fact that writing them back takes so long strikes me as odd. But I'm new at this, so I probably just don't know what I'm talking about.
Got it. So if I can preserve the ID information and construct my own "bulk" updates, that will likely be my fastest option, provided I have an ocean of memory, yeah?
If I understand you correctly, you are saying that "update by query" does not update all Documents that match the query at once, but one at a time, each followed by an expensive refresh.
I initially looked past the "bulk" API because it requires individual Document IDs. But if I understand correctly, it queues a batch of updates that are applied together, with a single refresh for the whole batch. If that is the case, it becomes worth my trouble to keep track of ID information and use it to construct bulk updates in memory (as you mentioned) and push them to the index in batches.
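In sketch form, again with the Python client and assumed names, the batch would look something like this (each Document appears in exactly one partial-update action):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

INDEX = "source-index"  # hypothetical name

# Hypothetical output of the aggregation step: doc ID -> GlobalOwner array.
global_owners = {
    "doc-1": ["custodian-a", "custodian-b"],
    "doc-2": ["custodian-a"],
}

actions = (
    {
        "_op_type": "update",
        "_index": INDEX,
        "_type": "_doc",   # ES 6.x bulk actions still carry a type; "_doc" is assumed
        "_id": doc_id,
        "doc": {"GlobalOwner": owners},
    }
    for doc_id, owners in global_owners.items()
)

# helpers.bulk chunks the actions into batched _bulk requests.
helpers.bulk(es, actions, chunk_size=1000)
```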
Update by query uses the bulk API internally. If you send N operations in one bulk request and they all update the same document, nothing gets squashed; the document will see N individual updates.