Update initially fast, but then crawls

faZhift · February 17, 2019, 9:10pm

I am noticing that performing many "update_by_query" calls is very fast initially, handling several thousand a second. As time progresses though, this crawls to only a few per second. Looking at memory/CPU usage, it doesn't appear to be running up against any hardware bottlenecks, so this could be a settings issue.

Are there settings I can check that could lead to write requests being fast at first but then slowing to a crawl? Is there some "keep alive" that is holding a queue full, despite requests being completed? Anything like that?

warkolm · February 17, 2019, 10:46pm

How large are the docs and how many? What sort of things are you updating? Are you using scripts? What version of Elasticsearch? What is the node size?

faZhift · February 18, 2019, 12:15am

All good questions, that I totally should have included initially.

Total source index size is 1,201,247 Documents, with ~30 fields, mostly keyword.
Target index size is the same Doc count, but with only 4 fields, all keyword.

One keyword field, a hash, is the aggregation key: for all Documents that share the same hash, I gather every unique value of another keyword field, Custodian.
Once I have this "Global Owner" value (an array of keywords), I perform an "update by query" to set a "GlobalOwner" keyword field on all Docs that match that hash value.
The "update by query" uses an inline script with parameters.
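
For readers following along, here is a rough sketch of what such a request body might look like when built in Python. It is not the poster's actual request: the query field name `hash`, the parameter name `owners`, and the sample values are assumptions for illustration, and only `GlobalOwner` comes from the description above.

```python
import json


def build_update_by_query(hash_value, owners):
    """Build a body for POST /<index>/_update_by_query.

    The field name "hash" and the parameter name "owners" are assumed;
    the thread does not give the real mapping.
    """
    return {
        # Match every document that shares this hash value.
        "query": {"term": {"hash": hash_value}},
        # Inline Painless script; the value is passed as a parameter
        # so the script source stays identical across calls.
        "script": {
            "lang": "painless",
            "source": "ctx._source.GlobalOwner = params.owners",
            "params": {"owners": owners},
        },
    }


body = build_update_by_query("abc123", ["alice", "bob"])
print(json.dumps(body, indent=2))
```

Keeping the script source constant and varying only `params` avoids recompiling the script for every hash, although one request per hash still means roughly one request for each unique hash value.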

ES version is 6.4.3
All settings are default
Single-node, single-shard laptop
6th-gen i5 @2.3GHz, 16GB RAM, NVMe SSD

Christian_Dahlqvist · February 18, 2019, 8:14am

Repeatedly updating the same documents is quite inefficient, as it can result in a lot of small refreshes (every time a document to be updated is found in the transaction log, a refresh is triggered) that are expensive and slow things down considerably. If you have a reasonably small number of hashes, I would instead recommend creating a process that reads all documents through a scroll and builds the resulting documents in memory before writing them to the index.
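
A minimal sketch of the in-memory step, assuming the scroll has already produced an iterable of hits shaped like Elasticsearch search hits (`_id` and `_source`). The field names `hash` and `Custodian` and the sample data are assumptions; the scroll itself is left out so the example runs without a cluster.

```python
from collections import defaultdict


def collect_owners(hits):
    """Map each document ID to the sorted unique custodians of its hash.

    `hits` is any iterable of dicts with "_id" and "_source" keys,
    e.g. the pages of a scroll flattened into one stream.
    """
    owners_by_hash = defaultdict(set)
    ids_by_hash = defaultdict(list)
    for hit in hits:
        source = hit["_source"]
        owners_by_hash[source["hash"]].add(source["Custodian"])
        ids_by_hash[source["hash"]].append(hit["_id"])
    # Every document gets the full owner list of the hash it belongs to.
    return {
        doc_id: sorted(owners_by_hash[hash_value])
        for hash_value, doc_ids in ids_by_hash.items()
        for doc_id in doc_ids
    }


sample_hits = [
    {"_id": "1", "_source": {"hash": "h1", "Custodian": "alice"}},
    {"_id": "2", "_source": {"hash": "h1", "Custodian": "bob"}},
    {"_id": "3", "_source": {"hash": "h2", "Custodian": "carol"}},
]
owners_by_id = collect_owners(sample_hits)
print(owners_by_id)
```

This holds one entry per document ID plus one set per hash in memory, so it only suits data sets small enough for that.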

faZhift · February 18, 2019, 8:59am

In my example above, the ~1.2m Documents have ~338k unique hashes. Each of those hashes will end up with an array of values, which are then written back to the index. Each Document would only get updated once though.

I see your point about building the Documents in memory and then writing them. My issue is that while this example is fairly small, the ultimate goal is to scale to an index that would be hundreds of millions of Documents. Even when I can narrow the pool for aggregation, there will be an initial "create global aggregate" phase that will have to run over the entire index.

To that end, I'm looking for any and all contributions to efficiency. Elasticsearch is able to calculate those ~338k global aggregate values in seconds, so the fact that it takes so long to write them back strikes me as odd. But I'm new at this, so I probably just don't know what I'm talking about.

Christian_Dahlqvist · February 18, 2019, 9:21am

If you are doing an update-by-query, each document is processed individually. Elasticsearch does not aggregate and write each document once.

faZhift · February 18, 2019, 9:30am

Got it. So then if I could preserve the ID information, and reconstruct my own "bulk" updates, that will likely be my fastest option, provided I have an ocean of memory, yeah?

Christian_Dahlqvist · February 18, 2019, 10:02am

I am not sure I understand what you mean. Could you perhaps provide some additional details?

faZhift · February 18, 2019, 10:26am

If I understand you correctly, you are saying that "update by query" does not update all Documents at once that match the query, but one at a time, followed by an expensive refresh.

I initially looked past the "bulk" API, as it required individual Document IDs. But if I understand that correctly, it will queue a bunch of updates into the index that do happen at once, with a single refresh for the bulk update. If that is the case, then it becomes worth my trouble to keep up with ID information, and use it to construct bulk updates in memory (as you mentioned) and push them in batches to the index.

Unless I'm misunderstanding.

Christian_Dahlqvist · February 18, 2019, 12:17pm

Update by query uses the bulk API. If you send N operations in bulk and these all update the same document, nothing will be squashed and the document will see N individual updates.
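
Since the bulk API does not squash repeated operations, one option is to merge them on the client so each document ID appears only once. The sketch below builds the newline-delimited body for `POST /_bulk` under stated assumptions: the index name `target-index` and the type `_doc` are hypothetical, and a later update to the same ID simply overwrites an earlier one.

```python
import json


def build_bulk_body(updates, index="target-index", doc_type="_doc"):
    """Return an NDJSON bulk body with one partial update per document ID.

    `updates` is an iterable of (doc_id, owners) pairs. The index and
    type names are placeholders, not values from the thread.
    """
    merged = {}
    for doc_id, owners in updates:
        # Last write wins, so a document is never updated twice.
        merged[doc_id] = owners
    lines = []
    for doc_id, owners in merged.items():
        action = {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"doc": {"GlobalOwner": owners}}))
    # The bulk API requires every line, including the last, to end in "\n".
    return "".join(line + "\n" for line in lines)


bulk_body = build_bulk_body([("1", ["a"]), ("1", ["a", "b"]), ("2", ["c"])])
print(bulk_body, end="")
```

Here document `1` appears twice in the input but produces a single update action, so the index sees two updates instead of three.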

