Does update_by_query always reindex the entire document?

ahummel25 · January 30, 2019, 1:35pm

I'm trying to figure out whether the Elasticsearch _update_by_query endpoint reindexes entire documents. I ran a batch process that generated and ran thousands of update_by_query requests. CPU usage climbed after some time, so I stopped running them. That was about a week ago, and CPU usage is still abnormally high.

When I check the nodes in my cluster, one of them has an unusually high CPU percentage. I checked the hot threads for that node, and it appears to still be processing update tasks, even though I stopped running the updates over a week ago. How could this node still be processing updates? My guess is that it's reindexing documents that were affected by the updates.

Please share any thoughts.

nik9000 · January 30, 2019, 1:59pm

The tasks API should tell you if you are still running the _update_by_query. IIRC one of the funny things about _update_by_query is that it'll perform a noop update of all documents if you give it an empty body. This might be what is going on here.
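A minimal sketch of that check in Python, assuming the cluster is reachable over plain HTTP without authentication. The base URL and the function names are illustrative; `actions` and `detailed` are query parameters of the tasks API.

```python
import json
import urllib.parse
import urllib.request


def by_query_tasks_url(base_url: str) -> str:
    """Build a tasks API URL that lists only actions ending in 'byquery'."""
    # safe="*" keeps the wildcard readable instead of encoding it as %2A.
    query = urllib.parse.urlencode(
        {"actions": "*byquery", "detailed": "true"}, safe="*"
    )
    return f"{base_url.rstrip('/')}/_tasks?{query}"


def fetch_by_query_tasks(base_url: str) -> dict:
    """Call the tasks API and return the parsed JSON response."""
    with urllib.request.urlopen(by_query_tasks_url(base_url), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Assumed address; replace with a node of the cluster being inspected.
    print(by_query_tasks_url("http://localhost:9200"))
```

If the response contains no tasks under any node, no update by query (or delete by query) is currently running.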

ahummel25 · January 30, 2019, 2:06pm

The tasks running on the high-CPU node are listed below.

I don't think these are _update_by_query tasks, although the hot threads suggested otherwise.

"C_i25yS5SWSdrv3NPEoHEA": {
    "name": "C_i25yS",
    "roles": [
        "data",
        "ingest"
    ],
    "tasks": {
        "C_i25yS5SWSdrv3NPEoHEA:186466469": {
            "node": "C_i25yS5SWSdrv3NPEoHEA",
            "id": 186466469,
            "type": "transport",
            "action": "cluster:monitor/tasks/lists",
            "start_time_in_millis": 1548856970331,
            "running_time_in_nanos": 2978262,
            "cancellable": false,
            "headers": {}
        },
        "C_i25yS5SWSdrv3NPEoHEA:186466471": {
            "node": "C_i25yS5SWSdrv3NPEoHEA",
            "id": 186466471,
            "type": "direct",
            "action": "cluster:monitor/tasks/lists[n]",
            "start_time_in_millis": 1548856970333,
            "running_time_in_nanos": 101917,
            "cancellable": false,
            "parent_task_id": "C_i25yS5SWSdrv3NPEoHEA:186466469",
            "headers": {}
        },
        "C_i25yS5SWSdrv3NPEoHEA:186466470": {
            "node": "C_i25yS5SWSdrv3NPEoHEA",
            "id": 186466470,
            "type": "netty",
            "action": "internal:discovery/zen/publish/commit",
            "start_time_in_millis": 1548856970332,
            "running_time_in_nanos": 1706067,
            "cancellable": false,
            "headers": {}
        }
    }
}

ahummel25 · January 30, 2019, 2:10pm

The action on the task would be something like indices:data/write/update/byquery if it was an update by query still running.
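That check can be automated. The sketch below scans the `nodes` mapping of a tasks API response for that action name. The function name is illustrative, and the sample data is a trimmed copy of the task listing posted earlier in this thread.

```python
UPDATE_BY_QUERY_ACTION = "indices:data/write/update/byquery"


def running_update_by_query(nodes: dict) -> list:
    """Return the IDs of tasks whose action is an update by query.

    `nodes` maps node ID -> node info, where each node info holds a
    "tasks" mapping of task ID -> task description.
    """
    found = []
    for node in nodes.values():
        for task_id, task in node.get("tasks", {}).items():
            # startswith also matches suffixed forms such as "...[n]".
            if task.get("action", "").startswith(UPDATE_BY_QUERY_ACTION):
                found.append(task_id)
    return found


# Trimmed copy of the listing from this thread: only the actions are kept.
nodes = {
    "C_i25yS5SWSdrv3NPEoHEA": {
        "tasks": {
            "C_i25yS5SWSdrv3NPEoHEA:186466469": {
                "action": "cluster:monitor/tasks/lists"
            },
            "C_i25yS5SWSdrv3NPEoHEA:186466471": {
                "action": "cluster:monitor/tasks/lists[n]"
            },
            "C_i25yS5SWSdrv3NPEoHEA:186466470": {
                "action": "internal:discovery/zen/publish/commit"
            },
        }
    }
}

if __name__ == "__main__":
    print(running_update_by_query(nodes))
```

For the listing above the result is an empty list, which agrees with the conclusion that no update by query is running on that node.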

ahummel25 · January 31, 2019, 2:51pm

@nik9000 Any other thoughts here as to why CPU would still be so high?

nik9000 · January 31, 2019, 7:18pm

Right, update by query isn't running. I'd check the hot_threads API. If the CPU is super high on any one task it'll jump right out.
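A minimal sketch of requesting hot threads for a single node, again assuming plain HTTP without authentication. The function names and the base URL are illustrative; `threads` and `interval` are parameters of the nodes hot threads API, and the response is plain text rather than JSON.

```python
import urllib.parse
import urllib.request


def hot_threads_url(base_url: str, node_id: str,
                    threads: int = 3, interval: str = "500ms") -> str:
    """Build the hot threads URL for one node."""
    query = urllib.parse.urlencode({"threads": threads, "interval": interval})
    node = urllib.parse.quote(node_id, safe="")
    return f"{base_url.rstrip('/')}/_nodes/{node}/hot_threads?{query}"


def fetch_hot_threads(base_url: str, node_id: str) -> str:
    """Return the plain-text hot threads report for one node."""
    url = hot_threads_url(base_url, node_id)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Assumed address; the node ID is the one from the task listing above.
    print(hot_threads_url("http://localhost:9200", "C_i25yS5SWSdrv3NPEoHEA"))
```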

ahummel25 · January 31, 2019, 8:41pm

@nik9000 The hot_threads output isn't particularly helpful to me; I'm not sure what to take away from it. Here's what it reports for the high-CPU node. Nothing really stands out to me. The response also constantly changes.

::: {C_i25yS}{C_i25yS5SWSdrv3NPEoHEA}{IlzCv7ROQSaMm6IGV_Z88A}{x.x.x.x}{x.x.x.x:9300}{zone=us-west-2b}
   Hot threads at 2019-01-31T20:38:49.441Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   68.3% (341.5ms out of 500ms) cpu usage by thread 'elasticsearch[C_i25yS][clusterApplierService#updateTask][T#1]'
     2/10 snapshots sharing following 13 elements
       org.elasticsearch.indices.store.IndicesStore$ShardActiveResponseHandler.lambda$allNodesResponded$2(IndicesStore.java:289)
       org.elasticsearch.indices.store.IndicesStore$ShardActiveResponseHandler$$Lambda$1704/1699952741.accept(Unknown Source)
       org.elasticsearch.cluster.service.ClusterApplierService.lambda$runOnApplierThread$0(ClusterApplierService.java:307)
       org.elasticsearch.cluster.service.ClusterApplierService$$Lambda$1706/1089848983.apply(Unknown Source)
       org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.apply(ClusterApplierService.java:156)
       org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:400)
       org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     8/10 snapshots sharing following 2 elements
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

ahummel25 · January 31, 2019, 8:46pm

For some reason the human and pretty URL params do not format the response at all.
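The hot threads API returns plain text rather than JSON, which is likely why those parameters have no visible effect. A minimal sketch for making the report easier to read: `unescape` converts a JSON-quoted string with literal `\n` sequences back into text, and `average_cpu` averages each thread's CPU share over several reports, which helps when the output changes from call to call. All function names are illustrative, and the regular expression assumes the "cpu usage by thread" line format seen in this thread.

```python
import json
import re
from collections import defaultdict

# Matches lines such as:
#   68.3% (341.5ms out of 500ms) cpu usage by thread 'elasticsearch[...][T#1]'
THREAD_LINE = re.compile(
    r"(\d+(?:\.\d+)?)% \([^)]*\) cpu usage by thread '([^']+)'"
)


def unescape(raw: str) -> str:
    """Decode a JSON-quoted report; return plain-text input unchanged."""
    return json.loads(raw) if raw.lstrip().startswith('"') else raw


def cpu_by_thread(report: str) -> dict:
    """Map thread name -> CPU percentage for one hot threads report."""
    return {name: float(pct) for pct, name in THREAD_LINE.findall(report)}


def average_cpu(reports: list) -> dict:
    """Average each thread's CPU share over several reports.

    A thread that is missing from a report counts as 0% for that report.
    """
    totals = defaultdict(float)
    for report in reports:
        for name, pct in cpu_by_thread(report).items():
            totals[name] += pct
    return {name: total / len(reports) for name, total in totals.items()}


if __name__ == "__main__":
    line = ("68.3% (341.5ms out of 500ms) cpu usage by thread "
            "'elasticsearch[C_i25yS][clusterApplierService#updateTask][T#1]'")
    print(cpu_by_thread(line))
```

In the report posted above, the busy thread belongs to ClusterApplierService, which applies cluster state updates on the node, so the `updateTask` in its name does not by itself point to _update_by_query.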

ahummel25 · February 7, 2019, 2:46pm

@nik9000 Do you see anything in the hot threads above that may be useful? Or anything else that may help resolve my issue?

system · March 7, 2019, 2:47pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
