I have an index on a 3-node cluster, with 3 primary shards and 1 replica per shard:
index shard prirep state node
source-index 0 p STARTED eck-elasticsearch-es-default-2
source-index 0 r STARTED eck-elasticsearch-es-default-1
source-index 1 p STARTED eck-elasticsearch-es-default-0
source-index 1 r STARTED eck-elasticsearch-es-default-1
source-index 2 p STARTED eck-elasticsearch-es-default-0
source-index 2 r STARTED eck-elasticsearch-es-default-2
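(For reference, the allocation above comes from a _cat/shards request along these lines; a sketch, with the header list trimmed to the relevant columns:)

GET _cat/shards/source-index?v&h=index,shard,prirep,state,node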
I need to use the Reindex API to remove some fields from the existing documents. I create the new index with some recommended settings to increase indexing speed, as such:
PUT /destination-index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {...}
  },
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0,
    "refresh_interval": -1
  }
}
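For completeness, these two settings would normally be reset once the reindex finishes, e.g. with a settings update like this (a sketch; null restores refresh_interval to its default, and a replica count of 1 matches the source index):

PUT /destination-index/_settings
{
  "index": {
    "number_of_replicas": 1,
    "refresh_interval": null
  }
}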
I run the reindex request as follows:
POST _reindex?wait_for_completion=false&slices=auto
{
  "source": {
    "index": "source-index",
    "_source": "${varList}"
  },
  "dest": {
    "index": "destination-index"
  }
}
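Since wait_for_completion=false returns a task id, progress can be followed through the task management API, for example:

GET _tasks?detailed=true&actions=*reindex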
When I run this request, it creates 3 slices that run in parallel. This matches the Automatic slicing docs, which state: “This setting (slices=auto) will use one slice per shard, up to a certain limit”. Given the 3 primary shards shown above, that all makes complete sense.
What doesn't make sense is that when I check the child tasks of the main reindex action, every child task with the action indices:data/write/reindex is running on the same node:
action parent_task_id type running_time node
indices:data/write/reindex l7bY43OiRGquG6KcBAa4Jw:327586044 transport 15.3m eck-elasticsearch-es-default-0
indices:data/write/reindex l7bY43OiRGquG6KcBAa4Jw:327586044 transport 15.3m eck-elasticsearch-es-default-0
indices:data/write/reindex l7bY43OiRGquG6KcBAa4Jw:327586044 transport 15.3m eck-elasticsearch-es-default-0
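(That table comes from a _cat/tasks request along these lines; a sketch, with columns trimmed to the relevant ones:)

GET _cat/tasks?v&h=action,parent_task_id,type,running_time,node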
This feels unexpected given the documentation's wording for using slices with _reindex. What I would expect is for the reindex action to utilize all 3 nodes, each using a thread per primary shard to write to a corresponding primary shard in the destination index. Something like:
source primary shard 1 on node A reindexing to destination primary shard 1 on node A
source primary shard 2 on node B reindexing to destination primary shard 2 on node B
etc...
I bring this up because the reindexing process feels slow and unpredictable with each test run I do. I'm curious why one node handles all 3 slices instead of the reindexing being distributed across every node to spread the load. Having all 3 slices processed by the same node at the same time is much slower than a reindex of the same number of documents without slices.
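For what it's worth, manual slicing would be one way to control where each slice is coordinated: each slice is submitted as its own request, potentially to a different coordinating node. A sketch for slice 0 of 3 (the other two requests would use "id": 1 and "id": 2):

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "source-index",
    "slice": { "id": 0, "max": 3 },
    "_source": "${varList}"
  },
  "dest": {
    "index": "destination-index"
  }
}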