I am on my 3rd attempt at re-indexing an index. The re-index task reliably demonstrates the following failure mode:
- The re-index task starts with high throughput, processing ~12,000 documents per second.
- After ~12-24 hours the task suddenly slows down by an order of magnitude, to ~1,500 documents per second.
- After ~24 hours at the slow rate, throughput may recover for a period of ~3-6 hours, but I have only seen this once, so it may be a fluke.
This has been difficult to troubleshoot because the issue is not that the re-index is slow; it's that it starts fast and then slows down without any intervention on my part.
Information
Indexes
Source Index
Some information about the source index:
- 2.4TB of storage
- 9.1 Billion Documents
- 6 Shards
- 1 Replica
This is not an ideal shard size; one goal of this re-index is to increase the shard count.
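(Rough math: assuming the 2.4TB figure is the primary store size, that is about 400GB per shard across 6 shards today, versus about 40GB per shard across the destination's 60 shards.)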
Destination Index
Here are some settings on the destination index:
```
routing.allocation.include._tier_preference: "data_content"
refresh_interval: -1
number_of_shards: 60
translog.durability: "async"
number_of_replicas: 0
```
Nothing is writing to the destination index outside of the re-index task, so it has refreshes disabled (`refresh_interval: -1`) and replicas set to 0.
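For reference, the dynamic settings above can be applied (or re-applied) with the index settings API, roughly like the sketch below; the index name is a placeholder, and `number_of_shards` is omitted because it is fixed at index creation.

```
PUT destination-index-name/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0,
    "translog.durability": "async",
    "routing.allocation.include._tier_preference": "data_content"
  }
}
```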
Cluster
I am running a cluster in Elastic Cloud with 19 nodes total. Here is the cluster information:
- 9 "Hot" Data Tier; `aws.data.highio.i3`: 58GB RAM, 1.69TB Disk
- 3 "Warm" Data Tier; `aws.data.highstorage.d3`: 8GB RAM, 1.48TB Disk
- 3 Coordinating; `aws.coordinating.m5d`
- 3 Master; `aws.master.r5d`
- 1 Kibana; `aws.kibana.r5d`
My understanding is that on Elastic Cloud I cannot increase the size of my "hot" or "warm" instances any further; I can only add more instances.
Before I started re-indexing I had 6 "Hot" instances but added 3 more for additional storage space.
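For reference, node roles and sizes can be double-checked with the cat nodes API, e.g.:

```
GET _cat/nodes?v&h=name,node.role,heap.max,ram.max,disk.total,disk.used_percent
```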
Re-Index Task
The goal of the re-index is to switch from the now-deprecated `dateOptionalTime` date format to `date_optional_time`. This is a blocker for upgrading our cluster from Elasticsearch 7.17 to 8.x.
The re-index task is running on a coordinating instance.
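For illustration, the relevant part of the destination mapping looks roughly like this, where the field name `timestamp` is just a placeholder for our actual date fields (the source mapping declares the same fields with `"format": "dateOptionalTime"`):

```
{
  "properties": {
    "timestamp": {
      "type": "date",
      "format": "date_optional_time"
    }
  }
}
```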
Here is the status of the latest (still running) re-index task:
```
{
  // ...
  "action" : "indices:data/write/reindex",
  "status" : {
    "total" : 9123281565,
    "updated" : 1800710708,
    "created" : 1384292,
    "deleted" : 0,
    "batches" : 1802096,
    "version_conflicts" : 0,
    "noops" : 0,
    "retries" : {
      "bulk" : 0,
      "search" : 0
    },
    "throttled_millis" : 1491643,
    "requests_per_second" : -1.0,
    "throttled_until_millis" : 0
  },
  "description" : "reindex from [source-index-name] to [destination-index-name][_doc]",
  "start_time_in_millis" : 1709747775686,
  "running_time_in_nanos" : 173147997512004,
  "cancellable" : true,
  "cancelled" : false,
  "headers" : { }
}
```
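For context, that status output comes from the tasks API, e.g.:

```
GET _tasks?detailed=true&actions=*reindex
```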
I temporarily set `requests_per_second` to a non-`-1` value, which is why `throttled_millis` is non-zero.
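(Sanity-checking the counters: `updated` + `created` ≈ 1.80 billion documents over `running_time_in_nanos` ≈ 173,000 seconds works out to an average of roughly 10,400 documents/second so far, and ~1.8 billion documents across ~1.8 million `batches` means the scroll batches are the default ~1,000 documents each.)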
- The re-index task started Wednesday at 9:30 AM and was reliably processing at a rate of ~12,000 documents/second.
- At 6 AM on Thursday morning the throughput suddenly dropped to ~1,500 documents/second. No changes to the cluster or task occurred around this time.
Before the re-index I did the following tuning:
- Destination Index: Set `replicas` to `0`
- Destination Index: Set `refresh_interval` to `-1`
I have since done the following tuning:
- Task: Set `requests_per_second` to `-1`
  - This was the default, but I had temporarily set it to a throttled (non-`-1`) value for troubleshooting (rethrottle call sketched after this list).
- Destination Index: Set `translog.durability` to `async`
  - Done just a few hours ago. Increased throughput from ~1,500 docs/sec to ~1,600 docs/sec.
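Setting `requests_per_second` back to `-1` on the running task was done via the reindex rethrottle API, something like the following (the task ID is a placeholder):

```
POST _reindex/<node_id>:<task_id>/_rethrottle?requests_per_second=-1
```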
The task is currently processing ~1,600 documents/second, which is still an order of magnitude slower than the initial re-index speed.
The re-index running at the current slow rate will take ~60 days to complete.
Were it to run at the initial speed it would take ~6 days.
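(Rough math behind those estimates: ~9.12 billion total documents minus ~1.80 billion already processed leaves ~7.3 billion remaining; at ~1,500 documents/second that is roughly 56 days, versus roughly 7 days at ~12,000 documents/second.)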
My question for y'all is this: Why did the re-index slow down and what can I do to speed it up again?
I contacted Elastic Support about this but have not gotten a straight answer to the above question, and I have already implemented all of the suggestions they had:
- Disable replicas on the destination index.
- Set `refresh_interval` to `30s` (or `-1`).
- Set `requests_per_second` to `-1`.
- Set `translog.durability` to `async` (suggestion from this blog post).
I'm happy to provide any more information y'all need from me to help troubleshoot. Thank you for your time.