Why does translog phase during index recovery take so long?

Melanie_Zamora · July 14, 2016, 7:51pm

Hi guys,

Yesterday I had to restart my cluster. Afterwards, my one index, "atlas20160713" was stuck in INITIALIZING phase for hours. When I went to look at why it was taking so long, I saw that it was in the translog phase for over an hour. This index was about 36GB when it was being recovered. **The part that I DON"T UNDERSTAND is that while this index was in translog phase, NO INDEXING was happening on this index at all. Why is that? Why would replication which should be a background action prevent indexing from occuring? Can someone help me understand why?

"atlas20160713" : {
"shards" : [ {
"id" : 0,
"type" : "REPLICA",
"stage" : "DONE",
"primary" : false,
"start_time_in_millis" : 1468428894360,
"stop_time_in_millis" : 1468444103560,
"total_time_in_millis" : 15209360,
"source" : {
"id" : "rLwpMbfZQhWWKm3o1e83yA",
"host" : "coplhglc001",
"transport_address" : "inet[coplhglc001.atlas.cm.com/10.128.6.183:9300]",
"ip" : "10.128.6.183",
"name" : "coplhglc001.atlas.cm.com"
},
"target" : {
"id" : "9Hr4KGnySnm6X1u2gy0_Ig",
"host" : "coplhglc002",
"transport_address" : "inet[coplhglc002.atlas.cm.com/10.128.6.184:9300]",
"ip" : "10.128.6.184",
"name" : "coplhglc002.atlas.cm.com"
},
"index" : {
"size" : {
"total_in_bytes" : 36043471688,
"reused_in_bytes" : 0,
"recovered_in_bytes" : 36043471688,
"percent" : "100.0%"
},
"files" : {
"total" : 178,
"reused" : 0,
"recovered" : 178,
"percent" : "100.0%"
},
"total_time_in_millis" : 4023341,
"source_throttle_time_in_millis" : 1854424,
"target_throttle_time_in_millis" : 1826150
},
"translog" : {
"recovered" : 86314848,
"total" : 86314848,
"percent" : "100.0%",
"total_on_start" : 356461,
"total_time_in_millis" : 11176799
},
"start" : {
"check_index_time_in_millis" : 0,
"total_time_in_millis" : 758
}

Melanie_Zamora · July 25, 2016, 5:19pm

Anyone ever run into this issue?

nik9000 · July 25, 2016, 5:43pm

I'm not really an expert in this. I know it'd help to know what version you are using.

I was hoping that a real expert would respond when I saw this the first time but none has. Sorry.

mikemccand · July 25, 2016, 7:08pm

There have been bugs that slow down translog phase in the past ... are you using the latest ES (2.3.4)?

During the translog phase, recently indexed documents since the last flush are re-indexed ... if you have an unusually large translog that can make things slower. Have you changed any translog settings?

Melanie_Zamora · July 25, 2016, 7:25pm

Currently running ES 1.7
I have not changed any translog settings.
I had one of these recoveries this morning that prevented any new documents from being indexed for about 13 minutes. I checked the translog size and it was about 500MB. I part that boggles my mind is why aren't new documents being indexed? Even documents that aren't being indexed into the recovery index.

nik9000 · July 25, 2016, 7:49pm

Could you try issuing an explicit _flush on all your indexes, maybe one index at a time so you can observe the performance impact?

bleskes · July 25, 2016, 8:17pm

There are a couple of issues here that work together. This is how 1.x recovery works:

We prevent the translog from flushing, which means it will keep all operations since the start of the recovery
We copy over a point in time snapshot the lucene index files. We throttle this so it make take long.
We replay everything in the translog to the target replica, making sure that everything that was indexed since we took the lucene index snapshot is replayed.
We stop indexing on the primary and replay everything that was indexed while we were doing step 3.

I suspect step 4 is the source of your troubles, correct?

The ES 2.x we became smarter and removed step 4 all together. If this is your issue, it will be gone if you upgrade.

Melanie_Zamora · July 25, 2016, 8:18pm

Yes step 4 is definitely my problem. Thank you @bleskes. I'll look into upgrading.

Topic		Replies	Views
Why translog phase in recovery took a long time Elasticsearch	1	337	June 23, 2020
Bulk indexing failure/timeout while shard in translog recovery state Elasticsearch	1	773	July 5, 2017
Recover index takes lots of time in translog Elasticsearch	2	516	March 23, 2018
EL7.11 Index recoveries stuck at translog stage Elasticsearch	1	582	April 7, 2021
ES 7.5 translog recovery is extremely slow Elasticsearch	15	3341	February 19, 2020

Why does translog phase during index recovery take so long?

Related topics