Why does translog phase during index recovery take so long?

Hi guys,

Yesterday I had to restart my cluster. Afterwards, my one index, "atlas20160713" was stuck in INITIALIZING phase for hours. When I went to look at why it was taking so long, I saw that it was in the translog phase for over an hour. This index was about 36GB when it was being recovered. **The part that I DON"T UNDERSTAND is that while this index was in translog phase, NO INDEXING was happening on this index at all. Why is that? Why would replication which should be a background action prevent indexing from occuring? Can someone help me understand why?

"atlas20160713" : {
"shards" : [ {
"id" : 0,
"type" : "REPLICA",
"stage" : "DONE",
"primary" : false,
"start_time_in_millis" : 1468428894360,
"stop_time_in_millis" : 1468444103560,
"total_time_in_millis" : 15209360,
"source" : {
"id" : "rLwpMbfZQhWWKm3o1e83yA",
"host" : "coplhglc001",
"transport_address" : "inet[coplhglc001.atlas.cm.com/10.128.6.183:9300]",
"ip" : "10.128.6.183",
"name" : "coplhglc001.atlas.cm.com"
},
"target" : {
"id" : "9Hr4KGnySnm6X1u2gy0_Ig",
"host" : "coplhglc002",
"transport_address" : "inet[coplhglc002.atlas.cm.com/10.128.6.184:9300]",
"ip" : "10.128.6.184",
"name" : "coplhglc002.atlas.cm.com"
},
"index" : {
"size" : {
"total_in_bytes" : 36043471688,
"reused_in_bytes" : 0,
"recovered_in_bytes" : 36043471688,
"percent" : "100.0%"
},
"files" : {
"total" : 178,
"reused" : 0,
"recovered" : 178,
"percent" : "100.0%"
},
"total_time_in_millis" : 4023341,
"source_throttle_time_in_millis" : 1854424,
"target_throttle_time_in_millis" : 1826150
},
"translog" : {
"recovered" : 86314848,
"total" : 86314848,
"percent" : "100.0%",
"total_on_start" : 356461,
"total_time_in_millis" : 11176799
},
"start" : {
"check_index_time_in_millis" : 0,
"total_time_in_millis" : 758
}

1 Like

Anyone ever run into this issue?

I'm not really an expert in this. I know it'd help to know what version you are using.

I was hoping that a real expert would respond when I saw this the first time but none has. Sorry.

There have been bugs that slow down translog phase in the past ... are you using the latest ES (2.3.4)?

During the translog phase, recently indexed documents since the last flush are re-indexed ... if you have an unusually large translog that can make things slower. Have you changed any translog settings?

Currently running ES 1.7
I have not changed any translog settings.
I had one of these recoveries this morning that prevented any new documents from being indexed for about 13 minutes. I checked the translog size and it was about 500MB. I part that boggles my mind is why aren't new documents being indexed? Even documents that aren't being indexed into the recovery index.

Could you try issuing an explicit _flush on all your indexes, maybe one index at a time so you can observe the performance impact?

There are a couple of issues here that work together. This is how 1.x recovery works:

  1. We prevent the translog from flushing, which means it will keep all operations since the start of the recovery
  2. We copy over a point in time snapshot the lucene index files. We throttle this so it make take long.
  3. We replay everything in the translog to the target replica, making sure that everything that was indexed since we took the lucene index snapshot is replayed.
  4. We stop indexing on the primary and replay everything that was indexed while we were doing step 3.

I suspect step 4 is the source of your troubles, correct?

The ES 2.x we became smarter and removed step 4 all together. If this is your issue, it will be gone if you upgrade.

Yes step 4 is definitely my problem. Thank you @bleskes. I'll look into upgrading.