Bulk indexing failure/timeout while shard in translog recovery state

Hey, we've been seeing this issue for some time, and have only recently gotten to the point where we can reliably recreate it in our development environment.

At a high level: when ES shards are recovering (after a node failure, or for some other reason), and a recovering shard is taking a meaningful write load (10k to 40k writes per second) during the "translog" stage of the recovery, all indexing into that index fails. It doesn't seem to matter whether the shard is a primary or a replica.

Additionally, we've seen scenarios where, under indexing load, the time it takes for the translog stage to finish and complete a shard relocation or initialization can be upwards of 30-60 minutes for ~30G shards.

In the dev environment where I'm able to recreate the issue, our setup is the following:

ES 1.7.4 - 10 shards per index (x2 with one replica) - running on 6 r3.2xlarge nodes with 500G of EBS (using enhanced networking).
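
Nothing fancy at the index level; a hypothetical sketch of how the indices get created (the curl itself is illustrative, not our actual provisioning code):

```
# Hypothetical sketch of our index creation:
# 10 primaries, 1 replica each = 20 shards total.
curl -XPUT 'http://localhost:9200/globallogs.2.20160708' -d '{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 1
  }
}'
```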

To recreate: assume a steady stream of 5-10k docs per second being inserted into the "main" index. Restart one of the ES nodes, and as a shard from THAT index (the one taking 5-10k writes per second) starts recovering and hits the "translog" stage, bulk indexing will time out for a period of time.
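
Our actual load generator is internal, but a minimal bash sketch of the kind of writer that reproduces it looks roughly like this (doc shape, batch size, and type name are illustrative; in practice we run several of these in parallel to hit 5-10k docs per second):

```
# Crude bulk writer: builds 1000-doc batches and POSTs them to _bulk in a loop.
# Doc contents and batch size are illustrative, not our real payload.
while true; do
  BODY=""
  for i in $(seq 1 1000); do
    BODY+='{"index":{"_index":"globallogs.2.20160708","_type":"log"}}
{"msg":"test message","ts":"2016-07-08T12:00:00Z"}
'
  done
  curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary "$BODY" > /dev/null
done
```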

The recovery cat endpoint shows:

```
globallogs.2.20160708                          5     1402385 replica    translog node-a   node-b  n/a        n/a      134   100.0%        9034842507  100.0%        134         9034842507  97635    12.4%            784284
```

In this case the shard was about 10G in size.
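
(The raw line is easier to read with column headers; we just poll it with something like the following while the recovery runs:)

```
# Poll recovery progress every 2s with headers, filtering completed shards.
watch -n 2 "curl -s 'http://localhost:9200/_cat/recovery?v' | grep -v done"
```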


Has anyone seen an issue like this before? And are there any good docs or tunables you could point me to that could help speed up the time a shard spends in the "translog" stage during relocation or recovery?
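
For context, these are the recovery-related tunables we've found so far and have been experimenting with; the values below are just examples of what we're trying, not recommendations:

```
# Dynamic cluster settings that throttle/batch recovery in 1.x; the values
# here are examples we've tried, not suggested defaults.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb",
    "indices.recovery.concurrent_streams": 4,
    "indices.recovery.translog_ops": 2000,
    "indices.recovery.translog_size": "1mb"
  }
}'
```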