Elasticsearch becomes unresponsive during Lucene merges after bulk indexing

I'm trying to debug an issue where running this takes many minutes:

$ time curl -s 'http://localhost:9200/_cat/segments?v' | wc -l

real    26m46.425s
user    0m0.081s
sys     0m0.001s

Specifically, I'm interested in understanding why Elasticsearch might become unresponsive for this particular endpoint.

Here are the details:

I have Elasticsearch 1.7 running via a Docker container on an EC2 i2.2xlarge. Its data directory is bind mounted to an instance store SSD. I'm running 4 concurrent bulk indexers pushing about 100 documents of about 100KB each (so roughly 10MB per bulk request). Each document has a few dozen string fields, where each field is of similar size and is analyzed. I'm using the default of 5 shards and 1 replica.
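For concreteness, each bulk request has roughly this shape (the index name `foo` matches the `_cat/indices` output below, but the field names and contents here are made up; the real documents carry a few dozen ~100KB analyzed string fields):

```shell
# Build a tiny bulk body in the same NDJSON shape as the real requests:
# one action line followed by one source line per document.
# (The real indexers send ~100 documents of ~100KB each per request.)
body=$(printf '%s\n' \
  '{"index":{"_index":"foo","_type":"doc"}}' \
  '{"f1":"...analyzed text...","f2":"...analyzed text..."}' \
  '{"index":{"_index":"foo","_type":"doc"}}' \
  '{"f1":"...analyzed text...","f2":"...analyzed text..."}')

# Each document contributes exactly two lines to the body.
echo "$body" | wc -l

# The indexers then POST this to the _bulk endpoint, e.g.:
#   curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary "$body"
```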

My configuration is not quite vanilla, though. Here are all of the non-default settings:

indices.store.throttle.max_bytes_per_sec: 5mb
index.merge.scheduler.max_thread_count: 1

I'm also running ES with a heap size of 8g.
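For completeness, the container is started roughly like this (the image tag, host path, and container name here are illustrative, not copied from the real setup; ES 1.x reads its heap size from the ES_HEAP_SIZE environment variable at startup):

```shell
# Illustrative invocation only: bind mount the instance store SSD as the
# data directory and set the 8g heap via ES_HEAP_SIZE (honored by ES 1.x).
docker run -d \
  -p 9200:9200 \
  -e ES_HEAP_SIZE=8g \
  -v /mnt/instance-store/esdata:/usr/share/elasticsearch/data \
  elasticsearch:1.7
```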

I recognize that some of these settings are a bit unorthodox, especially on an SSD. Indeed, if I revert them to the defaults, then I haven't (yet) been able to reproduce the issue. What I'm after is a better conceptual model of what can cause Elasticsearch to become completely unresponsive to certain requests. The most obvious explanation is that Lucene is spending a lot of time merging segments and using up all available IO resources, but merges are throttled to 5MB/sec. Indeed, iotop reports sustained disk writes at 5MB/sec, and the output of iostat looks uninteresting. Profiling Elasticsearch over HTTP while it is in this state does confirm that merging is happening:

[profiler output elided]

Elasticsearch is also using very few CPU resources.

This makes sense, of course, because of the obscenely low merge throttle. While Elasticsearch is in this state, other requests also take an uncharacteristically long time:

$ time curl 'http://localhost:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open   foo     5   1      33500            0     13.1gb         13.1gb

real    0m21.686s
user    0m0.000s
sys     0m0.007s
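For scale: at the configured 5MB/sec throttle, rewriting even a fraction of this 13.1gb index takes tens of minutes, the same order of magnitude as the 26-minute hang above. Assuming (and it is only an assumption) that the stuck request is somehow waiting on a merge to complete, the arithmetic is plausible:

```shell
# Back-of-the-envelope: time to write a hypothetical 8GB of merged
# segments at the configured 5MB/sec store throttle.
merged_mb=$((8 * 1024))          # assumed merge size, in MB
rate_mb_per_sec=5                # indices.store.throttle.max_bytes_per_sec
secs=$((merged_mb / rate_mb_per_sec))
echo "${secs}s ~= $((secs / 60)) minutes"   # prints "1638s ~= 27 minutes"
```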

Yet other requests are still quite fast:

$ time curl 'http://localhost:9200/_count?pretty'
{
  "count" : 33500,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}
real    0m0.021s
user    0m0.002s
sys     0m0.004s

Search is still fast:

$ time curl -s 'http://localhost:9200/_search?q=g:"Fred"&pretty'
[output elided]
real    0m0.024s
user    0m0.000s
sys     0m0.007s

Interestingly, once the request to http://localhost:9200/_cat/segments returns, Elasticsearch seems to become unstuck: requests to _cat/indices, for example, become fast again, and so do subsequent requests to _cat/segments. I note, though, that profiling (and iotop) indicate that merges are still running, so what exactly is causing Elasticsearch to become unresponsive isn't clear to me.

Any ideas what's happening? Thanks!