I'm trying to debug an issue where running this takes many minutes:
$ time curl -s 'http://localhost:9200/_cat/segments?v' | wc -l
972
real 26m46.425s
user 0m0.081s
sys 0m0.001s
Specifically, I'm interested in understanding why Elasticsearch might become unresponsive on this particular endpoint.
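For comparison, the same per-segment information should also be available through the JSON segments API, though I don't know whether it behaves any differently while the node is in this state:
$ time curl -s 'http://localhost:9200/_segments?pretty' | wc -l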
Here are the details:
I have Elasticsearch 1.7 running via a Docker container on an EC2 i2.2xlarge. Its data directory is bind-mounted to an instance-store SSD. I'm running 4 concurrent bulk indexers; each bulk request contains about 100 documents of roughly 100KB each, so roughly 10MB per bulk request. Each document has a few dozen analyzed string fields of similar size. I'm using the default of 5 shards and 1 replica.
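For concreteness, each bulk request looks roughly like the following sketch (the index name foo matches the _cat/indices output below; the type name doc, the file name, and the field names are placeholders, and the per-field sizes are just the stated ~100KB spread over a few dozen fields):
$ curl -s -XPOST 'http://localhost:9200/foo/doc/_bulk' --data-binary @batch.ndjson
where batch.ndjson contains ~100 action/source pairs along these lines:
{"index":{}}
{"field_01":"...a few KB of analyzed text...","field_02":"...","field_03":"..."}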
My configuration, however, is slightly non-vanilla. Here are all the non-default settings:
indices.store.throttle.max_bytes_per_sec: 5mb
index.merge.scheduler.max_thread_count: 1
index.refresh_interval: 30s
I'm also running ES with a heap size of 8g.
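These can be double-checked at runtime via the standard settings endpoints (output elided):
$ curl -s 'http://localhost:9200/foo/_settings?pretty'
$ curl -s 'http://localhost:9200/_nodes/settings?pretty'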
I recognize that some of these settings are a bit unorthodox, especially on an SSD. Indeed, if I revert them all to the defaults, I haven't (yet) been able to reproduce the issue. What I'm after is a better conceptual model of what can cause Elasticsearch to become completely unresponsive to certain requests. The most obvious explanation is that Lucene is spending a lot of time merging segments and using up all available IO resources, but merges are throttled to 5MB/sec. Indeed, iotop reports sustained disk writes at 5MB/sec, and the output of iostat looks uninteresting. Profiling Elasticsearch over HTTP while it is in this state does confirm that merging is happening [profiler output elided]. Elasticsearch is also using very few CPU resources.
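For reference, a lightweight way to look at this over plain HTTP is the hot threads API; type=wait (if I'm reading the docs right) samples waiting rather than busy threads, which seems more relevant here given the throttling. This is just a sketch and the parameters may need tweaking for 1.7:
$ curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5&type=wait'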
The low CPU usage makes sense, of course, given the obscenely low merge throttle. While Elasticsearch is in this state, other requests also take an uncharacteristically long time:
$ time curl 'http://localhost:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open foo 5 1 33500 0 13.1gb 13.1gb
real 0m21.686s
user 0m0.000s
sys 0m0.007s
Yet other requests are still quite fast:
$ time curl 'http://localhost:9200/_count?pretty'
{
"count" : 33500,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
real 0m0.021s
user 0m0.002s
sys 0m0.004s
Search is still fast:
$ time curl -s 'http://localhost:9200/_search?q=g:"Fred"&pretty'
[output elided]
real 0m0.024s
user 0m0.000s
sys 0m0.007s
Interestingly, once the request to http://localhost:9200/_cat/segments returns, Elasticsearch seems to become unstuck, and requests to _cat/indices, for example, become fast again. Similarly for _cat/segments. I note, though, that profiling (and iotop) indicates that merges are still running, so what exactly is causing Elasticsearch to become unresponsive isn't clear to me.
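If it's relevant, one way to see whether the management thread pool (which I believe handles the stats calls behind the _cat endpoints in 1.x) is saturated would be something like the following; I haven't verified the exact column names against 1.7:
$ curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,management.active,management.queue,management.rejected'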
Any ideas what's happening? Thanks!