I'm trying to debug an issue where running this takes many minutes:
$ time curl -s 'http://localhost:9200/_cat/segments?v' | wc -l
972
real 26m46.425s
user 0m0.081s
sys 0m0.001s
Specifically, I'm interested in understanding why Elasticsearch might become unresponsive on this particular endpoint.
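For comparison, the same per-segment information should also be available through the JSON segments API, though I don't know whether it behaves any differently while the node is in this state:
$ time curl -s 'http://localhost:9200/_segments?pretty' | wc -l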
Here are the details:
I have Elasticsearch 1.7 running via a Docker container on an EC2 i2.2xlarge. Its data directory is bind-mounted to an instance-store SSD. I'm running 4 concurrent bulk indexers; each bulk request contains about 100 documents of roughly 100KB each, so roughly 10MB per bulk request. Each document has a few dozen analyzed string fields of similar size. I'm using the default of 5 shards and 1 replica.
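For concreteness, each bulk request looks roughly like the following sketch (the index name foo matches the _cat/indices output below; the type name doc, the file name, and the field names are placeholders, and the per-field sizes are just the stated ~100KB spread over a few dozen fields):
$ curl -s -XPOST 'http://localhost:9200/foo/doc/_bulk' --data-binary @batch.ndjson
where batch.ndjson contains ~100 action/source pairs along these lines:
{"index":{}}
{"field_01":"...a few KB of analyzed text...","field_02":"...","field_03":"..."}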
My configuration, however, is slightly non-vanilla. Here are all the non-default settings:
indices.store.throttle.max_bytes_per_sec: 5mb
index.merge.scheduler.max_thread_count: 1
index.refresh_interval: 30s
I'm also running ES with a heap size of 8g.
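These can be double-checked at runtime via the standard settings endpoints (output elided):
$ curl -s 'http://localhost:9200/foo/_settings?pretty'
$ curl -s 'http://localhost:9200/_nodes/settings?pretty'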
I recognize that some of these settings are a bit unorthodox, especially on an SSD. Indeed, if I revert them all to the defaults, I haven't (yet) been able to reproduce the issue. What I'm after is a better conceptual model of what can cause Elasticsearch to become completely unresponsive to certain requests. The most obvious explanation is that Lucene is spending a lot of time merging segments and using up all available IO resources, but merges are throttled to 5MB/sec. Indeed, iotop reports sustained disk writes at 5MB/sec, and the output of iostat looks uninteresting. Profiling Elasticsearch over HTTP while it is in this state does confirm that merging is happening [profiler output elided]. Elasticsearch is also using very few CPU resources.
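For reference, a lightweight way to look at this over plain HTTP is the hot threads API; type=wait (if I'm reading the docs right) samples waiting rather than busy threads, which seems more relevant here given the throttling. This is just a sketch and the parameters may need tweaking for 1.7:
$ curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5&type=wait'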
The low CPU usage makes sense, of course, given the obscenely low merge throttle. While Elasticsearch is in this state, other requests also take an uncharacteristically long time:
$ time curl 'http://localhost:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open foo 5 1 33500 0 13.1gb 13.1gb
real 0m21.686s
user 0m0.000s
sys 0m0.007s
Yet other requests are still quite fast:
$ time curl 'http://localhost:9200/_count?pretty'
{
"count" : 33500,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
real 0m0.021s
user 0m0.002s
sys 0m0.004s
Search is still fast:
$ time curl -s 'http://localhost:9200/_search?q=g:"Fred"&pretty'
[output elided]
real 0m0.024s
user 0m0.000s
sys 0m0.007s
Interestingly, once the request to http://localhost:9200/_cat/segments returns, Elasticsearch seems to become unstuck, and requests to _cat/indices, for example, become fast again. Similarly for _cat/segments. I note, though, that profiling (and iotop) indicates that merges are still running, so what exactly is causing Elasticsearch to become unresponsive isn't clear to me.
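If it's relevant, one way to see whether the management thread pool (which I believe handles the stats calls behind the _cat endpoints in 1.x) is saturated would be something like the following; I haven't verified the exact column names against 1.7:
$ curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,management.active,management.queue,management.rejected'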
Any ideas what's happening? Thanks!