Bulk indexing stalls (7.2.0)

Hi all,

I'm in the process of upgrading an ES 5.3.0 cluster to the latest 7.2 release, and while re-indexing the data I'm seeing some odd behaviour in Elasticsearch. In short, the bulk indexing API appears to randomly "stall", with no discernible reason why.

Our setup is a PHP script, using the official elasticsearch/elasticsearch package from Composer, which parses a text file and pushes the resulting documents into our ES cluster via the bulk API. It's configured to send batches of 100 documents over up to 5 concurrent connections (using the futures feature of the client). The documents are pretty small, around 200 bytes each (UK postal addresses).
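For reference, the core of the script boils down to something like this. It's a simplified, synchronous sketch -- the real script uses the futures feature for the 5 concurrent connections, and parseAddressLine() plus the file and index names here are just placeholders:

    <?php
    require 'vendor/autoload.php';

    use Elasticsearch\ClientBuilder;

    $client = ClientBuilder::create()
        ->setHosts(['es-node1:9200', 'es-node2:9200'])
        ->build();

    $params = ['body' => []];
    $count  = 0;

    foreach (new SplFileObject('paf.txt') as $line) {
        if (trim($line) === '') {
            continue;
        }
        // the bulk API body alternates action/metadata lines and document lines
        $params['body'][] = ['index' => ['_index' => 'pafentries_2019-01']];
        $params['body'][] = parseAddressLine($line); // assoc array, ~200 bytes per doc

        if (++$count % 100 === 0) {       // flush every 100 documents
            $client->bulk($params);
            $params = ['body' => []];
        }
    }

    if (!empty($params['body'])) {        // flush the final partial batch
        $client->bulk($params);
    }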

On the Elasticsearch side of things it's a fairly default install -- we have a two-node cluster running inside two separate Virtuozzo Linux containers, each with 4GB RAM allocated. The JVM heap is set to 2GB. Both servers are deployed using the official elasticsearch Puppet module, so they should be identical, and the only non-default configuration is the heap size and various cluster-related settings:

    config      => {
      'cluster.name'                     => "pc-${server_facts['environment']}",
      'network.host'                     => '0.0.0.0',
      'cluster.initial_master_nodes'     => "${facts['hostname']}-${facts['fqdn']}",
      'discovery.zen.ping.unicast.hosts' => join($server_ips, ','),
    },
    jvm_options => [
      '-Xms2g',
      '-Xmx2g',
    ],

Our indexing script can be quite happily chugging along, indexing around 1000 docs/sec, when all of a sudden it starts getting 429 responses back from ES:

    {
       "error" : {
          "bytes_wanted" : 2080067686,
          "reason" : "[parent] Data too large, data for [<http_request>] would be [2080067686/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2079974248/1.9gb], new bytes reserved: [93438/91.2kb]",
          "root_cause" : [
             {
                "bytes_wanted" : 2080067686,
                "type" : "circuit_breaking_exception",
                "bytes_limit" : 2040109465,
                "durability" : "PERMANENT",
                "reason" : "[parent] Data too large, data for [<http_request>] would be [2080067686/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2079974248/1.9gb], new bytes reserved: [93438/91.2kb]"
             }
          ],
          "durability" : "PERMANENT",
          "type" : "circuit_breaking_exception",
          "bytes_limit" : 2040109465
       },
       "status" : 429
    }

At this point our code recognises the circuit_breaking_exception and retries the failed request after waiting for a few seconds (somewhere between 5s and 30s). However, it's somewhat random how long it takes Elasticsearch to start responding again. The first time it happened on my current run of the script it took 13 minutes before indexing resumed. As I write this it's happening again: the script has been sitting there for almost an hour and is still getting 429 responses.
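For what it's worth, the retry handling boils down to roughly this (a simplified sketch -- the real code only resends the items that actually failed, and whether a 429 surfaces as a thrown exception or a per-item error depends on the client configuration):

    // Send a bulk request, retrying with a random 5-30s backoff whenever the
    // cluster pushes back (429 / circuit_breaking_exception).
    function bulkWithRetry($client, array $params): array
    {
        while (true) {
            try {
                $response = $client->bulk($params);
                if (empty($response['errors'])) {
                    return $response;   // every item was indexed successfully
                }
                // some items were rejected (e.g. circuit_breaking_exception);
                // fall through to the backoff and try again
            } catch (\Exception $e) {
                // request-level failures (including 429s thrown by the client)
                // also end up here
            }
            sleep(random_int(5, 30));   // wait somewhere between 5s and 30s
        }
    }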

I've also noticed that occasionally we get "shard not found" errors, which I don't really understand. The cluster status remains at GREEN throughout with all shards allocated (6 shards split evenly across the 2 nodes):

    {
       "index" : {
          "_type" : "_doc",
          "_index" : "pafentries_2019-01",
          "error" : {
             "index" : "pafentries_2019-01",
             "type" : "shard_not_found_exception",
             "index_uuid" : "7PftHIbDR3etFfqpU7KRXQ",
             "shard" : "2",
             "reason" : "no such shard"
          },
          "_id" : "SYd0j2sBINvOkoYgEkSp",
          "status" : 404
       }
    }

In the meantime, Elasticsearch is sitting there doing nothing. CPU usage is at 0-1%. I/O is at 0%. This is the part that makes no sense to me at all -- it appears to have a full queue of documents to index, but it isn't actually indexing anything, so the queue never empties.

There's no search load on the cluster either, other than a handful of queries I did manually to see if searching was still working while the indexing was paused (answer: maybe -- some searches failed with the same circuit_breaking_exception as above, whereas others were fine and returned almost instantly).

I've put up a gist here with the log output from both nodes plus the output of running jstack on one of the processes.

Anyone have any clue what could be going on here? Is it an Elasticsearch issue or is our indexing script doing something wrong?

Can you provide the node stats from your cluster? It might help us figure out where the memory is being used.
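You can grab them with something like:

    curl -s 'http://localhost:9200/_nodes/stats?pretty'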

Sure, I've put that here: https://gist.github.com/crispygoth/dcec7d350bfb031afee86e27687fd5d0

Note this is a new run of our indexing script, as I accidentally broke the cluster, so it might not match up exactly with the previous data, but the errors the script is getting are basically the same as before.

It looks like it's only the real memory circuit breaker that's tripping. All the other breakers show very little usage.

What's surprising, though, is that only young-generation GCs have happened. Can you share your JVM options (they can be retrieved using the nodes info API)? I'm wondering if there are some customized settings that are interacting badly with the real memory circuit breaker.
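Something like this will show them (the jvm section includes the input_arguments the JVM was started with):

    curl -s 'http://localhost:9200/_nodes/jvm?pretty'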

You could also try deactivating the real-memory circuit breaker by setting indices.breaker.total.use_real_memory to false, and see whether your indexing runs smoothly without OutOfMemory errors and without any indexing pipeline stalls.
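Note that it's a static setting, so it needs to go into elasticsearch.yml (or your Puppet config hash) on both nodes, followed by a restart:

    indices.breaker.total.use_real_memory: false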

OK, the node info is here: https://gist.github.com/crispygoth/f53aeb75c5652016b82aa41c1eb8a511

The jvm.options file is generated by the puppet-elasticsearch module, although now that I think about it I did have to modify it slightly to get Elasticsearch to start up, as it was trying to use obsolete JVM flags, and those are GC-related: my changeset
Although I think they're mostly related to logging rather than actual GC behaviour?

I'll change the config to turn off the breaker now and see how the script goes.

After setting indices.breaker.total.use_real_memory to false our indexer seems to be running OK. It's actually slightly faster, and it hasn't hit a single 429 response yet. It's only indexed about 5% of the data so far, though. I'll keep an eye on it for the rest of the day; at the current rate it'll take about 8 hours to get through everything.

It looks like you're not using the default configuration with the CMS garbage collector. Instead you're using the G1 garbage collector, but it's missing crucial configuration options. Can you revert to the default jvm.options file and leave the GC settings unchanged?
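For reference, the GC section of the stock 7.x jvm.options is just the CMS settings (double-check against the file shipped in the DEB package):

    ## GC configuration
    -XX:+UseConcMarkSweepGC
    -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly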

OK, I guess this is related to this bug in the puppet-elasticsearch module. The only changes I made to the jvm.options provided by that module were those necessary just to get Elasticsearch to start. I'll revert to the jvm.options provided by the DEB package and see how that goes.

Good find. That's indeed unfortunate.

Quick update on this: after fixing jvm.options to match the one from the DEB package and restarting the cluster, our indexer has been running for the past 2 hours, is about a third of the way through the data, and hasn't hit any memory issues or circuit breaker exceptions.

Many thanks for helping me out with this issue! I'll keep an eye on the puppet module bug; it looks like there's a fairly new pull request being worked on.

Thanks for the update.
