Hi all,
I'm in the process of upgrading an ES 5.3.0 cluster to the latest 7.2 version, and while re-indexing the data I'm noticing some odd behaviour in Elasticsearch. In short, the bulk indexing API appears to be "stalling" at random, with no discernible reason why.
Our setup here is that we have a PHP script, using the official elasticsearch/elasticsearch package from Composer, which parses a text file and uses the bulk API to push the resulting documents into our ES cluster. It is configured to send batches of 100 documents and to use up to 5 concurrent connections (via the client's futures feature). The documents are small, around 200 bytes each (UK postal addresses).
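For reference, the core of the indexer looks roughly like the sketch below. This is heavily simplified and not our literal code: the file parsing is elided, and parseAddressFile(), the host names and the exact batching details are just placeholders.

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// Cap the async curl handler at 5 concurrent connections.
$handler = ClientBuilder::defaultHandler(['max_handles' => 5]);
$client  = ClientBuilder::create()
    ->setHosts(['es-node1:9200', 'es-node2:9200'])
    ->setHandler($handler)
    ->build();

$futures = [];
$batch   = [];

foreach (parseAddressFile('paf.txt') as $doc) {      // hypothetical parser
    $batch[] = ['index' => ['_index' => 'pafentries_2019-01']];
    $batch[] = $doc;                                 // ~200 byte address record

    if (count($batch) >= 200) {                      // 100 docs = 200 bulk lines
        // 'future' => 'lazy' makes bulk() return a promise instead of blocking.
        $futures[] = $client->bulk([
            'body'   => $batch,
            'client' => ['future' => 'lazy'],
        ]);
        $batch = [];
    }
}

// Force any outstanding futures to resolve before exiting.
foreach ($futures as $future) {
    $future->wait();
}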
On the Elasticsearch side of things it's a fairly default install -- we have a two-node cluster running in two separate Virtuozzo Linux containers, each with 4GB RAM allocated and the JVM heap set to 2GB. Both servers are deployed using the official elasticsearch Puppet module, so they should be identical, and the only non-default configuration is the heap size plus various cluster-related settings:
config => {
  'cluster.name'                     => "pc-${server_facts['environment']}",
  'network.host'                     => '0.0.0.0',
  'cluster.initial_master_nodes'     => "${facts['hostname']}-${facts['fqdn']}",
  'discovery.zen.ping.unicast.hosts' => join($server_ips, ','),
},
jvm_options => [
  '-Xms2g',
  '-Xmx2g',
],
Our indexing script can be chugging along quite happily, indexing around 1,000 docs/sec, when all of a sudden it starts getting 429 responses back from ES:
{
  "error" : {
    "bytes_wanted" : 2080067686,
    "reason" : "[parent] Data too large, data for [<http_request>] would be [2080067686/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2079974248/1.9gb], new bytes reserved: [93438/91.2kb]",
    "root_cause" : [
      {
        "bytes_wanted" : 2080067686,
        "type" : "circuit_breaking_exception",
        "bytes_limit" : 2040109465,
        "durability" : "PERMANENT",
        "reason" : "[parent] Data too large, data for [<http_request>] would be [2080067686/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2079974248/1.9gb], new bytes reserved: [93438/91.2kb]"
      }
    ],
    "durability" : "PERMANENT",
    "type" : "circuit_breaking_exception",
    "bytes_limit" : 2040109465
  },
  "status" : 429
}
At this point our code recognises the circuit_breaking_exception and retries the failed request after waiting a few seconds (somewhere between 5s and 30s). However, how long it takes Elasticsearch to start responding again seems to be fairly random. The first time it happened on my current run of the script it took 13 minutes before indexing resumed. As I write this it's happening again, and the script has been retrying for almost an hour but is still getting 429 responses.
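For what it's worth, the retry wrapper is essentially the sketch below, not the literal code. Since I'm not sure which concrete exception class the PHP client maps a 429 to, it just looks for the circuit_breaking_exception string in the message:

function bulkWithRetry($client, array $body, int $maxRetries = 10): array
{
    for ($attempt = 0; $attempt <= $maxRetries; $attempt++) {
        try {
            // Synchronous path shown here; the futures version resolves the
            // promise and inspects the failure in the same way.
            return $client->bulk(['body' => $body]);
        } catch (\Exception $e) {
            if (strpos($e->getMessage(), 'circuit_breaking_exception') === false) {
                throw $e;                    // not a breaker trip, so give up
            }
            sleep(random_int(5, 30));        // back off for 5-30s before retrying
        }
    }
    throw new \RuntimeException('bulk request kept tripping the circuit breaker');
}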
I've also noticed that occasionally we get "shard not found" errors, which I don't really understand. The cluster status remains at GREEN throughout with all shards allocated (6 shards split evenly across the 2 nodes):
{
  "index" : {
    "_type" : "_doc",
    "_index" : "pafentries_2019-01",
    "error" : {
      "index" : "pafentries_2019-01",
      "type" : "shard_not_found_exception",
      "index_uuid" : "7PftHIbDR3etFfqpU7KRXQ",
      "shard" : "2",
      "reason" : "no such shard"
    },
    "_id" : "SYd0j2sBINvOkoYgEkSp",
    "status" : 404
  }
}
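The "all shards allocated" part is based on checks along these lines, using the same $client as the sketch above (this should be the equivalent of GET _cluster/health?level=shards):

// Shard-level health for the index named in the error above.
$health = $client->cluster()->health([
    'index' => 'pafentries_2019-01',
    'level' => 'shards',
]);

// Every shard (primary and replica) reports green here.
print_r($health['indices']['pafentries_2019-01']['shards']);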
In the meantime, Elasticsearch is sitting there doing nothing: CPU usage is at 0-1% and I/O is at 0%. This is the part that doesn't make any sense to me at all -- it appears to have a full queue of documents to index, but it isn't actually indexing anything, so the queue never empties.
There's no search load on the cluster either, other than a handful of queries I ran manually to see whether searching still worked while indexing was paused (answer: sort of -- some searches failed with the same circuit_breaking_exception as above, whereas others were fine and returned almost instantly).
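If I'm reading the docs correctly, the 2040109465-byte limit in the error above is simply the default parent breaker limit of 95% of our 2GB heap. For anyone who wants to see the breaker state while the indexer is stalled, this sketch (again reusing $client from above) should be the equivalent of GET _nodes/stats/breaker:

// Dump circuit-breaker state for every node while the indexer is stalled.
$stats = $client->nodes()->stats(['metric' => 'breaker']);

foreach ($stats['nodes'] as $nodeId => $node) {
    $parent = $node['breakers']['parent'];
    printf(
        "%s parent breaker: estimated %d / limit %d bytes, tripped %d times\n",
        $nodeId,
        $parent['estimated_size_in_bytes'],
        $parent['limit_size_in_bytes'],
        $parent['tripped']
    );
}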
I've put up a gist here with the log output from both nodes, plus the output of running jstack against one of the Elasticsearch processes.
Does anyone have any clue what could be going on here? Is it an Elasticsearch issue, or is our indexing script doing something wrong?