While running an indexing job overnight on our development cluster, we hit
an out-of-memory error that left the cluster in an unrecoverable state. The
first such exception looked like:
ES10 OutOfMemoryError: https://gist.github.com/e4e35733fcd06ec6a9a4
This was followed by a second out of memory error on another node:
ES1 OutOfMemoryError: https://gist.github.com/1d18321d35bce18ad738
The stack trace at the top of that second gist includes a failed query.
There were many of these query errors around 5:35, whereas the OOM errors
started around 5:40. I believe the query errors came from a separate job
that was running while I was indexing new data. Is it possible that the
queries caused the OOM error?
We then saw another OOM error a few minutes later:
ES19 OutOfMemoryError: https://gist.github.com/e308b2123081aff02438
After this, the cluster was left in an unrecoverable state. We only have a
single replica, so losing 3 nodes almost certainly meant losing a few shards
in their entirety. Restarting the nodes did not bring the shards back. We can
reindex fairly quickly so it isn't a huge problem, but we'd like to get to
the bottom of why we were seeing OOM errors across the cluster.
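To illustrate why a single replica cannot survive three node failures, here is a minimal sketch. The shard-to-node layout is hypothetical (we no longer have the real allocation), and `lost_shards` is an illustrative helper, not anything from Elasticsearch itself. A shard counts as lost when every node holding a copy of it has failed.

```python
def lost_shards(shard_copies, failed_nodes):
    """Return the ids of shards whose every copy lived on a failed node."""
    failed = set(failed_nodes)
    return sorted(
        shard
        for shard, nodes in shard_copies.items()
        if nodes and set(nodes) <= failed
    )


# Hypothetical layout: shard id -> nodes holding its primary and single replica.
example_layout = {
    0: ["ES1", "ES10"],
    1: ["ES1", "ES4"],
    2: ["ES10", "ES19"],
    3: ["ES7", "ES8"],
}

# The three nodes that reported OutOfMemoryError.
example_lost = lost_shards(example_layout, ["ES1", "ES10", "ES19"])
print(example_lost)
```

With this made-up layout, shards 0 and 2 have both copies on failed nodes, while shards 1 and 3 each keep at least one copy.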
Cluster information: we have 10 total nodes. In looking at our
configuration, I already see that recover_after_nodes is incorrect:
gateway.type: local
gateway.recover_after_nodes: 5
gateway.recover_after_time: 5m
gateway.expected_nodes: 10
That should be 1 with only a single replica. Is there anything else I
should be looking for that might point to why we were seeing these errors
across the cluster? The index itself has already been recreated (as it's
our dev environment, we need to keep people moving), so we don't have that
information available anymore. Will the logs contain anything else I can
look for?
As an aside, if it were to provide any additional information, even after
recreating the indices in the cluster, we are seeing issues where shards
won't come out of this state:
{
    routing: {
        state: INITIALIZING
        primary: false
        node: 1PiZJnPRSNOacqElpolPEw
        relocating_node: null
        shard: 14
        index: documents
    },
    state: RECOVERING
    index: {
        size: 0b
        size_in_bytes: 0
    }
}
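In case it helps anyone looking for the same symptom, below is a rough sketch of how entries like the one above could be filtered once they are loaded as Python dictionaries. The input shape is assumed from the snippet above, and `stuck_replicas` is a hypothetical helper name. It only flags replicas that are INITIALIZING and RECOVERING with zero bytes on disk.

```python
def stuck_replicas(shard_entries):
    """Return (index, shard, node) for replicas that look stuck in recovery."""
    stuck = []
    for entry in shard_entries:
        routing = entry.get("routing", {})
        is_replica = not routing.get("primary", False)
        initializing = routing.get("state") == "INITIALIZING"
        recovering = entry.get("state") == "RECOVERING"
        empty = entry.get("index", {}).get("size_in_bytes", 0) == 0
        if is_replica and initializing and recovering and empty:
            stuck.append(
                (routing.get("index"), routing.get("shard"), routing.get("node"))
            )
    return stuck


# The entry shown above, written out as a Python dict.
sample_entries = [
    {
        "routing": {
            "state": "INITIALIZING",
            "primary": False,
            "node": "1PiZJnPRSNOacqElpolPEw",
            "relocating_node": None,
            "shard": 14,
            "index": "documents",
        },
        "state": "RECOVERING",
        "index": {"size": "0b", "size_in_bytes": 0},
    }
]

sample_stuck = stuck_replicas(sample_entries)
print(sample_stuck)
```

A shard that is still copying data would report a non-zero size, so the zero-byte check is what separates "slow" from "not progressing at all" in this sketch.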
Thanks for any help,
Dale