Simultaneous OutOfMemoryErrors across multiple nodes in cluster

While running an indexing job overnight on our development cluster, we ran
across an out of memory error that put our cluster into an unrecoverable
state. The first such exception looked like:

ES10 OutOfMemoryError: https://gist.github.com/e4e35733fcd06ec6a9a4

This was followed by a second out of memory error on another node:

ES1 OutOfMemoryError: https://gist.github.com/1d18321d35bce18ad738

There is a query that has an error in the stack trace at the top of that
second gist. There were a lot of these types of errors around 5:35, whereas
the OOM errors started around 5:40. I believe the query errors resulted
from a separate job that was running at the same time I was indexing new
data. Is it possible that the queries caused the OOM error?

We then saw another OOM error a few minutes later:

ES19 OutOfMemoryError: https://gist.github.com/e308b2123081aff02438

After this, the cluster was put in an unrecoverable state. We only have a
single replica, so losing 3 nodes certainly lost a few shards in their
entirety. Restarting the nodes did not bring the shards back. We can
reindex fairly quickly so it isn't a huge problem, but we'd like to get to
the bottom of why we were seeing OOM errors across the cluster.

Cluster information: we have 10 total nodes. In looking at our
configuration, I already see that recover_after_nodes is incorrect:

gateway.type: local
gateway.recover_after_nodes: 5
gateway.recover_after_time: 5m
gateway.expected_nodes: 10

That should be 1 with only a single replica. Is there anything else I
should be looking for that might point to why we were seeing these errors
across the cluster? The index itself has already been recreated (as it's
our dev environment, we need to keep people moving), so we don't have that
information available anymore. Will the logs contain anything else I can
look for?
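
For readers unfamiliar with the three gateway settings above, their interaction can be sketched roughly as follows. This is a simplified model based on the documented behaviour of the local gateway, not Elasticsearch's actual code; the class name, function name, and `seconds_waited` parameter are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class GatewaySettings:
    # Values taken from the configuration quoted above.
    recover_after_nodes: int = 5     # gateway.recover_after_nodes
    recover_after_time_s: int = 300  # gateway.recover_after_time (5m)
    expected_nodes: int = 10         # gateway.expected_nodes


def should_start_recovery(settings, nodes_joined, seconds_waited):
    """Simplified model of when local gateway recovery may begin.

    seconds_waited is the time elapsed since recover_after_nodes was met.
    """
    # All expected nodes are present: no reason to wait any longer.
    if nodes_joined >= settings.expected_nodes:
        return True
    # Below the minimum node count, recovery does not start at all.
    if nodes_joined < settings.recover_after_nodes:
        return False
    # Enough nodes, but not all of them: wait out recover_after_time.
    return seconds_waited >= settings.recover_after_time_s
```

Under this model, with the settings above, a cluster that has only 7 of its 10 nodes back would wait five minutes before recovering shards from whatever copies are available.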

As an aside, if it were to provide any additional information, even after
recreating the indices in the cluster, we are seeing issues where shards
won't come out of this state:

{
  routing: {
    state: INITIALIZING
    primary: false
    node: 1PiZJnPRSNOacqElpolPEw
    relocating_node: null
    shard: 14
    index: documents
  },
  state: RECOVERING
  index: {
    size: 0b
    size_in_bytes: 0
  }
}

Thanks for any help,

Dale

--

Hello Dale,

On Fri, Nov 9, 2012 at 3:20 AM, Dale Beermann dale@studyblue.com wrote:

Is there anything else I should be looking for that might point to why we
were seeing these errors across the cluster? The index itself has already
been recreated (as it's our dev environment, we need to keep people
moving), so we don't have that information available anymore. Will the
logs contain anything else I can look for?

I'm not sure the logs would give you any hints about what causes an
out-of-memory error, as you'd normally get errors and warnings only close
to that point. Since it's a dev environment, it would be nice if you
could re-create the issue (maybe stress the cluster more than usual)
while monitoring it with something like our SPM for Elasticsearch:
http://sematext.com/spm/elasticsearch-performance-monitoring/index.html

Then you might get more clues about what is eating the memory. My first
suspect would be caching; check out this story:
http://blog.sematext.com/2012/05/17/elasticsearch-cache-usage/

BTW, how much memory do you give to ES out of your total RAM, and how
big are your indices?

As an aside, if it were to provide any additional information, even after
recreating the indices in the cluster, we are seeing issues where shards
won't come out of this state:

Do you get any clues in the logs for this?

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

--

On Friday, November 9, 2012 4:54:20 AM UTC-8, Radu Gheorghe wrote:

I'm not sure if logs would give you any hints of what will cause an
out-of-memory error, as you'd normally get errors and warnings only close
to that point. Since it's a dev environment, it would be nice if you
can re-create the issue (maybe stress the thing more than usual),
while monitoring the cluster with something. Like our SPM for
Elasticsearch:
http://sematext.com/spm/elasticsearch-performance-monitoring/index.html

And then you might get more clues to what is eating the memory. My
first suspect would be caching, check out this story:
http://blog.sematext.com/2012/05/17/elasticsearch-cache-usage/

Well, we had the same problem happen this morning after re-indexing last
night, so it's reproducible. We're in the process of launching a new
cluster and changing a few settings, and will try reindexing again. I'll
also check out Sematext to see if it can offer any insights.

BTW, how much memory do you give to ES out of your total RAM, and how
big are your indices?

The Elasticsearch heap size was set to 9GB. We have a total index size of
around 60GB, with a single replica per index, so 120GB of total stored data
across 10 nodes. We did find another thread about a nearly identical
problem:
https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/zK0hZLY0NYI

As an additional piece of information, once we get the OutOfMemoryError, we
can't get the cluster back into a usable state. Once we restart a node that
had an OOM error, another node will inevitably go down and we seem to lose
data. This is an exception we've seen repeated a few times:

https://gist.github.com/7f6c9af1893656a285a3

As an aside, if it were to provide any additional information, even after
recreating the indices in the cluster, we are seeing issues where shards
won't come out of this state:

Do you get any clues in the logs for this?

The gist I posted above is the only thing I've seen that suggests an issue
with individual shards.

Thanks for the help so far,

Dale

--

The field cache seems to be the primary suspect here. The next time the
cluster runs out of memory, could you run the following command to check
the cache sizes on all your nodes:

curl -s -XGET 'http://localhost:9200/_cluster/nodes/stats?pretty=true' |
grep -A 11 '"cache" :' | grep '_size"'

If the reported numbers are close to the max heap size, that would be a
good indicator that the field cache is indeed the issue.
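
The same comparison can be scripted instead of read off by eye. The following is only a sketch: it assumes the node stats response has already been parsed into a dict (for example with `json.load`), and the key names `indices.cache.field_size_in_bytes`, `indices.cache.filter_size_in_bytes`, and `jvm.mem.heap_committed_in_bytes` are assumptions about the response layout that should be checked against the actual output of the command above. The sample data is invented.

```python
def cache_to_heap_ratios(stats):
    """Return {node name: (field + filter cache bytes) / committed heap bytes}."""
    ratios = {}
    for node_id, node in stats.get("nodes", {}).items():
        cache = node.get("indices", {}).get("cache", {})
        heap = node.get("jvm", {}).get("mem", {}).get("heap_committed_in_bytes", 0)
        used = cache.get("field_size_in_bytes", 0) + cache.get("filter_size_in_bytes", 0)
        # Fall back to the node id if the node has no name; None if heap is unknown.
        ratios[node.get("name", node_id)] = used / heap if heap else None
    return ratios


GB = 1024 ** 3

# Invented sample: a node with a 9GB heap whose caches hold 8.5GB.
sample_stats = {
    "nodes": {
        "node-id-1": {
            "name": "ES10",
            "indices": {"cache": {"field_size_in_bytes": 8 * GB,
                                  "filter_size_in_bytes": GB // 2}},
            "jvm": {"mem": {"heap_committed_in_bytes": 9 * GB}},
        }
    }
}

for name, ratio in cache_to_heap_ratios(sample_stats).items():
    print(name, "unknown" if ratio is None else f"{ratio:.0%} of heap in caches")
```

A ratio approaching 1.0 on the nodes that fail would support the field cache theory; low ratios would point elsewhere.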

The TooLongFrameException is benign. It basically means that somebody on
10.12.110.101 is trying to connect to port 9300 using curl or an HTTP
client. Just ask them to switch to port 9200. :)

On Friday, November 9, 2012 5:46:28 PM UTC-5, Dale Beermann wrote:

As an additional piece of information, once we get the OutOfMemoryError,
we can't get the cluster back into a usable state. Once we restart a node
that had an OOM error, another node will inevitably go down and we seem to
lose data. This is an exception we've seen repeated a few times:

https://gist.github.com/7f6c9af1893656a285a3

--