java.lang.OutOfMemoryError causing cluster to fail

Background
We have a three-node cluster consisting of prd-elastic-x, prd-elastic-y
and prd-elastic-z. Each box is an EC2 m2.xlarge with 17.1 GB of RAM.

Elasticsearch is run with the following Java memory configuration:
java -server -Djava.net.preferIPv4Stack=true
-Des.config=/usr/local/etc/elasticsearch/elasticsearch.yml -Xms10247m
-Xmx10247m -Xss256k

We run some very memory-intensive terms aggregations on fields with very
high cardinality (millions of documents, tens of thousands of values). We
have had issues with running out of memory before, including run-ins with
the OS OOM killer, but yesterday we had a particularly bad outage.
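
For illustration, a terms aggregation of the kind we mean looks roughly like
this (index and field names here are made up, not our real ones):

curl -XPOST 'localhost:9200/events/_search?search_type=count' -d '{
  "aggs": {
    "by_user": {
      "terms": { "field": "user_id", "size": 50000 }
    }
  }
}'

With tens of thousands of distinct values per field, the field data loaded
for queries like this can occupy a large share of the heap.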

There are no signs in the syslog that the OOM killer took any action.

Timeline

12:35
prd-elastic-x:

[2014-09-04 11:35:53,002][WARN ][monitor.jvm ] [prd-elastic-x]
[gc][old][1812644][175] duration [26.3s], collections [2]/[27.5s], total
[26.3s]/[1.3m], memory [7.6gb]->[4.7gb]/[9.9gb], all_pools {[young]
[129.9mb]->[37.5mb]/[133.1mb]}{[survivor] [16.6mb]->[0b]/[16.6mb]}{[old]
[7.5gb]->[4.7gb]/[9.8gb]}

12:36
prd-elastic-z:

[2014-09-04 11:36:02,809][WARN ][monitor.jvm ] [prd-elastic-z]
[gc][old][3019662][378] duration [34.9s], collections [2]/[36.1s], total
[34.9s]/[2.8m], memory [8.8gb]->[6.3gb]/[9.9gb], all_pools {[young]
[116.9mb]->[12.6mb]/[133.1mb]}{[survivor] [16.6mb]->[0b]/[16.6mb]}{[old]
[8.6gb]->[6.3gb]/[9.8gb]}

12:38
We see the first application error: a request has timed out.

We start investigating. prd-elastic-y is reporting the cluster state as
green, with all three nodes still in the cluster. However, attempts to
load debug information from endpoints such as /_cat/recovery hang, and we
continue to receive errors at the application level.

We stop all non-critical application processes to try to reduce the load
on Elasticsearch, in the hope that it will recover.

12:41
A lot of errors start appearing in the logs for prd-elastic-z, including
but in no way limited to:

[2014-09-04 11:40:14,284][WARN
][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the
selector loop.
java.lang.OutOfMemoryError: Java heap space

12:43
We start seeing errors on prd-elastic-x as well, including:

[2014-09-04 11:43:51,440][WARN ][netty.channel.DefaultChannelPipeline] An
exception was thrown by a user handler while handling an exception event
([id: 0xe2f53491, /10.78.198.78:55953 => /10.79.75.106:9300] EXCEPTION:
java.lang.OutOfMemoryError: Java heap space)
java.lang.OutOfMemoryError: Java heap space

12:46
prd-elastic-x reports that it is unable to ping prd-elastic-z:

[2014-09-04 11:46:23,016][INFO ][discovery.ec2 ] [prd-elastic-x]
master_left
[[prd-elastic-z][nbnLdQjgS4KVe7rGx8eIWw][ip-10-76-41-241.ec2.internal][inet[/10.76.41.241:9300]]{aws_availability_zone=us-east-1d,
max_local_storage_nodes=1}], reason [failed to ping, tried [3] times, each
with maximum [30s] timeout]

12:48
We decide to restart prd-elastic-z (gracefully, using service elasticsearch
restart), based on the failed-ping log message above.

prd-elastic-z:

[2014-09-04 11:48:54,186][INFO ][node ] [prd-elastic-z]
stopping ...

13:03
prd-elastic-z has still not stopped (it still appears in ps -aux with a long
uptime), so we decide to kill it forcefully (kill -9).

13:06
The cluster is still not behaving itself. We decide to restart the other two
nodes, starting with prd-elastic-y.

13:08
prd-elastic-y has restarted without any further intervention, but is
reporting as the only node in the cluster, in the 'red' state, and with no
indices or shards.

13:09
We try to restart prd-elastic-x gracefully.

prd-elastic-x:

[2014-09-04 12:09:47,480][INFO ][node ] [prd-elastic-x]
stopping ...

13:22
prd-elastic-x has still not stopped, so again we are forced to kill -9
it. We then restart it manually using sudo service elasticsearch start.

prd-elastic-y:

[2014-09-04 12:23:14,108][INFO ][cluster.service ] [prd-elastic-y]
added
{[prd-elastic-x][OF5WaLTzRVG92z1Y7zMW2g][ip-10-79-75-106.ec2.internal][inet[/10.79.75.106:9300]]{aws_availability_zone=us-east-1d,
max_local_storage_nodes=1},}, reason: zen-disco-receive(join from
node[[prd-elastic-x][OF5WaLTzRVG92z1Y7zMW2g][ip-10-79-75-106.ec2.internal][inet[/10.79.75.106:9300]]{aws_availability_zone=us-east-1d,
max_local_storage_nodes=1}])
[2014-09-04 12:23:14,119][INFO ][gateway ] [prd-elastic-y]
delaying initial state recovery for [5m]. expecting [3] nodes, but only
have [2]

13:24
We decide to restart prd-elastic-z. We were originally going to leave it
out of the cluster and bring in a completely new instance, since we had to
kill it and therefore were not sure about its data integrity, but given that
it fared no worse than prd-elastic-x, we figured it made no difference at
this point.

13:27
All three nodes are in the cluster, and recovery starts. The cluster
reports yellow almost immediately.

13:37
The cluster is green and 'healthy' again (or so we believe).


Having gone through the logs, we believe the root cause is the
"java.lang.OutOfMemoryError: Java heap space" exceptions that appear shortly
after the long old-generation garbage collections on prd-elastic-x and
prd-elastic-z. prd-elastic-y seemed 'stable' throughout, but was reporting
the cluster state incorrectly and then appeared to lose all of its data (or
its cluster state, we are not sure which).

The logs for all nodes can be found here:

Questions

  • Why is Elasticsearch allowed to get into this state? Is it poor
    configuration on our part or a bug in the software?
  • Why was prd-elastic-y reporting its state as 'green' even though the
    cluster had seemingly failed?
  • Why did prd-elastic-y report as having no indices when it was the only
    node in the cluster?
  • How did the heap space exception end up causing networking problems
    (failed to ping)? Or is that a separate problem entirely?
  • Is there anything we can do to prevent this from happening again, other
    than throwing more hardware at the problem?


Why is Elasticsearch allowed to get into this state? Is it poor
configuration on our part or a bug in the software?

It is the JVM hitting a low-memory condition. No Java code can execute when
the stack and heap are full and free memory drops to a few bytes. ES works
hard to overcome these situations.
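
If you want to see what is filling the heap the next time this happens, you
can add the standard HotSpot flags for heap dumps to your startup command
(the path is only an example):

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch/

The resulting .hprof file can be opened in a heap analyzer (e.g. Eclipse MAT)
to see which structures dominate the heap, for example field data loaded for
the aggregations.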

Why was prd-elastic-y reporting its state as 'green' even though the
cluster had seemingly failed?

There is a delay before cluster operations time out; there is no good method
to detect faults earlier. The timeouts can be configured.
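
For example, the fault detection behind "failed to ping, tried [3] times,
each with maximum [30s] timeout" is controlled by these settings in
elasticsearch.yml (the values below are illustrative, not a recommendation):

# elasticsearch.yml
discovery.zen.fd.ping_interval: 1s   # how often nodes ping each other (default 1s)
discovery.zen.fd.ping_timeout: 10s   # how long to wait for each ping (default 30s)
discovery.zen.fd.ping_retries: 3     # failed pings before a node is dropped (default 3)

Lowering the timeout makes the cluster notice a stalled node sooner, at the
cost of more false positives during long GC pauses.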

Why did prd-elastic-y report as having no indices when it was the only
node in the cluster?

This is due to minimum_master_nodes and the gateway recovery settings, I
assume. The node waits for the other expected nodes before recovering its
state (the "expecting [3] nodes" message).
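
A sketch of the relevant settings for a three-node cluster where all three
nodes are master-eligible (adjust to your actual topology):

# elasticsearch.yml
discovery.zen.minimum_master_nodes: 2   # quorum of the 3 master-eligible nodes
gateway.expected_nodes: 3               # the "expecting [3] nodes" in your log
gateway.recover_after_nodes: 2          # start recovery once 2 nodes have joined
gateway.recover_after_time: 5m          # the "delaying initial state recovery for [5m]" delay

Until that recovery kicks in, a freshly started node has an empty cluster
state, which would explain why prd-elastic-y reported no indices.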

How did the heap space exception end up causing networking problems
(failed to ping)? Or is that a separate problem entirely?

The network library (Netty) can only work with enough heap. It is not a
separate problem; the whole JVM stalls under a low-memory condition.

Is there anything we can do to prevent this from happening again, other than
throwing more hardware at the problem?

You have assigned more than 50% of the RAM to the heap. Why?
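
The usual guidance is to give the heap at most about half of the machine's
RAM and leave the rest to the OS filesystem cache. On an m2.xlarge with
17.1 GB, that would look roughly like this (illustrative values, assuming
the standard service packaging):

# /etc/default/elasticsearch (or /etc/sysconfig/elasticsearch)
ES_HEAP_SIZE=8g            # ~50% of 17.1 GB, replaces -Xms10247m -Xmx10247m

# elasticsearch.yml
bootstrap.mlockall: true   # lock the heap in RAM so it cannot be swapped out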

Have you sized your configuration by running enough load tests? You should.
Under a heavy load test you will encounter OOMs, and then you will know
exactly whether you need to add more nodes or not.

Jörg
