Background
We have a three-node cluster comprising prd-elastic-x, prd-elastic-y
and prd-elastic-z. Each box is an EC2 m2.xlarge with 17.1 GB of RAM.
Elasticsearch is run with the following java memory configuration:
java -server -Djava.net.preferIPv4Stack=true
-Des.config=/usr/local/etc/elasticsearch/elasticsearch.yml -Xms10247m
-Xmx10247m -Xss256k
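(For context: a ~10 GB heap on a 17.1 GB box is a bit above the commonly recommended 50% of RAM, the rest normally being left to the OS page cache that Lucene leans on. In 1.x packaging the heap is usually set via the ES_HEAP_SIZE environment variable rather than raw -Xms/-Xmx flags; a sketch, assuming a Debian-style defaults file — the path depends on packaging:)

```shell
# /etc/default/elasticsearch (path is an assumption; depends on packaging)
# ES_HEAP_SIZE sets -Xms and -Xmx to the same value for us.
ES_HEAP_SIZE=8g
```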
We run some very memory-intensive terms aggregations on fields with very
high cardinality (millions of documents, tens of thousands of distinct
values). We have had out-of-memory issues before, including run-ins with
the OS OOM killer, but yesterday we had a particularly bad outage.
This time there are no signs in the syslog that the OOM killer took any action.
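For context, the memory-hungry queries look roughly like the following (the index and field names here are made up for illustration, not our real schema); in 1.x a terms aggregation like this materialises the whole field's fielddata on the heap:

```shell
# Illustrative only: index/field names are assumptions.
# search_type=count skips hit collection, but the terms agg still
# loads the field's fielddata on-heap in ES 1.x.
curl -s 'localhost:9200/events/_search?search_type=count' -d '{
  "aggs": {
    "by_user": { "terms": { "field": "user_id", "size": 50000 } }
  }
}'
```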
Timeline
12:35
prd-elastic-x:
[2014-09-04 11:35:53,002][WARN ][monitor.jvm ] [prd-elastic-x]
[gc][old][1812644][175] duration [26.3s], collections [2]/[27.5s], total
[26.3s]/[1.3m], memory [7.6gb]->[4.7gb]/[9.9gb], all_pools {[young]
[129.9mb]->[37.5mb]/[133.1mb]}{[survivor] [16.6mb]->[0b]/[16.6mb]}{[old]
[7.5gb]->[4.7gb]/[9.8gb]}
12:36
prd-elastic-z:
[2014-09-04 11:36:02,809][WARN ][monitor.jvm ] [prd-elastic-z]
[gc][old][3019662][378] duration [34.9s], collections [2]/[36.1s], total
[34.9s]/[2.8m], memory [8.8gb]->[6.3gb]/[9.9gb], all_pools {[young]
[116.9mb]->[12.6mb]/[133.1mb]}{[survivor] [16.6mb]->[0b]/[16.6mb]}{[old]
[8.6gb]->[6.3gb]/[9.8gb]}
12:38
We get the first application error: "request has timed out".
We start investigating. prd-elastic-y reports the cluster state as
green, with all three nodes still present. However, attempts to
load debug information from endpoints such as /_cat/recovery hang, and we
continue to receive errors at the application level.
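(For anyone following along: putting a client-side timeout on the debug calls at least stops the investigation itself from blocking on a hung node; host and port here are assumptions:)

```shell
# --max-time bounds the whole request, so a wedged node can't hang the shell
curl -s --max-time 5 'http://localhost:9200/_cluster/health?pretty'
curl -s --max-time 5 'http://localhost:9200/_cat/recovery?v'
```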
We stop all non-critical application processes to try to reduce the load
on Elasticsearch, in the hope that it will recover.
12:41
A lot of errors start appearing in the logs for prd-elastic-z, including
but in no way limited to:
[2014-09-04 11:40:14,284][WARN
][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the
selector loop.
java.lang.OutOfMemoryError: Java heap space
12:43
We start seeing errors on prd-elastic-x as well, including:
[2014-09-04 11:43:51,440][WARN ][netty.channel.DefaultChannelPipeline] An
exception was thrown by a user handler while handling an exception event
([id: 0xe2f53491, /10.78.198.78:55953 => /10.79.75.106:9300] EXCEPTION:
java.lang.OutOfMemoryError: Java heap space)
java.lang.OutOfMemoryError: Java heap space
12:46
prd-elastic-x reports that it is unable to ping prd-elastic-z:
[2014-09-04 11:46:23,016][INFO ][discovery.ec2 ] [prd-elastic-x]
master_left
[[prd-elastic-z][nbnLdQjgS4KVe7rGx8eIWw][ip-10-76-41-241.ec2.internal][inet[/10.76.41.241:9300]]{aws_availability_zone=us-east-1d,
max_local_storage_nodes=1}], reason [failed to ping, tried [3] times, each
with maximum [30s] timeout]
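The [3] times / [30s] figures match the zen fault-detection defaults, which can be tuned in elasticsearch.yml (the values below are the 1.x defaults as we understand them; raising them would only paper over long GC pauses, not fix them):

```yaml
# elasticsearch.yml - master/node fault detection (1.x defaults, to our knowledge)
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3
```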
12:48
We decide to restart prd-elastic-z (gracefully, using service elasticsearch
restart), based on the failed-ping log message above.
prd-elastic-z:
[2014-09-04 11:48:54,186][INFO ][node ] [prd-elastic-z]
stopping ...
13:03
prd-elastic-z has still not stopped (it still appears in ps aux with a long
uptime), so we decide to kill it forcefully (kill -9).
13:06
The cluster is still misbehaving. We decide to restart the other two
nodes, starting with prd-elastic-y.
13:08
prd-elastic-y has restarted without any further intervention required, but
is reporting as the only node in the cluster, in the 'red' state, and with
no indices or shards.
13:09
We try to restart prd-elastic-x gracefully.
prd-elastic-x:
[2014-09-04 12:09:47,480][INFO ][node ] [prd-elastic-x]
stopping ...
13:22
prd-elastic-x has still not stopped, so again we are forced to kill -9
it. We then restart it manually using sudo service elasticsearch start.
prd-elastic-y:
[2014-09-04 12:23:14,108][INFO ][cluster.service ] [prd-elastic-y]
added
{[prd-elastic-x][OF5WaLTzRVG92z1Y7zMW2g][ip-10-79-75-106.ec2.internal][inet[/10.79.75.106:9300]]{aws_availability_zone=us-east-1d,
max_local_storage_nodes=1},}, reason: zen-disco-receive(join from
node[[prd-elastic-x][OF5WaLTzRVG92z1Y7zMW2g][ip-10-79-75-106.ec2.internal][inet[/10.79.75.106:9300]]{aws_availability_zone=us-east-1d,
max_local_storage_nodes=1}])
[2014-09-04 12:23:14,119][INFO ][gateway ] [prd-elastic-y]
delaying initial state recovery for [5m]. expecting [3] nodes, but only
have [2]
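The "expecting [3] nodes" / "[5m]" message comes from the gateway recovery settings; judging by the log line, ours amount to something like the following (values inferred from the log, not copied from our config):

```yaml
# elasticsearch.yml - hold off shard recovery until the cluster is whole
gateway.recover_after_nodes: 2    # start once this many nodes have joined...
gateway.recover_after_time: 5m    # ...but wait up to 5m for the rest
gateway.expected_nodes: 3         # recover immediately once all 3 are in
```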
13:24
We decide to restart prd-elastic-z. We were originally going to leave it
out of the cluster and bring in a completely new instance, since we had
killed it forcefully and were unsure of its data integrity, but given that
it fared no worse than prd-elastic-x, we figured it made no difference at
this point.
13:27
All three nodes are in the cluster, and recovery starts. The cluster
reports yellow almost immediately.
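(For reference, the health endpoint can block until a target status is reached, which is handy for watching a recovery like this one; the host is an assumption:)

```shell
# Returns when the cluster reaches green or the timeout elapses
curl -s 'localhost:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty'
```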
13:37
The cluster is green and 'healthy' again (or so we believe).
Having gone through the logs, we believe the root cause is the
"java.lang.OutOfMemoryError: Java heap space" exceptions that appear shortly
after the long garbage collections on prd-elastic-x and prd-elastic-z.
prd-elastic-y seemed 'stable' throughout, but was reporting the cluster
state incorrectly and then appeared to lose all of its data (or just its
cluster state; we are not sure which).
The logs for all nodes can be found here:
Questions
- Why is Elasticsearch allowed to get into this state? Is it poor
configuration on our part or a bug in the software?
- Why was prd-elastic-y reporting its state as 'green' even though the
cluster had seemingly failed?
- Why did prd-elastic-y report as having no indices when it was the only
node in the cluster?
- How did the heap space exception end up causing networking problems
(failed to ping)? Or is that a separate problem entirely?
- Is there anything we can do to prevent this happening again, other than
throwing more hardware at the problem?
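On the last question: aside from hardware, the fielddata circuit breaker and a bounded fielddata cache look like the relevant knobs (a sketch; note the breaker setting is named indices.fielddata.breaker.limit before 1.4 and indices.breaker.fielddata.limit from 1.4 on):

```yaml
# elasticsearch.yml - cap on-heap fielddata instead of letting it OOM the node
indices.fielddata.cache.size: 40%        # evict fielddata beyond 40% of heap
indices.fielddata.breaker.limit: 60%     # reject requests that would exceed 60%
```

Mapping the aggregated fields with doc_values: true (disk-backed fielddata) should also move this memory off-heap, at some query-speed cost.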
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ce8492a6-fe9c-4148-921e-2e599e97832d%40googlegroups.com.