We are seeing unexpected JVM crashes after an ElasticSearch cluster node is
restarted and rejoins the cluster. For load balancing and robustness we run
two ElasticSearch front-end servers that act as load balancers; in the
back-end there are three ElasticSearch worker nodes which hold all the data.
There are 5 shards with 2 replicas per shard. We have noticed that running
the cluster continuously leads to a gradual build-up in memory usage, load,
and query times. To combat this we restart the worker nodes twice per day.
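For clarity: the front-end servers are ordinary ElasticSearch nodes that are
configured to hold no data, so they only route requests to the workers. The
relevant elasticsearch.yml settings look roughly like the sketch below (a
simplified illustration of the setup, not a verbatim copy of our config):

  # Front-end (balancer) nodes: route requests, hold no data
  node.data: false
  node.master: false

  # Back-end worker nodes: hold the shards
  node.data: true

  # Index layout
  index.number_of_shards: 5
  index.number_of_replicas: 2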
The restart scenario is as follows (a sketch of the steps is shown below the list):
- Stop ElasticSearch on one of the worker nodes
- Wait 120 seconds
- Start ElasticSearch again on the same worker node
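In shell form, the procedure looks roughly like this (the init script path is
a placeholder for whatever starts ElasticSearch on the node; the health check
at the end is optional and uses the standard cluster health API):

  #!/bin/sh
  # Restart a single worker node (sketch; paths are illustrative).
  /etc/init.d/elasticsearch stop    # stop ElasticSearch on this worker
  sleep 120                         # wait 120 seconds
  /etc/init.d/elasticsearch start   # start ElasticSearch again
  # Wait until the cluster has fully recovered before touching the next node:
  curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m'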
What we see is that worker nodes sometimes suffer from a JVM crash after
the cluster state is updated. From the logs we see the following messages:
ElasticSearch log file:
[2012-08-08 08:29:30,918][DEBUG][cluster.service ] [worker2] processing
[zen-disco-receive(from master ... master=true}])]: done applying updated cluster_state
Wrapper log file:
STATUS | wrapper | 2012/08/08 08:29:56 | JVM received a signal UNKNOWN (6).
Signal 6 corresponds to SIGABRT, i.e. the JVM aborted itself. In the JVM
crash log files (hs_err_pid*.log) on the worker nodes we have seen two
different root causes after incidents.
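For completeness, these crash logs are written by HotSpot itself; their
location can be pinned down with standard HotSpot flags, roughly as below
(the paths are placeholders, not our actual configuration):

  # JVM options passed via the service wrapper (paths illustrative):
  -XX:ErrorFile=/var/log/elasticsearch/hs_err_%p.log
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/elasticsearch/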
These crashes are completely unexpected to us. Is there any measure we can
take to prevent these JVM crashes? If you need more information on our setup,
please do not hesitate to contact us.
Please find the individual component versions below.
$ /opt/java/jre1.7.0_04/bin/java -version
java version "1.7.0_04"
Java(TM) SE Runtime Environment (build 1.7.0_04-b20)
Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)
$ uname -a
Linux elastic29 2.6.32-5-xen-amd64 #1 SMP Sun May 6 08:57:29 UTC 2012