JVM crashes after reincorporating nodes into cluster


(Alex-5) #1

Dear all,

We are seeing unexpected JVM crashes after an ElasticSearch cluster node is
restarted and reincorporated into the cluster. For load balancing and
robustness we have two ElasticSearch front-end servers acting as balancers.
In the back-end there are three ElasticSearch worker nodes which hold all
the data. There are 5 shards with 2 replicas per shard. We notice that
running the cluster continuously results in a gradual build-up of memory
usage, load, and query time. To counter this we restart the nodes twice per day.

The restart scenario is the following:

  1. Stop ElasticSearch on one of the worker nodes via
    /etc/init.d/elasticsearch stop
  2. Wait 120 seconds
  3. Start ElasticSearch on the corresponding worker node via
    /etc/init.d/elasticsearch start
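The three steps above can be sketched as a small shell function. This is only a sketch of the procedure as described: the paths are taken from the post, but the trailing health-check loop (waiting for the node to answer on HTTP port 9200 before moving to the next worker) is an addition, not part of the original steps.

```shell
#!/bin/sh
# Sketch of the per-node restart procedure described above.
# The health-check loop at the end is an assumption/addition.

restart_worker() {
  /etc/init.d/elasticsearch stop    # 1. stop the node
  sleep 120                         # 2. give the cluster time to react
  /etc/init.d/elasticsearch start   # 3. start the node again
  # Wait until the node answers over HTTP before touching the next worker.
  until curl -s "http://localhost:9200" >/dev/null; do
    sleep 5
  done
}

# Only act when explicitly invoked with "run", so that sourcing this
# file for inspection has no side effects.
if [ "${1:-}" = "run" ]; then
  restart_worker
fi
```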

What we see is that worker nodes sometimes suffer from a JVM crash after
the cluster state is updated. From the logs we see the following messages:

ElasticSearch log file:

[2012-08-08 08:29:30,918][DEBUG][cluster.service ] [worker2]
processing [zen-disco-receive(from master
[[front1][D2gTiUeyRbS-Gj0rvpI4NQ][inet[/10.28.7.197:9300]]{data=false,
master=true}])]: done applying updated cluster_state

Wrapper log file:

STATUS | wrapper | 2012/08/08 08:29:56 | JVM received a signal UNKNOWN (6).

In the JVM crash log file on the worker nodes we have seen two different
root causes after an incident:

J org.elasticsearch.common.trove.impl.hash.TObjectHash.insertKey(Ljava/lang/Object;)I

or

J org.elasticsearch.common.io.stream.HandlesStreamOutput.writeUTF(Ljava/lang/String;)V

These crashes seem completely unexpected. Is there any measure we can take
to circumvent these JVM crashes? If you need more information on our setup
please do not hesitate to contact us.

Please find the individual component versions below.

Best Regards

  • Alex

Our stack:

  • elasticsearch-0.19.8

  • Java Version:
    $ /opt/java/jre1.7.0_04/bin/java -version
    java version "1.7.0_04"
    Java(TM) SE Runtime Environment (build 1.7.0_04-b20)
    Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)

  • OS Version:
    $ uname -a
    Linux elastic29 2.6.32-5-xen-amd64 #1 SMP Sun May 6 08:57:29 UTC 2012
    x86_64 GNU/Linux


alexander sennhauser
software engineer

+41 79 420 4953 mobile
lex@squirro.com
www.squirro.com


(Shay Banon) #2

Can you gist the crash file output of the JVM? Also, can you make sure that ES is using the relevant Java version and not a different one (using the nodes info API with the jvm flag)?
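For reference, the nodes info check Shay mentions can be done with curl. The `/_cluster/nodes` path and the `jvm=true` flag below are the 0.19-era form of the API as I recall it (later releases moved this to `/_nodes`), so treat the exact URL as an assumption to verify against your release:

```shell
# Build the nodes-info URL for a given host; the endpoint shape is the
# assumed 0.19-era nodes info API with the jvm flag.
nodes_jvm_url() {
  printf 'http://%s:9200/_cluster/nodes?jvm=true' "$1"
}

# Usage against a live node, e.g.:
#   curl -s "$(nodes_jvm_url worker2)"
# then compare the reported JVM fields with the expected 1.7.0_04 build.
```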



(Alex-5) #3

Thanks for the quick reply.

The JVM crash files can be found at https://gist.github.com/3294425.

The reported JVM version is being used by ElasticSearch.

Best Regards

  • Alex



(Alex-5) #4

Dear all,

Are there any insights on this incident that you can share with us? We are
still seeing repeated failures when a node is re-incorporated.

On some occasions we also see that the two front-end servers get out of
sync. One usually loses connectivity to all other nodes and, as a
consequence, its CPU usage goes through the roof.

Best Regards

  • Alex



(Ivan Brusic) #5

I am experiencing the same issue.

The cluster consisted of four 0.19.2 nodes. I was performing a rolling
upgrade to 0.19.8, one node at a time. During each restart, the cluster
would freeze and a heap dump would appear on the node that was just
restarted. Removing the hprof file fixed the issue. As of now, 3 nodes
have been updated and 1 has not. The cluster is still in a red state
after 3 restarts.

The hprof file is 6.7G.


(Ivan Brusic) #6

Forgot to include

$ java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

$ cat /etc/redhat-release
CentOS release 6.2 (Final)

$ uname -a
Linux srch-dv105 2.6.32-220.13.1.el6.x86_64 #1 SMP Tue Apr 17 23:56:34
BST 2012 x86_64 x86_64 x86_64 GNU/Linux

Using the service wrapper. Part of the upgrade process included
upgrading the wrapper as well.

Cheers,

Ivan



(Ivan Brusic) #7

Several restarts later, I am no longer experiencing JVM crashes or
stalled clusters.

The only change was updating the elasticsearch.conf file to the latest
version. I was previously using the new wrapper with an old config
file (mainly GC changes, the ES jar was not listed first, and the
Bootstrap class was used directly). My only modifications are the heap
size and es.path.conf.
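For reference, the kind of minimal overrides described here would look roughly like this in the service wrapper's elasticsearch.conf. The property names are assumed from the elasticsearch-servicewrapper project and the path is hypothetical, so treat this as a sketch rather than a verified configuration:

```
# Assumed elasticsearch-servicewrapper properties (sketch):
set.default.ES_HEAP_SIZE=4096
wrapper.java.additional.10=-Des.path.conf=/etc/elasticsearch
```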

One node is still running 0.19.2. I will replicate the upgrade
process, but I prefer to wait to see if there is some information I
should capture before upgrading.

Only other tidbit of information is that I try to set
cluster.routing.allocation.disable_allocation to true when restarting
a node, although I have not been consistent.
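Disabling allocation around a restart can be done through the cluster update settings API. The endpoint and the transient-settings payload below are the 0.19-era form as I recall it, so verify them against your release before relying on this:

```shell
# Build the transient-settings payload that toggles shard allocation.
# Setting name is taken from the post; the payload shape is the assumed
# 0.19-era cluster update settings API.
allocation_payload() {
  printf '{"transient":{"cluster.routing.allocation.disable_allocation":%s}}' "$1"
}

# Usage against a live node, e.g.:
#   curl -s -XPUT http://localhost:9200/_cluster/settings \
#        -d "$(allocation_payload true)"    # before stopping the node
#   curl -s -XPUT http://localhost:9200/_cluster/settings \
#        -d "$(allocation_payload false)"   # once the node has rejoined
```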

Ivan



(Ivan Brusic) #8

Sorry for thread-jacking, Alex, but I am curious whether my situation is
identical to yours.

Despite not a single query having been run since restarting the nodes, 2
nodes have their memory (10G) maxed out with lots of CPU activity.
Transport traffic is in constant action (around 2000, though I am not
sure what the unit is).

Nothing is occurring on this cluster. No indexing, no querying, nothing.

--
Ivan



(Andrew[.:at:.]DataFeedFile.com) #9

Gentlemen,

I operate a fairly large ES cluster: 3 front-end, load-balanced,
HTTP-only ES nodes plus 16 data nodes. We have been developing with and
learning ES for about 1.5 years now.

I just want to share our experiences and some similar pains we have
encountered, like the disconnects, sudden JVM crashes, and even data
corruption.

I highly recommend tackling the problem by learning more about your
queries; in our case we fixed the problem by modifying our queries.

We used the bigdesk plugin to examine the patterns and possible issues.

Example scenario 1:
We had a query that loaded too much field data cache. We identified this
by watching bigdesk and realized that the field data cache on all the
data nodes was filling up very fast, taking up 70-80% of our memory.
That forced us to double-check our code and fix the query, which solved
the problem.

Example scenario 2:
Another bad query caused CPU usage to peak at 90+% all the time, causing
garbage collection to be delayed or to fail, which eventually caused a
race condition and brought the node to its knees.

My point, simply: troubleshoot and fix your queries, or restructure your
data for more efficient search. If that does not fix it, consider adding
more nodes.

Regards,

--Andrew



(Joseph Smith) #10

This thread is about a JVM crash that occurs when a node is entering the
cluster (while the cluster is idle). The platitude "add more hardware" is
as off-topic as it is unhelpful.

Every node in my cluster died because of OOM (which was expected). I was
able to bring all of the machines in the cluster back except for one. Note
that the cluster is completely idle (no queries, no indexing).

When I try to bring the last machine back, the JVM crashes about 10 seconds
after restarting. I get the same JVM crash log as the OP
(org.elasticsearch.common.trove.impl.hash.TObjectHash.insertKey(Ljava/lang/Object;)I).

Any help regarding this crash would be appreciated.

Top of crash file:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007fbcccb40be4, pid=14593, tid=140429700744960

JRE version: 6.0_24-b24

Java VM: OpenJDK 64-Bit Server VM (20.0-b12 mixed mode linux-amd64 )

Derivative: IcedTea6 1.11.9

Distribution: CentOS release 6.4 (Final), package

rhel-1.57.1.11.9.el6_4-x86_64

Problematic frame:

J

org.elasticsearch.common.trove.impl.hash.TObjectHash.insertKey(Ljava/lang/Object;)I

If you would like to submit a bug report, please include

instructions how to reproduce the bug and visit:

http://icedtea.classpath.org/bugzilla

~joe!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Joseph Smith) #11

Sorry for the double post. I am also using the G1GC. Other threads hint
that using G1GC may be related to this issue.



(Joseph Smith) #12

After changing to -XX:+UseParNewGC -XX:+UseConcMarkSweepGC, the node
successfully joined the cluster without crashing. This is too bad, since
G1GC is much better suited to the RAM allocations we are using.
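For anyone scripting node launches, the two GC configurations discussed in this thread can be captured in a small helper. This is an illustrative sketch only; in that era the real flags belonged in `elasticsearch.in.sh` or the service-wrapper config, and the launch command below is hypothetical.

```python
def gc_flags(use_g1):
    """Return the JVM GC flags for the two configurations from this thread.

    Illustrative helper only: the actual settings go in your Elasticsearch
    startup script or service-wrapper configuration, not a Python wrapper.
    """
    if use_g1:
        return ["-XX:+UseG1GC"]
    # The workaround that let the node rejoin the cluster without crashing:
    return ["-XX:+UseParNewGC", "-XX:+UseConcMarkSweepGC"]

# Hypothetical launch command assembled with the CMS workaround.
cmd = ["java"] + gc_flags(use_g1=False) + ["-Xms8g", "-Xmx8g", "-jar", "es.jar"]
print(" ".join(cmd))
```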


(Jörg Prante) #13

I can confirm that under certain memory-pressure situations, trove4j maps
randomly crash with the G1 GC. The trove4j maps are mostly used in the
faceting module of ES.

One approach to lower the risk of SIGSEGV could be increasing the heap, but
I'm not sure.

Once I'm done testing a replacement of ES's trove4j with HPPC, there may
also be a more viable alternative: http://labs.carrotsearch.com/hppc.html

Jörg



(Joseph Smith) #14

Very cool. I look forward to trying HPPC ES.



(Jörg Prante) #15

FYI, thanks to Martijn van Groningen, Elasticsearch master has migrated
from trove4j to HPPC:

https://github.com/elasticsearch/elasticsearch/commit/088e05b36849f0659d3b20cd0376c1a89e2b426b

which is amazing news for those of us who'd like to give G1GC a try!

Jörg



(Ivan Brusic) #16

Cannot believe I am hijacking the same thread twice but ...

I use Trove in my non-elasticsearch code. Is the transition to HPCC
worthwhile if I have no issues?

I wonder if Martijn has done any extensive benchmarking. I assume he has. I
have been profiling my cluster's memory settings lately, and I have been
tempted to switch to G1GC.
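
For anyone else tempted to experiment: one way to try G1 is to pass the
standard HotSpot flags through the launcher's JVM-options hook before
starting a single node (a sketch only; whether ES_JAVA_OPTS is honored
depends on the Elasticsearch version and launcher, so treat the variable
name as an assumption):

```shell
# Standard HotSpot flags; try on one node first and watch the GC behavior.
# ES_JAVA_OPTS being picked up by bin/elasticsearch is an assumption here.
export ES_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
bin/elasticsearch
```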

Cheers,

Ivan

On Fri, Oct 4, 2013 at 11:15 AM, joergprante@gmail.com
<joergprante@gmail.com> wrote:

FYI thanks to Martijn van Groningen, Elasticsearch master migrated from
trove4j to HPPC

https://github.com/elasticsearch/elasticsearch/commit/088e05b36849f0659d3b20cd0376c1a89e2b426b

which is amazing news for those of us who like to give G1GC a try!

Jörg



(Jörg Prante) #17

Just a bit of background: HPPC was started by Dawid Weiss of Carrotsearch a
while ago; he is also connected to the Lucene project (for example, through
the finite automata implementation). HPPC began by picking up a subset of
the Colt primitive collections for Mahout
http://mahout.apache.org/

HPPC uses Murmur hashing and direct access to the underlying data
structures, so for mutable data structures, HPPC is very fast and
convenient.
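
For anyone curious what "direct access to the underlying data structures"
buys, here is a minimal sketch in plain Java of the open-addressing,
primitive-key style that HPPC-like libraries use. The class name, the fixed
capacity, and the missing resize logic are all simplifications of mine, not
HPPC's actual API: keys and values live in flat int[] arrays, a
Murmur3-style finalizer scrambles the hash bits, and lookups probe those
arrays directly with no boxed Integer objects.

```java
// Illustrative sketch only, not HPPC's API: an int->int map with open
// addressing and linear probing. Fixed power-of-two capacity, no resizing.
class IntIntSketchMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] assigned;
    private final int mask;

    IntIntSketchMap(int capacityPow2) {
        keys = new int[capacityPow2];
        values = new int[capacityPow2];
        assigned = new boolean[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // Murmur3 32-bit finalizer: spreads clustered key bits across the table.
    private static int mix(int k) {
        k ^= k >>> 16;
        k *= 0x85ebca6b;
        k ^= k >>> 13;
        k *= 0xc2b2ae35;
        k ^= k >>> 16;
        return k;
    }

    void put(int key, int value) {
        int slot = mix(key) & mask;
        // Linear probing: walk until we find the key or an empty slot.
        while (assigned[slot] && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        keys[slot] = key;
        values[slot] = value;
        assigned[slot] = true;
    }

    int getOrDefault(int key, int dflt) {
        int slot = mix(key) & mask;
        while (assigned[slot]) {
            if (keys[slot] == key) return values[slot];
            slot = (slot + 1) & mask;
        }
        return dflt;
    }
}
```

The point of the sketch is the memory layout: a boxed HashMap<Integer,
Integer> allocates an Entry object plus two Integer objects per mapping,
while everything above lives in three flat arrays the GC can scan cheaply.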

FWIW, Dawid Weiss put up a benchmark:
http://mail-archives.apache.org/mod_mbox/mahout-dev/201103.mbox/<AANLkTi=gbd-GFxpQu_86haq16-gikTrwwZ8x4Tp250jH@mail.gmail.com>

And HPPC's license is Apache, the same as the ES license. So besides the
performance aspects, ES will reuse a component of the Lucene software
stack, which is very good.

Jörg



(simonw-2) #18

huge +1 to this comment Joerg! I couldn't have phrased it better!

simon



(Ivan Brusic) #19

Yup, thanks for the comment. I did some research before you replied and
found the same benchmarks, including these in an interactive format:

http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkPut
http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkGetWithRemoved
http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkBigramCounting

In the past several years, I must have used every collection library
available. :-) In terms of ES, it would be great to finally test with G1GC.
For my own code, I don't know if I'll have time to test another lib!

Cheers,

Ivan



(system) #20