JVM crashes after reincorporating nodes into cluster


(Alex-5) #1

Dear all,

We are seeing unexpected JVM crashes after an ElasticSearch cluster node is
restarted and reincorporated into the cluster. For load balancing and
robustness we have two ElasticSearch front-end servers acting as balancers.
In the back-end there are three ElasticSearch worker nodes which hold all
the data. There are 5 shards with 2 replicas per shard. We notice that
running the cluster continuously results in a gradual build-up of memory
usage, load, and query time. To counter this we restart the nodes twice per day.

The restart scenario is the following:

  1. Stop ElasticSearch on one of the worker nodes via
    /etc/init.d/elasticsearch stop
  2. Wait 120 seconds
  3. Start ElasticSearch on the corresponding worker node via
    /etc/init.d/elasticsearch start
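The three steps above can be sketched as a small shell function. This is only a sketch of the procedure as described: the paths are taken from the post, but the trailing health-check loop (waiting for the node to answer on HTTP port 9200 before moving to the next worker) is an addition, not part of the original steps.

```shell
#!/bin/sh
# Sketch of the per-node restart procedure described above.
# The health-check loop at the end is an assumption/addition.

restart_worker() {
  /etc/init.d/elasticsearch stop    # 1. stop the node
  sleep 120                         # 2. give the cluster time to react
  /etc/init.d/elasticsearch start   # 3. start the node again
  # Wait until the node answers over HTTP before touching the next worker.
  until curl -s "http://localhost:9200" >/dev/null; do
    sleep 5
  done
}

# Only act when explicitly invoked with "run", so that sourcing this
# file for inspection has no side effects.
if [ "${1:-}" = "run" ]; then
  restart_worker
fi
```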

What we see is that worker nodes sometimes suffer from a JVM crash after
the cluster state is updated. From the logs we see the following messages:

ElasticSearch log file:

[2012-08-08 08:29:30,918][DEBUG][cluster.service ] [worker2]
processing [zen-disco-receive(from master
[[front1][D2gTiUeyRbS-Gj0rvpI4NQ][inet[/10.28.7.197:9300]]{data=false,
master=true}])]: done applying updated cluster_state

Wrapper log file:

STATUS | wrapper | 2012/08/08 08:29:56 | JVM received a signal UNKNOWN (6).

In the JVM crash log file on the worker nodes we have seen two different
root causes after an incident:

J org.elasticsearch.common.trove.impl.hash.TObjectHash.insertKey(Ljava/lang/Object;)I

or

J org.elasticsearch.common.io.stream.HandlesStreamOutput.writeUTF(Ljava/lang/String;)V

These crashes seem completely unexpected. Is there any measure we can take
to circumvent these JVM crashes? If you need more information on our setup
please do not hesitate to contact us.

Please find the individual component versions below.

Best Regards

  • Alex

Our stack:

  • elasticsearch-0.19.8

  • Java Version:
    $ /opt/java/jre1.7.0_04/bin/java -version
    java version "1.7.0_04"
    Java(TM) SE Runtime Environment (build 1.7.0_04-b20)
    Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)

  • OS Version:
    $ uname -a
    Linux elastic29 2.6.32-5-xen-amd64 #1 SMP Sun May 6 08:57:29 UTC 2012
    x86_64 GNU/Linux


alexander sennhauser
software engineer

+41 79 420 4953 mobile
lex@squirro.com
www.squirro.com


(Shay Banon) #2

Can you gist the crash file output of the JVM? Also, can you make sure that ES is using the relevant Java version and not a different one (using the nodes info API with the jvm flag)?
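For reference, the nodes info check Shay mentions can be done with curl. The `/_cluster/nodes` path and the `jvm=true` flag below are the 0.19-era form of the API as I recall it (later releases moved this to `/_nodes`), so treat the exact URL as an assumption to verify against your release:

```shell
# Build the nodes-info URL for a given host; the endpoint shape is the
# assumed 0.19-era nodes info API with the jvm flag.
nodes_jvm_url() {
  printf 'http://%s:9200/_cluster/nodes?jvm=true' "$1"
}

# Usage against a live node, e.g.:
#   curl -s "$(nodes_jvm_url worker2)"
# then compare the reported JVM fields with the expected 1.7.0_04 build.
```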



(Alex-5) #3

Thanks for the quick reply.

The JVM crash files can be found at https://gist.github.com/3294425.

The reported JVM version is being used by ElasticSearch.

Best Regards

  • Alex



(Alex-5) #4

Dear all,

Are there any insights on this incident that you can share with us? We are
still seeing repeated failures when a node is re-incorporated.

On some occasions we also see that the two front-end servers get out of
sync. One usually loses connectivity to all other nodes and, as a
consequence, its CPU usage goes through the roof.

Best Regards

  • Alex



(Ivan Brusic) #5

I am experiencing the same issue.

The cluster consisted of four 0.19.2 nodes. I was performing a rolling
upgrade to 0.19.8, one node at a time. During each restart, the cluster
would freeze and a heap dump would appear on the node that was just
restarted. Removing the hprof file fixed the issue. As of now, 3 nodes
have been updated and 1 has not. The cluster is still in a red state
after 3 restarts.

The hprof file is 6.7G.


(Ivan Brusic) #6

Forgot to include

$ java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

$ cat /etc/redhat-release
CentOS release 6.2 (Final)

$ uname -a
Linux srch-dv105 2.6.32-220.13.1.el6.x86_64 #1 SMP Tue Apr 17 23:56:34
BST 2012 x86_64 x86_64 x86_64 GNU/Linux

Using the service wrapper. Part of the upgrade process included
upgrading the wrapper as well.

Cheers,

Ivan



(Ivan Brusic) #7

Several restarts later, I am no longer experiencing JVM crashes or
stalled clusters.

The only change was updating the elasticsearch.conf file to the latest
version. I was previously using the new wrapper with an old config
file (mainly GC changes, the ES jar was not listed first, and the
Bootstrap class was used directly). My only modifications are the heap
size and es.path.conf.
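For reference, the kind of minimal overrides described here would look roughly like this in the service wrapper's elasticsearch.conf. The property names are assumed from the elasticsearch-servicewrapper project and the path is hypothetical, so treat this as a sketch rather than a verified configuration:

```
# Assumed elasticsearch-servicewrapper properties (sketch):
set.default.ES_HEAP_SIZE=4096
wrapper.java.additional.10=-Des.path.conf=/etc/elasticsearch
```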

One node is still running 0.19.2. I will replicate the upgrade
process, but I prefer to wait to see if there is some information I
should capture before upgrading.

Only other tidbit of information is that I try to set
cluster.routing.allocation.disable_allocation to true when restarting
a node, although I have not been consistent.
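Disabling allocation around a restart can be done through the cluster update settings API. The endpoint and the transient-settings payload below are the 0.19-era form as I recall it, so verify them against your release before relying on this:

```shell
# Build the transient-settings payload that toggles shard allocation.
# Setting name is taken from the post; the payload shape is the assumed
# 0.19-era cluster update settings API.
allocation_payload() {
  printf '{"transient":{"cluster.routing.allocation.disable_allocation":%s}}' "$1"
}

# Usage against a live node, e.g.:
#   curl -s -XPUT http://localhost:9200/_cluster/settings \
#        -d "$(allocation_payload true)"    # before stopping the node
#   curl -s -XPUT http://localhost:9200/_cluster/settings \
#        -d "$(allocation_payload false)"   # once the node has rejoined
```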

Ivan



(Ivan Brusic) #8

Sorry for thread-jacking, Alex, but I am curious whether my situation is
identical to yours.

Despite not a single query having been run since restarting the nodes, 2
nodes have their memory (10G) maxed out with lots of CPU activity.
Transport traffic is in constant action (around 2000, though I am not
sure what the unit is).

Nothing is occurring on this cluster. No indexing, no querying, nothing.

--
Ivan



(Andrew[.:at:.]DataFeedFile.com) #9

Gentlemen,

I operate a fairly large ES cluster: 3 front-end, load-balanced,
HTTP-only ES nodes plus 16 data nodes. We have been developing with and
learning ES for about 1.5 years now.

I just want to share our experiences and some similar pains we have
encountered, like the disconnects, sudden JVM crashes, and even data
corruption.

I highly recommend tackling the problem by learning more about your
queries; in our case we fixed the problem by modifying our queries.

We used the bigdesk plugin to examine the patterns and possible issues.

Example scenario 1:
We had a query that loaded too much field data cache. We identified this
by watching bigdesk and realized that the field data cache on all the
data nodes was filling up very fast, taking up 70-80% of our memory.
That forced us to double-check our code and fix the query, which solved
the problem.

Example scenario 2:
Another bad query caused CPU usage to peak at 90+% all the time, causing
garbage collection to be delayed or to fail, which eventually caused a
race condition and brought the node to its knees.

My point, simply: troubleshoot and fix your queries, or restructure your
data for more efficient search. If that does not fix it, consider adding
more nodes.

Regards,

--Andrew



(Joseph Smith) #10

This thread is about a JVM crash that occurs when a node is entering the
cluster (while the cluster is idle). The platitude "add more hardware" is
as off-topic as it is unhelpful.

Every node in my cluster died because of OOM (which was expected). I was
able to bring all of the machines in the cluster back except for one. Note
that the cluster is completely idle (no queries, no indexing).

When I try to bring the last machine back, the JVM crashes about 10 seconds
after restarting. I get the same JVM crash log as the OP
(org.elasticsearch.common.trove.impl.hash.TObjectHash.insertKey(Ljava/lang/Object;)I).

Any help regarding this crash would be appreciated.

Top of crash file:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007fbcccb40be4, pid=14593, tid=140429700744960

JRE version: 6.0_24-b24

Java VM: OpenJDK 64-Bit Server VM (20.0-b12 mixed mode linux-amd64 )

Derivative: IcedTea6 1.11.9

Distribution: CentOS release 6.4 (Final), package

rhel-1.57.1.11.9.el6_4-x86_64

Problematic frame:

J

org.elasticsearch.common.trove.impl.hash.TObjectHash.insertKey(Ljava/lang/Object;)I

If you would like to submit a bug report, please include

instructions how to reproduce the bug and visit:

http://icedtea.classpath.org/bugzilla

~joe!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Joseph Smith) #11

Sorry for the double post. I am also using the G1GC. Other threads hint
that using G1GC may be related to this issue.



(Joseph Smith) #12

After changing to -XX:+UseParNewGC -XX:+UseConcMarkSweepGC, the node
successfully joined the cluster without crashing. This is too bad, since
G1GC is much better suited to the RAM allocations we are using.
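For anyone scripting node launches, the two GC configurations discussed in this thread can be captured in a small helper. This is an illustrative sketch only; in that era the real flags belonged in `elasticsearch.in.sh` or the service-wrapper config, and the launch command below is hypothetical.

```python
def gc_flags(use_g1):
    """Return the JVM GC flags for the two configurations from this thread.

    Illustrative helper only: the actual settings go in your Elasticsearch
    startup script or service-wrapper configuration, not a Python wrapper.
    """
    if use_g1:
        return ["-XX:+UseG1GC"]
    # The workaround that let the node rejoin the cluster without crashing:
    return ["-XX:+UseParNewGC", "-XX:+UseConcMarkSweepGC"]

# Hypothetical launch command assembled with the CMS workaround.
cmd = ["java"] + gc_flags(use_g1=False) + ["-Xms8g", "-Xmx8g", "-jar", "es.jar"]
print(" ".join(cmd))
```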


(Jörg Prante) #13

I can confirm that under certain memory-pressure situations, trove4j maps
randomly crash with the G1 GC. The trove4j maps are mostly used in the
faceting module of ES.

One approach to lower the risk of SIGSEGV could be increasing the heap, but
I'm not sure.

Once I'm done testing a replacement of ES's trove4j with HPPC, there may
also be a more viable alternative: http://labs.carrotsearch.com/hppc.html

Jörg



(Joseph Smith) #14

Very cool. I look forward to trying HPPC ES.



(Jörg Prante) #15

FYI, thanks to Martijn van Groningen, Elasticsearch master has migrated
from trove4j to HPPC:

https://github.com/elasticsearch/elasticsearch/commit/088e05b36849f0659d3b20cd0376c1a89e2b426b

which is amazing news for those of us who'd like to give G1GC a try!

Jörg



(Ivan Brusic) #16

Cannot believe I am hijacking the same thread twice but ...

I use Trove in my non-elasticsearch code. Is the transition to HPCC
worthwhile if I have no issues?

I wonder if Martijn has done any extensive benchmarking. I assume he has. I
have been profiling my cluster's memory settings lately, and I have been
tempted to switch to G1GC.
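
For anyone else tempted to experiment: one way to try G1 is to pass the
standard HotSpot flags through the launcher's JVM-options hook before
starting a single node (a sketch only; whether ES_JAVA_OPTS is honored
depends on the Elasticsearch version and launcher, so treat the variable
name as an assumption):

```shell
# Standard HotSpot flags; try on one node first and watch the GC behavior.
# ES_JAVA_OPTS being picked up by bin/elasticsearch is an assumption here.
export ES_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
bin/elasticsearch
```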

Cheers,

Ivan

On Fri, Oct 4, 2013 at 11:15 AM, joergprante@gmail.com
<joergprante@gmail.com> wrote:

FYI thanks to Martijn van Groningen, Elasticsearch master migrated from
trove4j to HPPC

https://github.com/elasticsearch/elasticsearch/commit/088e05b36849f0659d3b20cd0376c1a89e2b426b

which is amazing news for those of us who like to give G1GC a try!

Jörg



(Jörg Prante) #17

Just a bit of background: HPPC was started by Dawid Weiss of Carrotsearch a
while ago; he is also connected to the Lucene project (for example, through
the finite automata implementation). HPPC began by picking up a subset of
the Colt primitive collections for Mahout
http://mahout.apache.org/

HPPC uses Murmur hashing and direct access to the underlying data
structures, so for mutable data structures, HPPC is very fast and
convenient.
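
For anyone curious what "direct access to the underlying data structures"
buys, here is a minimal sketch in plain Java of the open-addressing,
primitive-key style that HPPC-like libraries use. The class name, the fixed
capacity, and the missing resize logic are all simplifications of mine, not
HPPC's actual API: keys and values live in flat int[] arrays, a
Murmur3-style finalizer scrambles the hash bits, and lookups probe those
arrays directly with no boxed Integer objects.

```java
// Illustrative sketch only, not HPPC's API: an int->int map with open
// addressing and linear probing. Fixed power-of-two capacity, no resizing.
class IntIntSketchMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] assigned;
    private final int mask;

    IntIntSketchMap(int capacityPow2) {
        keys = new int[capacityPow2];
        values = new int[capacityPow2];
        assigned = new boolean[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // Murmur3 32-bit finalizer: spreads clustered key bits across the table.
    private static int mix(int k) {
        k ^= k >>> 16;
        k *= 0x85ebca6b;
        k ^= k >>> 13;
        k *= 0xc2b2ae35;
        k ^= k >>> 16;
        return k;
    }

    void put(int key, int value) {
        int slot = mix(key) & mask;
        // Linear probing: walk until we find the key or an empty slot.
        while (assigned[slot] && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        keys[slot] = key;
        values[slot] = value;
        assigned[slot] = true;
    }

    int getOrDefault(int key, int dflt) {
        int slot = mix(key) & mask;
        while (assigned[slot]) {
            if (keys[slot] == key) return values[slot];
            slot = (slot + 1) & mask;
        }
        return dflt;
    }
}
```

The point of the sketch is the memory layout: a boxed HashMap<Integer,
Integer> allocates an Entry object plus two Integer objects per mapping,
while everything above lives in three flat arrays the GC can scan cheaply.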

FWIW, Dawid Weiss put up a benchmark:
http://mail-archives.apache.org/mod_mbox/mahout-dev/201103.mbox/<AANLkTi=gbd-GFxpQu_86haq16-gikTrwwZ8x4Tp250jH@mail.gmail.com>

And HPPC's license is Apache, the same as the ES license. So besides the
performance aspects, ES will reuse a component of the Lucene software
stack, which is very good.

Jörg



(simonw-2) #18

huge +1 to this comment Joerg! I couldn't have phrased it better!

simon



(Ivan Brusic) #19

Yup, thanks for the comment. I did some research before you replied and
found the same benchmarks, including these in an interactive format:

http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkPut
http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkGetWithRemoved
http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkBigramCounting

In the past several years, I must have used every collection library
available. :-) In terms of ES, it would be great to finally test with G1GC.
For my own code, I don't know if I'll have time to test another lib!

Cheers,

Ivan



(system) #20