When one node goes down, memory usage jumps several gigabytes on other nodes


(Grant Rodgers) #1

I'm seeing strange behavior with my cluster of 3 nodes. It seems like
when one node drops out, the other nodes suddenly spike in cpu and
memory usage after a short time. We are using ES 0.10.0 with a 7GB
heap.

We have an intensive backup task on one node that causes high iowait,
and occasionally that node drops out of the cluster. To avoid split
brain, we started shutting that node down during the backup task.
This shouldn't cause any big activity elsewhere in the cluster since
all our shards are fully replicated on each node. The third node comes
back after an hour or so.

A short time after the node shuts down, the other two nodes suddenly
see a significant cpu and load spike, and memory usage increases
significantly. One node saw a 2GB jump, the other a 4GB jump.
Interestingly, the increases happen 20-30 minutes after the shutdown
node leaves the cluster, and appear to coincide with garbage
collection.

Here are munin graphs for elasticsearch memory usage to illustrate the
problem:
master node: http://i.imgur.com/8GTkb.png
non-master node that was not stopped: http://i.imgur.com/MIVPZ.png
non-master node that was stopped: http://i.imgur.com/jkuYg.png

(the gaps for the nodes that were not stopped are due to the munin
collector being unable to

As you can see in the graphs, this newly allocated memory is not
garbage collected. The next day, when the node is shut down again, the
accompanying memory spike hits the heap limit on both remaining nodes.
Additionally, one CPU is pegged at 100% for several hours, probably
due to GC storms, and likely exacerbated by elasticsearch trying to
recover from hitting the memory limit. Once the limit is hit, lots of
buffer-overrun and XML-parsing exceptions start being thrown.

Shay, can you shed any light on this issue?

Thanks,
Grant


(Grant Rodgers) #2

Oops, just noticed there is an unfinished thought in the previous
message. It should be:

(the data gaps in the graphs for the nodes that were not stopped are
due to the elasticsearch jmx endpoint not responding)



(Shay Banon) #3

Hi,

First, a clarification. When you shut down the third node, shards that
were allocated on it will be reallocated to the other two nodes. This is
done to maintain the required number of replicas. For example, if you
have an index with 3 shards and 1 replica, and you have 3 nodes, then each
node will hold 2 shards. If you shut down the third node, you will then have
2 nodes with 3 shards each (and if you shut down the second node, you will
still have 1 node with only 3 shards, since it makes no sense to allocate a
replica of a shard on the same node as its primary).
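
To make the copy-count arithmetic above concrete, here is a minimal sketch (not Elasticsearch code; the even-spread assumption is mine) of how many shard copies land on each node:

```python
import math

def copies_per_node(primaries, replicas, nodes):
    """Approximate shard copies per node, assuming even allocation.

    A node never holds two copies of the same shard, so the usable
    number of copies of each shard is capped at the node count.
    """
    effective_copies = primaries * min(1 + replicas, nodes)
    return math.ceil(effective_copies / nodes)

# The example above: an index with 3 shards and 1 replica.
assert copies_per_node(3, 1, 3) == 2  # 3 nodes -> 2 shards each
assert copies_per_node(3, 1, 2) == 3  # one node down -> 3 shards each
assert copies_per_node(3, 1, 1) == 3  # last node: replicas can't share a node with their primaries
```

The cap in `min(1 + replicas, nodes)` is why shutting down the second node leaves only 3 shards, not 6, on the survivor.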

What you see is very interesting. First, if you bring that node back
after an hour, the shards should be reallocated and you will again end up
with 2 shards per node, so memory usage should certainly not stay elevated.
You say that you see the increase 30 minutes after you killed the node. Can
you check the logs and see whether those 30 minutes are the time it took to
discover that the node was removed (because of long discovery fault-detection
settings)? Those 30 minutes might also be the time it took to fully recover
the reallocated shards into a fully operational state. This will help narrow
down the problem.

If you want to know more about how the cluster looks, you can check the
cluster state API; it will tell you which nodes are in the cluster, which
indices exist, and which shards are allocated on which node.
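
As a sketch of using that API, here is one way to pull the cluster state and count allocated shard copies per node. The helper names are hypothetical, and the JSON layout follows the modern `_cluster/state` routing_table shape, which may differ in a 0.10-era cluster:

```python
import json
from urllib.request import urlopen

def fetch_cluster_state(url="http://localhost:9200/_cluster/state"):
    """GET the cluster state API (adjust host/port for your cluster)."""
    with urlopen(url) as resp:
        return json.load(resp)

def shards_per_node(state):
    """Count allocated shard copies per node id in a cluster-state dict."""
    counts = {}
    for index in state["routing_table"]["indices"].values():
        for copies in index["shards"].values():
            for copy in copies:
                node = copy.get("node")
                if node:  # unassigned copies carry no node id
                    counts[node] = counts.get(node, 0) + 1
    return counts

# Usage against a live cluster:
#   print(shards_per_node(fetch_cluster_state()))
```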

The best and quickest way for me to help is if you can take a heap dump
when this spike happens (even for the first one). It's a bit of a pain to
get, but it's the best way for me to check where the memory problem is. To
get a heap dump, you can use the latest VisualVM (
https://visualvm.dev.java.net/), run it against the JVM in question
(simplest is to run it on the same box, where it will autodetect the
process), and take a heap dump. You can also use jmap with the -dump option (
http://download.oracle.com/javase/6/docs/technotes/tools/share/jmap.html).

Once you have a heap dump, then comes the interesting bit: tar-gzip it and
somehow provide me with a link to download it ;).

I would love to help solve this problem you are having. I am here to
help.

-shay.banon



(Grant Rodgers) #4

All of our indices have number_of_replicas 2, so every shard is
replicated to every node (we have 3 nodes and 3 copies of each shard).
So when a node goes down, no shards should be reallocated.

I can say for sure that the shutdown node's removal is discovered
immediately. Here is a segment of the log from one node (the non-master
node) when this error occurs. The node is shut down at 11:00:

[11:00:02,263][INFO ][cluster.service ] [Darkdevil] removed {[Cassiopea][d51ef323-99c6-4f2d-b2ff-8b444294c2fa][inet[/10.102.9.169:9300]],}, reason: zen-disco-receive(from [[Inertia][d83bddd4-a83f-45b5-863b-3aea1bf96722][inet[/10.102.19.139:9300]]])
[11:33:42,705][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [10.6s], breached threshold [10s]
[11:34:05,341][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [10s], breached threshold [10s]
[11:34:18,076][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [12.3s], breached threshold [10s]
[11:34:31,183][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [11s], breached threshold [10s]
[11:34:47,133][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [15.6s], breached threshold [10s]
[11:36:11,656][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [1.3m], breached threshold [10s]
[11:36:50,326][WARN ][transport.netty ] [Darkdevil] Exception caught on netty layer [[id: 0x242110fc, /10.102.43.160:49037 => /10.102.19.139:9300]]
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3221)
    at java.lang.String.<init>(String.java:233)
    at org.elasticsearch.common.io.stream.StreamInput.readUTF(StreamInput.java:183)
    at org.elasticsearch.common.io.stream.HandlesStreamInput.readUTF(HandlesStreamInput.java:49)
    at org.elasticsearch.monitor.jvm.JvmInfo.readFrom(JvmInfo.java:206)
    at org.elasticsearch.monitor.jvm.JvmInfo.readJvmInfo(JvmInfo.java:199)
    at org.elasticsearch.action.admin.cluster.node.info.NodeInfo.readFrom(NodeInfo.java:207)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:124)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:102)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)

I will try to get a heap dump and put it on S3 for you to download.
Stay tuned.

Grant



(Shay Banon) #5

That's good! You are right, no allocation should happen in this case, and
hopefully that will make it easier to resolve. The heap dump should have
produced a heap.prof file (or something similar), either in ES_HOME/work/heap
or in the working directory of the process, so there is no need to generate one.

-shay.banon

On Wed, Sep 15, 2010 at 11:22 PM, Grant Rodgers grantr@gmail.com wrote:

All of our indices have number_of_replicas 2, so every shard is
replicated to every node (we have 3 nodes and 3 copies of each shard).
So when a node goes down, no shards should be reallocated.

I can say for sure that the shutdown node is discovered and removed
immediately. Here is a segment of the log from one node when this
error occurs (this is the non-master node). The node is shutdown at
11:

[11:00:02,263][INFO ][cluster.service ] [Darkdevil] removed
{[Cassiopea][d51ef323-99c6-4f2d-b2ff-8b444294c2fa][inet[/
10.102.9.169:9300]],}, reason: zen-disco-receive(from [[Inertia]
[d83bddd4-a83f-45b5-863b-3aea1bf96722][inet[/10.102.19.139:9300]]])
[11:33:42,705][WARN ][monitor.jvm ] [Darkdevil] Long GC
collection occurred, took [10.6s], breached threshold [10s]
[11:34:05,341][WARN ][monitor.jvm ] [Darkdevil] Long GC
collection occurred, took [10s], breached threshold [10s]
[11:34:18,076][WARN ][monitor.jvm ] [Darkdevil] Long GC
collection occurred, took [12.3s], breached threshold [10s]
[11:34:31,183][WARN ][monitor.jvm ] [Darkdevil] Long GC
collection occurred, took [11s], breached threshold [10s]
[11:34:47,133][WARN ][monitor.jvm ] [Darkdevil] Long GC
collection occurred, took [15.6s], breached threshold [10s]
[11:36:11,656][WARN ][monitor.jvm ] [Darkdevil] Long GC
collection occurred, took [1.3m], breached threshold [10s]
[11:36:50,326][WARN ][transport.netty ] [Darkdevil] Exception
caught on netty layer [[id: 0x242110fc, /10.102.43.160:49037 => /
10.102.19.139:9300]]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3221)
at java.lang.String.(String.java:233)
at
org.elasticsearch.common.io.stream.StreamInput.readUTF(StreamInput.java:
183)
at

org.elasticsearch.common.io.stream.HandlesStreamInput.readUTF(HandlesStreamInput.java:
49)
at org.elasticsearch.monitor.jvm.JvmInfo.readFrom(JvmInfo.java:
206)
at
org.elasticsearch.monitor.jvm.JvmInfo.readJvmInfo(JvmInfo.java:199)
at

org.elasticsearch.action.admin.cluster.node.info.NodeInfo.readFrom(NodeInfo.java:
207)
at

org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:
124)
at

org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:
102)
at

org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:
80)

I will try to get a heap dump and put it on s3 for you to download.
Stay tuned.

Grant

On Sep 15, 12:07 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Hi,

First a clarification. When you shut down the third node, shards that
were allocated on it will get reallocated on the other two nodes. This is
done to maintain the the required number of replicas parameter. For
example,
if you have a 3 shards with 1 replica index, and you have 3 nodes, then
each
node will have 2 shards. If you shutdown the third node, you will then
have
2 nodes with 3 shards (and if you shutdown the second node, you will
still
have 1 node with 3 shards, since it makes no sense to allocate a replica
of
a shard on the same node).

What you see is very interesting. First, if after one hour you bring
that
node back, then shards should get reallocated and you will end up again
with
2 shards per node, so memory usage should certainly not increase. You say
that you see the increase 30 minutes after you killed the node. Can you
check the logs and see if this 30 minutes are the time it took to
discover
the node got removed (because of long discovery fault detection
settings)?
Those 30 minutes might also be the time it took to fully recover the
reallocated shards until they get into a fully operational mode. This
will
help a bit to narrow down the problem.

If you want to know more on how the cluster looks like, you can check
the
cluster state API, it will tell you which nodes are in the cluster, which
indices, and which shards are allocated on which node.

The best and quickest way for me to try and help is if you can issue a
heap dump when this spike happens (even for the first one). Its a bit of
a
pain to get it, but thats the best way for me to check where the memory
problem is. To get a heap dump, you can use the latest visualvm (
https://visualvm.dev.java.net/), run it against the JVM in question
(simplest is to run it on the same box, it will autodetect it), and issue
a
heap dump. You can also use jmap with the -dump option (
http://download.oracle.com/javase/6/docs/technotes/tools/share/jmap.html).

Once you have a heap dump, then comes the interesting bit. Tar gz it
and
somehow provide me with a link to download it ;).

I would love to help solve this problem you are having. I am here to
help.

-shay.banon

On Wed, Sep 15, 2010 at 8:26 PM, Grant Rodgers gra...@gmail.com wrote:

Oops, just noticed there is an unfinished thought in the previous
message. It should be:

(the data gaps in the graphs for the nodes that were not stopped are
due to the elasticsearch jmx endpoint not responding)

On Sep 15, 11:17 am, Grant Rodgers gra...@gmail.com wrote:

I'm seeing strange behavior with my cluster of 3 nodes. It seems like
when one node drops out, the other nodes suddenly spike in cpu and
memory usage after a short time. We are using ES 0.10.0 with a 7GB
heap.

We have an intensive backup task on one node that causes high iowait,
and occasionally that node drops out of the cluster. To avoid split
brain, we started shutting that node down during the back up task.
This shouldn't cause any big activity elsewhere in the cluster since
all our shards are fully replicated on each node. The third node
comes

back after an hour or so.

A short time after the node shuts down, the other two nodes suddenly
see a significant cpu and load spike, and memory usage increases
significantly. One node saw a 2GB jump, the other a 4GB jump.
Interestingly, the increases happen 20-30 minutes after the shutdown
node leaves the cluster, and appear to coincide with garbage
collection.

Here are munin graphs for elasticsearch memory usage to illustrate
the

problem:
master node:http://i.imgur.com/8GTkb.png
non-master node that was not stopped:http://i.imgur.com/MIVPZ.png
non-master node that was stopped::http://i.imgur.com/jkuYg.png

(the gaps for the nodes that were not stopped are due to the munin
collector being unable to

As you can see in the graphs, this newly allocated memory is not
garbage collected. The next day when the node is shutdown again, the
accompanying memory spike hits the limit on both remaining nodes.
Additionally, one cpu is pegged at 100% for several hours probably
due

to GC storms, but it is probably exacerbated by elasticsearch trying
to recover from hitting memory limits. Once the memory limit is hit
lots of buffer overrun and xml parsing exceptions start being thrown.

Shay, can you shed any light on this issue?

Thanks,
Grant


(Grant Rodgers) #6

Update: this actually had nothing to do with a node going down, or IO
performance on the node; that was just a coincidence. The real reason
was that we had several quite massive and poorly optimized queries
scheduled to run at that time (unbeknownst to me). ES still
shouldn't run out of memory trying to process poorly written queries,
but at least the scope of the problem is smaller!

Grant



(Shay Banon) #7

Hi,

Yes, I improved the caching of filters and field facets in the upcoming
0.11, so they should now be reclaimed more properly.

-shay.banon

On Fri, Sep 24, 2010 at 1:35 AM, Grant Rodgers grantr@gmail.com wrote:

Update: this actually had nothing to do with a node going down, or io
performance on the node. That was just a coincidence. The real reason
was that we had several quite massive and poorly optimized queries
scheduled to be run at that time (unbeknownst to me). ES still
shouldn't run out of memory trying to process poorly written queries,
but at least the scope of the problem is smaller!

Grant

On Sep 15, 2:37 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Thats good!. You are right, no allocation should happen in this case and
hopefully it will be easier to resolve. The heap dump should have caused
a
heap.prof file (or something similar) created, either in
ES_HOME/work/heap
or in the working dir of the process. So no need to generate one.

-shay.banon

On Wed, Sep 15, 2010 at 11:22 PM, Grant Rodgers gra...@gmail.com
wrote:

All of our indices have number_of_replicas 2, so every shard is
replicated to every node (we have 3 nodes and 3 copies of each shard).
So when a node goes down, no shards should be reallocated.

I can say for sure that the shutdown node is discovered and removed
immediately. Here is a segment of the log from one node when this
error occurs (this is the non-master node). The node is shutdown at
11:

[11:00:02,263][INFO ][cluster.service ] [Darkdevil] removed {[Cassiopea][d51ef323-99c6-4f2d-b2ff-8b444294c2fa][inet[/10.102.9.169:9300]],}, reason: zen-disco-receive(from [[Inertia][d83bddd4-a83f-45b5-863b-3aea1bf96722][inet[/10.102.19.139:9300]]])
[11:33:42,705][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [10.6s], breached threshold [10s]
[11:34:05,341][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [10s], breached threshold [10s]
[11:34:18,076][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [12.3s], breached threshold [10s]
[11:34:31,183][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [11s], breached threshold [10s]
[11:34:47,133][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [15.6s], breached threshold [10s]
[11:36:11,656][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [1.3m], breached threshold [10s]
[11:36:50,326][WARN ][transport.netty ] [Darkdevil] Exception caught on netty layer [[id: 0x242110fc, /10.102.43.160:49037 => /10.102.19.139:9300]]
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3221)
    at java.lang.String.<init>(String.java:233)
    at org.elasticsearch.common.io.stream.StreamInput.readUTF(StreamInput.java:…)
    at org.elasticsearch.common.io.stream.HandlesStreamInput.readUTF(HandlesStreamInput.java:…)
    at org.elasticsearch.monitor.jvm.JvmInfo.readFrom(JvmInfo.java:…)
    at org.elasticsearch.monitor.jvm.JvmInfo.readJvmInfo(JvmInfo.java:199)
    at org.elasticsearch.action.admin.cluster.node.info.NodeInfo.readFrom(NodeInfo.java:…)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:…)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:…)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:…)
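For what it's worth, pauses like these are easy to tally mechanically. A throwaway sketch (Python; the regex is guessed from the monitor.jvm log lines above, not from any documented format) that pulls the GC pause durations out of a log and converts them to seconds:

```python
import re

# Matches the duration in lines like:
# [11:33:42,705][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [10.6s], breached threshold [10s]
GC_LINE = re.compile(r"Long GC collection occurred, took \[([\d.]+)(s|m)\]")

def gc_pauses_seconds(log_text):
    """Return GC pause durations in seconds, in order of appearance."""
    pauses = []
    for value, unit in GC_LINE.findall(log_text):
        seconds = float(value) * (60 if unit == "m" else 1)
        pauses.append(seconds)
    return pauses

log = """\
[11:33:42,705][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [10.6s], breached threshold [10s]
[11:36:11,656][WARN ][monitor.jvm ] [Darkdevil] Long GC collection occurred, took [1.3m], breached threshold [10s]
"""
print(gc_pauses_seconds(log))  # [10.6, 78.0]
```

The jump from ~10s pauses to a 1.3-minute pause in three minutes is the GC-storm signature described above.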

I will try to get a heap dump and put it on s3 for you to download.
Stay tuned.

Grant

On Sep 15, 12:07 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Hi,

First a clarification. When you shut down the third node, shards that
were allocated on it will get reallocated on the other two nodes. This
is done to maintain the required number of replicas. For example, if
you have an index with 3 shards and 1 replica, and you have 3 nodes,
then each node will have 2 shards. If you shut down the third node, you
will then have 2 nodes with 3 shards each (and if you shut down the
second node as well, you will still have 1 node with 3 shards, since it
makes no sense to allocate a replica of a shard on the same node as its
primary).

What you see is very interesting. First, if after one hour you bring
that node back, then shards should get reallocated and you will end up
again with 2 shards per node, so memory usage should certainly not
increase. You say that you see the increase 30 minutes after you killed
the node. Can you check the logs and see if those 30 minutes are the
time it took to discover that the node was removed (because of long
discovery fault detection settings)? Those 30 minutes might also be the
time it took to fully recover the reallocated shards until they reached
a fully operational state. This will help a bit to narrow down the
problem.

If you want to know more about how the cluster looks, you can check
the cluster state API; it will tell you which nodes are in the cluster,
which indices exist, and which shards are allocated on which node.

The best and quickest way for me to try and help is if you can take a
heap dump when this spike happens (even for the first one). It's a bit
of a pain to get, but that's the best way for me to check where the
memory problem is. To get a heap dump, you can use the latest VisualVM
(https://visualvm.dev.java.net/), run it against the JVM in question
(simplest is to run it on the same box, where it will autodetect it),
and issue a heap dump. You can also use jmap with the -dump option
(http://download.oracle.com/javase/6/docs/technotes/tools/share/jmap.html).

Once you have a heap dump, then comes the interesting bit: tar-gz it
and somehow provide me with a link to download it ;).

I would love to help solve this problem you are having. I am here to
help.

-shay.banon
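In case it saves anyone a man-page lookup, here is a tiny sketch (Python; the pid, the output path, and the helper name are all made up for illustration) of assembling the jmap invocation Shay describes:

```python
import subprocess

def build_jmap_cmd(pid, outfile):
    """Build the jmap argv for a binary heap dump of the given JVM pid.

    Uses the standard JDK 6 `jmap -dump:format=b,file=...` syntax; run
    it as the same OS user that owns the JVM process.
    """
    return ["jmap", f"-dump:format=b,file={outfile}", str(pid)]

cmd = build_jmap_cmd(12345, "/tmp/es-heap.hprof")
print(" ".join(cmd))  # jmap -dump:format=b,file=/tmp/es-heap.hprof 12345

# To actually run it (requires a live JVM at that pid):
# subprocess.run(cmd, check=True)
```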

On Wed, Sep 15, 2010 at 8:26 PM, Grant Rodgers gra...@gmail.com
wrote:

Oops, just noticed there is an unfinished thought in the previous
message. It should be:

(the data gaps in the graphs for the nodes that were not stopped are
due to the elasticsearch jmx endpoint not responding)



(system) #8