OutOfMemoryError on marvel node brought down the production cluster


(T Vinod Gupta) #1

hi,
in my setup, the marvel node is separate from the production cluster.. the
production nodes send data to the marvel node.. the marvel node had an OOM
exception. this brings me to the question: how much heap does it need? i ran
with the default config.

in my prod cluster, i have a load balancer node which holds no data. it runs
with just a 2GB heap. due to the marvel failure, this node was getting
timeouts and for some strange reason went down.

what are the best practices here? how can i avoid this in the future?
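(for context, I had not tuned the heap at all. On ES 1.x the heap is normally raised through the `ES_HEAP_SIZE` environment variable before startup; a sketch, where `4g` is just a placeholder value and not a recommendation:)

```shell
# Sketch, assuming an ES 1.x tarball install; 4g is only a placeholder.
# Rule of thumb: roughly 50% of the machine's RAM, and below ~32g so the
# JVM can keep using compressed object pointers.
export ES_HEAP_SIZE=4g
./bin/elasticsearch
```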

marvel node -
[2014-04-17 09:13:33,715][WARN ][index.engine.internal ] [Gorilla-Man] [.marvel-2014.04.17][0] failed engine
java.lang.OutOfMemoryError: Java heap space
[2014-04-17 09:13:46,890][ERROR][index.engine.internal ] [Gorilla-Man] [.marvel-2014.04.17][0] failed to acquire searcher, source search_factory
org.apache.lucene.store.AlreadyClosedException: this ReferenceManager is closed
    at org.apache.lucene.search.ReferenceManager.acquire(ReferenceManager.java:98)
...

ES LB node -
[2014-04-17 00:01:00,567][ERROR][marvel.agent.exporter ] [Darkoth] create failure (index:[.marvel-2014.04.16] type: [node_stats]): UnavailableShardsException [[.marvel-2014.04.16][0] [2] shardIt, [0] active : Timeout waiting for [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@5d9be928]
[2014-04-17 06:41:46,975][ERROR][marvel.agent.exporter ] [Darkoth] error connecting to [ip-10-68-145-124.ec2.internal:9200]
java.net.SocketTimeoutException: connect timed out
[2014-04-17 18:53:09,969][DEBUG][action.admin.cluster.node.info] [Darkoth] failed to execute on node [L1f57myxQLK1SSRHRFcvFQ]
java.lang.OutOfMemoryError: Java heap space
[2014-04-17 19:35:05,805][DEBUG][action.search.type ] [Witchfire] [twitter_072013][0], node[5GNeFfbPTGi-1EccVvR7Nw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@2f94d571] lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [Mauvais][inet[/10.183.42.216:9300]][search/phase/query]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@4c75d754
    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAHau4yvYsVO%2BbSk_U0cU7%3Di7G4FFgqwHQo_1as%3DezM9t20TRuA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Boaz Leskes) #2

Hi,

Regarding monitoring node sizing - you have to go through pretty much the
same procedure as with your main cluster. See how much data it generates per
day and monitor the memory usage of the node while using marvel on a single
day's index. That is the basis for your calculation. Based on that and the
number of days of data you want to retain, you can decide how many nodes you
need and how much memory each should get. BTW - make sure you use the latest
version of marvel (1.1) - it has a much smaller data footprint.
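The arithmetic itself is simple; a sketch with placeholder numbers (substitute the per-day size that the `_cat/indices` API reports for your own `.marvel-*` indices):

```shell
# Placeholder numbers - substitute what
# `curl -s 'localhost:9200/_cat/indices/.marvel-*?v'` reports for you.
DAILY_GB=5        # size of one day's .marvel index (assumed)
RETAIN_DAYS=7     # days of monitoring history to keep
TOTAL_GB=$((DAILY_GB * RETAIN_DAYS))
echo "total marvel data: ${TOTAL_GB}GB"
```

That total, plus the heap headroom you observe while querying a single day's index, is what drives the node count and per-node memory.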

Regarding the errors on your main production cluster: I'm a bit puzzled by
the log output, as the events are pretty far apart. It starts with a timeout
of the marvel agent; 6 hours later it failed to connect (in between, it
seems everything is fine). Almost 13 hours later the node had an OOM (after
which you restarted it, right? it has a different name). Then 40m later the
log shows that another node (10.183.42.216) is under pressure and rejecting
searches.

I'm not sure the first part is related to the second part. Can you share
your marvel chart of JVM memory for the Darkoth node? It seems your main
cluster is also under memory pressure.

Cheers,
Boaz

On Thursday, April 17, 2014 10:08:04 PM UTC+2, T Vinod Gupta wrote:



(T Vinod Gupta) #3

Thanks Boaz for the reply.. I was using the latest marvel 1.1, by the way.
Looks like you need marvel for marvel!
Actually, my marvel cluster got so messed up that no matter what i did it
would show shard failures in the dashboard and nothing was functional. i
actually had a 2-node cluster for marvel monitoring, and after a restart
they never got out of red state.
so i just gave up on my experimentation with marvel and abandoned it fully..

i probably will go back to bigdesk. any other alternatives that are good?

thanks

ps - my feedback to the marvel team would be to provide marvel as a
service.. that will be huge! i noticed that the size of my data dir on the
marvel node was 37G just from a few days of monitoring. that's heavy.
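(in hindsight, the data dir could probably have been kept in check by deleting old daily `.marvel-YYYY.MM.DD` indices. A minimal sketch - index names and cutoff are made up, and the daily names sort lexicographically so a plain string compare works; elasticsearch-curator does the same job more robustly:)

```shell
# Hypothetical cleanup sketch: keep the cutoff day and newer.
# .marvel index names embed the date, so lexicographic "<" means "older than".
CUTOFF=".marvel-2014.04.14"
for idx in .marvel-2014.04.12 .marvel-2014.04.13 .marvel-2014.04.16 .marvel-2014.04.17; do
  if [[ "$idx" < "$CUTOFF" ]]; then
    echo "would delete: $idx"
    # curl -XDELETE "localhost:9200/$idx"   # uncomment to actually delete
  fi
done
```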

On Sat, Apr 19, 2014 at 1:05 AM, Boaz Leskes b.leskes@gmail.com wrote:



(Boaz Leskes) #4

I think the biggest difference between bigdesk and its relatives is that
they lack history, which is why Marvel stores data - so you can always go
back and find out what went wrong during the night.

If you don't mind me chasing this more (I do want to know what went wrong
:slight_smile: ) - in your production cluster, how many nodes and indices do
you have? I'm asking to get a grip on your 37GB of data (if you prefer to
share it privately, you should be able to via the groups interface;
otherwise I'm bleskes on freenode in #elasticsearch, where I'm online for
most of European waking hours).

Cheers,
Boaz

On Mon, Apr 21, 2014 at 9:45 AM, T Vinod Gupta tvinod@readypulse.com wrote:



(system) #5