ES 0.20.0 hangs regularly

T_Vinod_Gupta · October 31, 2013, 6:42pm

Hi,
I have a running cluster of 3 nodes (ES 0.20.0), and once every day or two
it hangs. By hang, I mean that a curl command for a health check, get, or
search hangs forever. The log files don't give any clue about this.

What can I do to debug this further? Will upgrading to a Lucene 4 based
version help?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ivan · November 1, 2013, 6:27am

Is there truly nothing in the logs? How about frequent garbage collections?
Do you have any monitoring?

Without knowing anything more, I would strongly suggest upgrading to the
latest release. The memory improvements in 0.90.x are truly remarkable
thanks to Lucene 4 and other changes. I am not one to upgrade frequently (I
am still on version 0.90.2), but the Lucene 4 based version is the way to
go.

Cheers,

Ivan

T_Vinod_Gupta · November 1, 2013, 6:56am

Thanks for the response. I should have provided more details: by "nothing
in the logs", I mean nothing around the time of the hang. I have a monit
task that runs the curl-based health check every minute, so I know when it
hangs.
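For readers who want the same kind of probe without curl blocking forever, here is a minimal Python sketch of a health check with a hard timeout. It assumes the node listens on the default HTTP port at `http://localhost:9200` and uses the `/_cluster/health` endpoint; the function name `check_health`, the 10-second timeout, and the injectable `opener` parameter are illustrative choices, not anything taken from the original monit setup.

```python
import json
import urllib.request


def check_health(base_url="http://localhost:9200", timeout=10.0,
                 opener=urllib.request.urlopen):
    """Return the cluster status string, or "unreachable" if the node
    does not answer (or answers with garbage) within `timeout` seconds.

    `opener` is injectable only so the function can be exercised
    without a live cluster; by default it is urllib's urlopen.
    """
    url = base_url.rstrip("/") + "/_cluster/health"
    try:
        # The timeout is what turns a "hang" into a detectable failure.
        with opener(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError and
        # socket timeouts are subclasses); ValueError covers invalid JSON.
        return "unreachable"
    return body.get("status", "unknown")
```

Calling `check_health()` once a minute and alerting on `"unreachable"` or `"red"` would give roughly the same signal as the curl check, with the difference that a hung node is reported after the timeout instead of stalling the check itself.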

There are other indicators in the logs: I see long GC times and exceptions,
and occasionally it does not just hang but crashes completely due to OOM.
Here are some of the messages:

[2013-11-01 00:00:12,910][WARN ][monitor.jvm ] [Marrina] [gc][ParNew][54839][23098] duration [3.7s], collections [2]/[20.7s], total [3.7s]/[40.7m], memory [7.9gb]->[7.8gb]/[7.9gb], all_pools {[Code Cache] [12.4mb]->[12.4mb]/[48mb]}{[Par Eden Space] [3.3mb]->[13.1mb]/[66.5mb]}{[Par Survivor Space] [7mb]->[0b]/[8.3mb]}{[CMS Old Gen] [7.8gb]->[7.8gb]/[7.9gb]}{[CMS Perm Gen] [44.7mb]->[44.7mb]/[168mb]}

[2013-11-01 03:10:33,244][WARN ][search.action ] [Marrina] Failed to send release search context
org.elasticsearch.transport.NodeDisconnectedException: [Star-Lord][inet[/10.6.14.94:9300]][search/freeContext] disconnected
[2013-11-01 03:10:35,293][WARN ][search.action ] [Marrina] Failed to send release search context
org.elasticsearch.transport.SendRequestTransportException: [Star-Lord][inet[/10.6.14.94:9300]][search/freeContext]
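
The `monitor.jvm` warning above can be read mechanically: the `memory [before]->[after]/[max]` field shows how full the heap still is after a collection. The following Python sketch pulls that field out of a log line and reports the fraction of heap in use after GC. The helper name `heap_after_gc` is hypothetical, the regex only targets the format seen in this thread, and treating `gb`/`mb`/`kb` as powers of 1024 is an assumption about how the sizes are printed.

```python
import re

# Assumed binary units for the sizes printed in the log line.
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}

SIZE = r"\[([\d.]+)(gb|mb|kb|b)\]"
MEMORY = re.compile(r"memory\s+" + SIZE + r"->" + SIZE + r"/" + SIZE)


def heap_after_gc(line):
    """Return (fraction_of_heap_used_after_gc, bytes_freed) for a
    monitor.jvm log line, or None if the memory field is absent."""
    m = MEMORY.search(line)
    if m is None:
        return None
    before = float(m.group(1)) * UNITS[m.group(2)]
    after = float(m.group(3)) * UNITS[m.group(4)]
    maximum = float(m.group(5)) * UNITS[m.group(6)]
    return after / maximum, before - after
```

For the line quoted here, the heap goes from 7.9gb to 7.8gb out of 7.9gb, so almost the entire heap is still occupied after the collection, and the CMS Old Gen pool does not shrink at all. A heap that stays this full after GC suggests the node is spending most of its time collecting and is close to an out-of-memory condition.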

Is it because I have more data and operations than my nodes can support?
It's a 3-node cluster (only 2 data nodes, though) running on AWS EC2
m1.xlarge instances with an 8 GB dedicated heap on each node.

thanks

T_Vinod_Gupta · November 1, 2013, 6:59am

Also, I see a lot of messages such as this one in the log file:

[2013-11-01 00:04:06,534][WARN ][transport ] [Star-Lord] Received response for a request that has timed out, sent [59098ms] ago, timed out [29098ms] ago, action [discovery/zen/fd/masterPing], node [[Marrina][NuHNT0fXTWa33_t6qk2tGg][inet[/10.241.58.143:9300]]], id [15642]
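
The two durations in this warning carry more information than they first appear to: subtracting "timed out ... ago" from "sent ... ago" gives the timeout that was in effect, and the "timed out ... ago" value is how late the answer arrived. A small Python sketch of that arithmetic follows; the helper name `ping_timing` is hypothetical and the regex is written only against the message format shown in this thread.

```python
import re

# \s+ tolerates the line wrap between the two durations in mailed logs.
PING = re.compile(r"sent \[(\d+)ms\] ago,\s+timed out \[(\d+)ms\] ago")


def ping_timing(line):
    """Return the effective timeout and the lateness of the response,
    both in milliseconds, or None if the line does not match."""
    m = PING.search(line)
    if m is None:
        return None
    sent_ms = int(m.group(1))
    timed_out_ms = int(m.group(2))
    return {
        "timeout_ms": sent_ms - timed_out_ms,  # how long the sender waited
        "late_by_ms": timed_out_ms,            # how long after that the reply came
    }
```

For the message above this gives a timeout of 59098 - 29098 = 30000 ms, with the reply arriving about 29 seconds after that. In other words, the master took close to a minute to answer a fault-detection ping, which appears consistent with a node stalled in long garbage collections rather than a network fault, although the log line alone cannot prove which.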

thanks

Ivan · November 1, 2013, 5:04pm

The servers are too busy doing garbage collection and cannot respond to
ping requests.

There are a few ways to resolve this issue, such as optimizing your queries
or tuning JVM and cache settings, but upgrading Elasticsearch is the most
sensible IMHO. Elasticsearch 1.0 will not be out until 2014, but I would
rather upgrade twice than endure cluster hiccups due to memory pressure.

Ivan

jprante · November 1, 2013, 7:11pm

Without knowing anything about your queries and your memory considerations,
it is not easy to give advice.

From what it seems, you may also have to check JVM settings to be sure.

Jörg
