ES 0.20.0 hangs regularly


(T Vinod Gupta) #1

Hi,
I have a running cluster of 3 nodes (ES 0.20.0), and once every day or two it
hangs on me. By hang, I mean a curl command for a health check, get, or search
hangs forever. The log files don't give any clue about this.

What can I do to debug this further? Will upgrading to a Lucene 4 based
version help?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Ivan Brusic) #2

Is there truly nothing in the logs? How about frequent garbage collections?
Do you have any monitoring?

Without knowing anything more, I would strongly suggest upgrading to the
latest release. The memory improvements in 0.90.x are truly remarkable
thanks to Lucene 4 and other changes. I am not one to upgrade frequently (I
am still on version 0.90.2), but the Lucene 4 based version is the way to
go.

Cheers,

Ivan



(T Vinod Gupta) #3

Thanks for the response. I should have provided more details: by nothing
in the logs, I mean nothing around the time of the hang. I have a monit task
that runs the curl-based health check every minute, so I know when it hangs.
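A probe like the one described here can hang along with the cluster unless it carries its own timeout. Below is a minimal Python sketch of such a check, not the poster's actual monit setup. It assumes the HTTP API is reachable at `http://localhost:9200` and polls the `_cluster/health` endpoint. The names `fetch_health` and `probe`, and the 10-second timeout, are hypothetical choices made for illustration.

```python
import json
import socket
import time
import urllib.request


def fetch_health(base_url, timeout_s):
    """Fetch and decode the cluster health document, giving up after timeout_s seconds."""
    url = base_url + "/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


def probe(base_url="http://localhost:9200", timeout_s=10.0, fetch=fetch_health):
    """Return (status, elapsed_seconds); status is 'unreachable' on timeout or connection error."""
    started = time.monotonic()
    try:
        status = fetch(base_url, timeout_s).get("status", "unknown")
    except OSError:
        # socket.timeout and urllib.error.URLError are both subclasses of OSError,
        # so a hung or refused connection ends up here instead of blocking forever.
        status = "unreachable"
    return status, time.monotonic() - started
```

Calling `probe()` once a minute and logging both the status and the elapsed time makes it possible to line up slow or failed checks with the GC warnings in the node logs.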

There are other indicators in the logs: I see long GC times and exceptions,
and occasionally a node does not just hang but crashes outright with an OOM.
Here are some of the messages:

[2013-11-01 00:00:12,910][WARN ][monitor.jvm ] [Marrina] [gc][ParNew][54839][23098] duration [3.7s], collections [2]/[20.7s], total [3.7s]/[40.7m], memory [7.9gb]->[7.8gb]/[7.9gb], all_pools {[Code Cache] [12.4mb]->[12.4mb]/[48mb]}{[Par Eden Space] [3.3mb]->[13.1mb]/[66.5mb]}{[Par Survivor Space] [7mb]->[0b]/[8.3mb]}{[CMS Old Gen] [7.8gb]->[7.8gb]/[7.9gb]}{[CMS Perm Gen] [44.7mb]->[44.7mb]/[168mb]}
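The GC warning above already contains the key number: how full each memory pool is after the collection. As a rough illustration, the following Python sketch parses the `all_pools` section of that exact log line and computes the post-collection fill ratio per pool. The helper names (`to_bytes`, `pool_usage`) are hypothetical, and the regular expression assumes the log format shown in this thread.

```python
import re

# The monitor.jvm warning from this thread, joined back into a single line.
LOG_LINE = (
    "[2013-11-01 00:00:12,910][WARN ][monitor.jvm ] [Marrina] "
    "[gc][ParNew][54839][23098] duration [3.7s], collections [2]/[20.7s], "
    "total [3.7s]/[40.7m], memory [7.9gb]->[7.8gb]/[7.9gb], all_pools "
    "{[Code Cache] [12.4mb]->[12.4mb]/[48mb]}"
    "{[Par Eden Space] [3.3mb]->[13.1mb]/[66.5mb]}"
    "{[Par Survivor Space] [7mb]->[0b]/[8.3mb]}"
    "{[CMS Old Gen] [7.8gb]->[7.8gb]/[7.9gb]}"
    "{[CMS Perm Gen] [44.7mb]->[44.7mb]/[168mb]}"
)

UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}

# Matches one pool entry: {[name] [before]->[after]/[max]}
POOL_RE = re.compile(r"\{\[([^\]]+)\] \[([^\]]+)\]->\[([^\]]+)\]/\[([^\]]+)\]\}")


def to_bytes(text):
    """Convert a size such as '7.8gb' or '0b' into a number of bytes."""
    match = re.fullmatch(r"([\d.]+)(b|kb|mb|gb)", text)
    if match is None:
        raise ValueError("unrecognized size: " + text)
    return float(match.group(1)) * UNITS[match.group(2)]


def pool_usage(line):
    """Map each pool name to its fill ratio (size after GC divided by pool maximum)."""
    return {
        name: to_bytes(after) / to_bytes(maximum)
        for name, _before, after, maximum in POOL_RE.findall(line)
    }


for pool, ratio in pool_usage(LOG_LINE).items():
    print(f"{pool}: {ratio:.1%} full after GC")
```

For this line the CMS Old Gen pool comes out at 7.8gb of 7.9gb, so the old generation is still almost completely full after the collection, and the total heap only dropped from 7.9gb to 7.8gb.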

[2013-11-01 03:10:33,244][WARN ][search.action ] [Marrina] Failed to send release search context
org.elasticsearch.transport.NodeDisconnectedException: [Star-Lord][inet[/10.6.14.94:9300]][search/freeContext] disconnected

[2013-11-01 03:10:35,293][WARN ][search.action ] [Marrina] Failed to send release search context
org.elasticsearch.transport.SendRequestTransportException: [Star-Lord][inet[/10.6.14.94:9300]][search/freeContext]

Is it because I have more data and operations than my nodes can support?
It's a 3-node cluster (only 2 data nodes, though) running on AWS EC2
m1.xlarge instances with a dedicated 8GB heap on each node.

thanks



(T Vinod Gupta) #4

Also, I see a lot of messages like this one in the log file:

[2013-11-01 00:04:06,534][WARN ][transport ] [Star-Lord] Received response for a request that has timed out, sent [59098ms] ago, timed out [29098ms] ago, action [discovery/zen/fd/masterPing], node [[Marrina][NuHNT0fXTWa33_t6qk2tGg][inet[/10.241.58.143:9300]]], id [15642]
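The two durations in this warning are enough to work out how late the master's ping response was. The short Python sketch below does that arithmetic on the message quoted above; the function name `ping_timing` and the keys of the returned dictionary are hypothetical, chosen only for illustration.

```python
import re

# The transport warning from this thread, joined back into a single line.
WARNING = (
    "Received response for a request that has timed out, sent [59098ms] ago, "
    "timed out [29098ms] ago, action [discovery/zen/fd/masterPing], node "
    "[[Marrina][NuHNT0fXTWa33_t6qk2tGg][inet[/10.241.58.143:9300]]], id [15642]"
)


def ping_timing(message):
    """Derive the timeout and the lateness of a response from the two 'ago' values."""
    match = re.search(r"sent \[(\d+)ms\] ago,\s+timed out \[(\d+)ms\] ago", message)
    if match is None:
        raise ValueError("not a timed-out response warning")
    sent_ms, timed_out_ms = int(match.group(1)), int(match.group(2))
    return {
        "round_trip_ms": sent_ms,                # how long the response actually took
        "timeout_ms": sent_ms - timed_out_ms,    # how long the sender waited before giving up
        "late_by_ms": timed_out_ms,              # how long after the timeout the response arrived
    }


print(ping_timing(WARNING))
```

For this message the sender gave up after 30000 ms and the response arrived 29098 ms later, so the master ping took roughly a full minute to be answered.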

thanks



(Ivan Brusic) #5

The servers are too busy doing garbage collection to respond to ping
requests.

There are a few ways to resolve this issue, such as optimizing your queries
or tuning JVM and cache settings, but upgrading Elasticsearch is the most
sensible IMHO. Elasticsearch 1.0 will not be out until 2014, but I would
rather upgrade twice than endure cluster hiccups due to memory pressure.

Ivan



(Jörg Prante) #6

Without knowing anything about your queries and your memory considerations,
it is not easy to give advice.

From what it seems, you may also have to check JVM settings to be sure.

Jörg


