We're using ES 1.1.0 for central logging storage/searching. When we use
Kibana to search a month's worth of data, our cluster becomes
unresponsive. By unresponsive I mean that many nodes will respond
immediately to a 'curl localhost:9200' but a couple will not. This means
no cluster metrics are available when querying the master, and we're
unable to set any cluster-level settings.
We're getting errors like this in the logs:

[2014-05-05 19:10:50,763][WARN ][transport.netty ] [Leap-Frog] exception caught on transport layer [[id: 0x4b074069, /10.6.10.211:57563 => /10.6.10.148:9300]], closing connection
java.lang.OutOfMemoryError: Java heap space
The cluster never seems to recover either, and that is my biggest concern.
So my questions are:
1. Is it normal for the entire cluster to just close up shop because a
couple of nodes are unresponsive? I thought the field data circuit breaker
would prevent this, but maybe this is a different problem.
2. How do we best get ES to recover from this scenario? I don't really want to
restart just the two nodes, as we have >1 TB of data on each node, but
issuing a disable_allocation fails because it cannot write the setting to all
nodes in the cluster. (Roughly the call we're issuing is sketched below.)
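For reference, this is roughly the call we're trying to issue; a sketch, not the
exact command (the setting is the old 1.x-style disable_allocation flag):

    # sketch: disable shard allocation via the cluster settings API (ES 1.x)
    curl -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.disable_allocation": true
      }
    }'

That's the request that fails for us once the OOMing nodes stop responding.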
You have only two nodes, it seems. Adding nodes may help.
Besides the data nodes that do the heavy work, set up 3 master-eligible nodes
(data-less nodes, with a reasonably smaller heap, just enough for cluster state and
mappings). Set the other (data) nodes so they are not master-eligible. A config
sketch follows.
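Roughly, in elasticsearch.yml (a sketch; adjust discovery settings to your setup):

    # on the 3 dedicated master-eligible nodes
    node.master: true
    node.data: false

    # on the data nodes
    node.master: false
    node.data: true

    # with 3 master-eligible nodes, require a quorum of 2 to avoid split brain
    discovery.zen.minimum_master_nodes: 2

That way an OOM on a busy data node cannot take the elected master down with it.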
Jörg
I have 11 nodes. 3 are dedicated masters and the other 8 are data nodes.
Then you need more nodes, more heap on the existing nodes, or less data.
You've reached the limit of what your current cluster can handle; that is
why this is happening.
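To confirm that, the nodes stats API shows heap usage per node; something like
this (a sketch) makes it easy to spot nodes running close to their limit:

    # sketch: per-node JVM heap usage (ES 1.x nodes stats API)
    curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep -E '"name"|"heap_used_percent"'

Nodes sitting at 90%+ heap after a big Kibana query are the likely candidates
for the next OOM.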
Is there any way to prevent ES from blowing up just because someone selects too
much data? This is my biggest concern.
Is it because bootstrap.mlockall is on, so we give ES/JVM a specified
amount of memory and that's all that node will ever get? If we turned that
off and had gobs more swap available for ES, would it avoid blowing up and just
be really slow instead? (Our memory settings are roughly what's sketched below.)
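For context, we're on more or less the stock memory setup; something like this
(the heap size shown is illustrative, not our actual value):

    # sketch: how the ES 1.x startup scripts pick up the heap size
    export ES_HEAP_SIZE=8g          # fixed JVM heap for the node (illustrative value)
    # and in elasticsearch.yml:
    #   bootstrap.mlockall: true    # lock the heap in RAM so it never swaps
    bin/elasticsearch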
ES has a lot of failsafe mechanisms against "OutOfMemoryError" built in:
- thread pools are strict, they do not grow endlessly
- field data cache usage is limited
- a field data circuit breaker helps to terminate queries early, before too much
memory is consumed (the relevant settings are sketched after this list)
- closing unused indices frees heap resources that are no longer required
- balancing shards over nodes equalizes resource usage across the nodes
- catching "Throwable" in several critical modules allows spontaneous
recovery from temporary JVM OOMs (e.g. if GC is too slow)
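For the field data limits, the knobs in 1.x look roughly like this (setting names
as I recall them for 1.1; check the docs for your exact version):

    # sketch: cap the field data cache (static, in elasticsearch.yml)
    #   indices.fielddata.cache.size: 40%
    # sketch: the field data circuit breaker limit can be changed at runtime
    curl -XPUT 'localhost:9200/_cluster/settings' -d '{
      "persistent": {
        "indices.fielddata.breaker.limit": "60%"
      }
    }'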
Nevertheless, you can override defaults and get into the "red area" where an
ES node is no longer able to react properly over the API. This can also happen
because of:
- misconfigurations
- "badly behaving" queries that max out CPU or exceed the available heap
in unpredictable ways
- unexpected, huge query loads or large result sets
- sudden peaks of resource usage, e.g. while merging large segments or
bulk indexing
- a distorted document/term distribution over shards that defeats equal
shard balancing
- etc.
Unresponsive nodes are removed from the cluster after a few seconds, so
this is not really a problem unless you have no replicas, or the cluster
can't keep up with recovering from such events.
There is no known mechanism that automatically protects you from crossing the
line into the "red area" where a JVM cannot recover from an OOM and becomes
unresponsive. This is not specific to ES; it applies to all JVM applications.
Best practice is "know your data, know your nodes". Exercise your ES
cluster before putting real data on it to get an idea of the maximum
capacity of a node and of the whole cluster, and of the best configuration options,
and put a proxy in front of ES that allows only well-behaved actions.
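For the proxy, even something very simple helps; a rough sketch with nginx
(assumed here only as an example, use whatever you already run in front of ES):

    # sketch: expose only read-only searches on the logging indices,
    # block admin APIs and ad-hoc requests against everything else
    upstream es_cluster { server 127.0.0.1:9200; }
    server {
      listen 9201;
      location ~ ^/logstash-.*/_search$ { proxy_pass http://es_cluster; }
      location / { return 403; }
    }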
Jörg
Thanks Jörg. I think my experience with other data stores (Mongo, SQL) had
me assuming that ES is similar, when in reality it's a different tool with
different pros and cons. As for configuration, we're basically running
stock (we enabled slow logs, unicast discovery, and site tagging). I think
ultimately ES relies heavily on not swapping, so the memory you give it needs
to hold all of your results, whereas in the past I've allowed SQL servers to
swap to handle larger loads.
We'll play with some other settings. I liken Kibana to handing someone with
no knowledge of SQL best practices a 5 TB SQL database and a GUI query builder
and being surprised when they join 50 tables and bring the server to
its knees.