Single low-memory data node impacts the whole cluster

Hi,

We recently had a few incidents where a single data node running low on
memory impacts the entire cluster. None of the cluster-related APIs
respond, and Kibana 3 and 4 fail to load too. From the log it seems the
node is doing GC and not responding to any requests, and there are no log
entries between 2:29 and 4:07, when I restarted the node. Is there any way
to make this more resilient?

[2015-02-05 14:29:17,199][INFO ][monitor.jvm ] [Big Wheel]
[gc][young][78379][36567] duration [864ms], collections [1]/[1.7s], total
[864ms]/[1.4h], memory [15.2gb]->[14.6gb]/[19.9gb], all_pools {[young]
[599.8mb]->[2.8mb]/[665.6mb]}{[survivor] [75.6mb]->[83.1mb]/[83.1mb]}{[old]
[14.5gb]->[14.5gb]/[19.1gb]}

[2015-02-05 14:29:23,302][WARN ][monitor.jvm ] [Big Wheel]
[gc][young][78384][36568] duration [1.4s], collections [1]/[2s], total
[1.4s]/[1.4h], memory [15.1gb]->[14.7gb]/[19.9gb], all_pools {[young]
[459.7mb]->[15.7mb]/[665.6mb]}{[survivor]
[83.1mb]->[83.1mb]/[83.1mb]}{[old] [14.5gb]->[14.6gb]/[19.1gb]}

[2015-02-05 14:29:34,990][INFO ][monitor.jvm ] [Big Wheel]
[gc][young][78395][36571] duration [900ms], collections [1]/[1.4s], total
[900ms]/[1.4h], memory [15.1gb]->[14.6gb]/[19.9gb], all_pools {[young]
[484.9mb]->[3.9mb]/[665.6mb]}{[survivor] [71.7mb]->[52.4mb]/[83.1mb]}{[old]
[14.6gb]->[14.6gb]/[19.1gb]}

[2015-02-05 14:29:45,055][WARN ][monitor.jvm ] [Big Wheel]
[gc][young][78404][36574] duration [1.2s], collections [1]/[2s], total
[1.2s]/[1.4h], memory [15.1gb]->[14.7gb]/[19.9gb], all_pools {[young]
[472.8mb]->[2.9mb]/[665.6mb]}{[survivor] [83.1mb]->[67.6mb]/[83.1mb]}{[old]
[14.6gb]->[14.6gb]/[19.1gb]}

[2015-02-05 16:07:15,509][INFO ][node ] [Pyro]
version[1.4.2], pid[9796], build[927caff/2014-12-16T14:11:12Z]

[2015-02-05 16:07:15,510][INFO ][node ] [Pyro]
initializing ...

[2015-02-05 16:07:15,638][INFO ][plugins ] [Pyro] loaded
[marvel, cloud-azure], sites [marvel, kopf]

[2015-02-05 16:07:24,844][INFO ][node ] [Pyro]
initialized

[2015-02-05 16:07:24,845][INFO ][node ] [Pyro]
starting ...


Indeed, JVMs sometimes need to "stop the world" when they are under memory
pressure. You might find advice about GC tuning here and there, but I would
recommend against it, as it is very hard to evaluate the impact of these
settings.

If this issue happens on a regular basis, it might mean that your cluster
is undersized and should be given more memory so that the JVM doesn't have
to run full GCs so often. Otherwise, you should look at how you could
modify Elasticsearch's configuration in order to load less data into memory
(such as using doc values for fielddata). Another option is to run two
nodes per machine instead of one (each with half the memory). Given that
full GCs are shorter on smaller heaps, this should limit the issue.
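
To make the doc values suggestion concrete, here is a minimal sketch of a
mapping that keeps fielddata on disk instead of on the heap. The index,
type and field names are made up for the example, and in 1.x doc_values can
only be set when a field is first created, so existing fields need a
reindex:

curl -XPUT 'http://localhost:9200/logs-2015.02.08/_mapping/event' -d '{
  "properties": {
    "status": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    },
    "response_time_ms": {
      "type": "long",
      "doc_values": true
    }
  }
}'

With doc values, the data used for sorting and aggregations lives in files
served by the filesystem cache rather than on the JVM heap, which is what
relieves the GC pressure, at the cost of some disk and slightly slower
access. For the two-nodes-per-machine option, each node would simply be
started with a smaller heap, e.g. ES_HEAP_SIZE=10g instead of 20g.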


--
Adrien Grand


Thanks for the reply. Today the problem happened again: a bad node stopped
responding and brought down the whole cluster, but this time memory is OK.
Here are some details.

  1. Again, management APIs such as _nodes and _cat are not returning;
    only the default response on port 9200 works. If I directly hit the
    master node, the default 9200 endpoint returns 200, but the other APIs
    are not working.
  2. No out-of-memory exception. We set the heap to 20GB, but usage is only
    about 15GB. (Could it be because of this? The machine has 32GB of
    memory.)
  3. I restarted a couple of high-memory nodes, and the master too, but the
    cluster still did not recover, until I found some master node logs
    pointing to a node, saying an operation could not be executed on the
    bad node.
  4. Again, the bad node's log is missing an entire time period starting a
    couple of hours ago, and in Marvel the node stopped reporting status
    around the same time too. I didn't see anything suspicious in the
    Marvel events, though. Unlike the first time, there's no obvious
    problem (I didn't see GC logs) except some index operations failing.
    I also checked the fielddata size this time; it's not big, only around
    1GB.

What can I do to pinpoint what's going on?
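
To be concrete about point 1, these are the kinds of requests involved
(the host name is a placeholder); the management calls hang while a plain
request to the root endpoint still answers:

curl -s 'http://es-master:9200/_nodes/stats?pretty'                  (hangs)
curl -s 'http://es-master:9200/_cat/nodes?v'                         (hangs)
curl -s -o /dev/null -w '%{http_code}\n' 'http://es-master:9200/'    (prints 200)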


Try looking into hot_threads?
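
For example, something along these lines (the node filter is optional;
point it at any node that still responds):

# hottest threads across all nodes, sampled every 500ms
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5&interval=500ms'

# or only the suspect node, selected by node name
curl -s 'http://localhost:9200/_nodes/Big%20Wheel/hot_threads'

If the bad node is too wedged to answer even hot_threads, running jstack
against the Elasticsearch process ID from the shell gives a similar thread
dump.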
