Single low-memory data node impacts the whole cluster

Hi,

We recently had a few incidents where a single data node running low on
memory impacts the entire cluster. None of the cluster-related APIs
respond, and Kibana 3 and 4 fail to load too. From the log it seems the
node is doing GC and not responding to any requests, and there are no log
entries between 2:29 and 4:07, when I restarted the node. Is there any way
to make this more resilient?

[2015-02-05 14:29:17,199][INFO ][monitor.jvm ] [Big Wheel]
[gc][young][78379][36567] duration [864ms], collections [1]/[1.7s], total
[864ms]/[1.4h], memory [15.2gb]->[14.6gb]/[19.9gb], all_pools {[young]
[599.8mb]->[2.8mb]/[665.6mb]}{[survivor] [75.6mb]->[83.1mb]/[83.1mb]}{[old]
[14.5gb]->[14.5gb]/[19.1gb]}

[2015-02-05 14:29:23,302][WARN ][monitor.jvm ] [Big Wheel]
[gc][young][78384][36568] duration [1.4s], collections [1]/[2s], total
[1.4s]/[1.4h], memory [15.1gb]->[14.7gb]/[19.9gb], all_pools {[young]
[459.7mb]->[15.7mb]/[665.6mb]}{[survivor]
[83.1mb]->[83.1mb]/[83.1mb]}{[old] [14.5gb]->[14.6gb]/[19.1gb]}

[2015-02-05 14:29:34,990][INFO ][monitor.jvm ] [Big Wheel]
[gc][young][78395][36571] duration [900ms], collections [1]/[1.4s], total
[900ms]/[1.4h], memory [15.1gb]->[14.6gb]/[19.9gb], all_pools {[young]
[484.9mb]->[3.9mb]/[665.6mb]}{[survivor] [71.7mb]->[52.4mb]/[83.1mb]}{[old]
[14.6gb]->[14.6gb]/[19.1gb]}

[2015-02-05 14:29:45,055][WARN ][monitor.jvm ] [Big Wheel]
[gc][young][78404][36574] duration [1.2s], collections [1]/[2s], total
[1.2s]/[1.4h], memory [15.1gb]->[14.7gb]/[19.9gb], all_pools {[young]
[472.8mb]->[2.9mb]/[665.6mb]}{[survivor] [83.1mb]->[67.6mb]/[83.1mb]}{[old]
[14.6gb]->[14.6gb]/[19.1gb]}

[2015-02-05 16:07:15,509][INFO ][node ] [Pyro]
version[1.4.2], pid[9796], build[927caff/2014-12-16T14:11:12Z]

[2015-02-05 16:07:15,510][INFO ][node ] [Pyro]
initializing ...

[2015-02-05 16:07:15,638][INFO ][plugins ] [Pyro] loaded
[marvel, cloud-azure], sites [marvel, kopf]

[2015-02-05 16:07:24,844][INFO ][node ] [Pyro]
initialized

[2015-02-05 16:07:24,845][INFO ][node ] [Pyro]
starting ...


Indeed, JVMs sometimes need to "stop the world" when they are under memory
pressure. You might find advice about GC tuning here and there, but I would
recommend against it, as it is very hard to evaluate the impact of these
settings.

If this issue happens on a regular basis, it might mean that your cluster
is undersized and should be given more memory so that the JVM doesn't have
to run full GCs so often. Otherwise, you should look at how you could
modify Elasticsearch's configuration in order to load less data into memory
(such as using doc values for fielddata). Another option is to run two
nodes per machine instead of one (each with half the memory). Given that
full GCs are shorter on smaller heaps, this should limit the issue.
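
To make the doc values suggestion concrete, here is a minimal sketch of a
mapping that keeps fielddata on disk instead of on the heap. The index,
type and field names are made up for the example, and in 1.x doc_values can
only be set when a field is first created, so existing fields need a
reindex:

curl -XPUT 'http://localhost:9200/logs-2015.02.08/_mapping/event' -d '{
  "properties": {
    "status": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    },
    "response_time_ms": {
      "type": "long",
      "doc_values": true
    }
  }
}'

With doc values, the data used for sorting and aggregations lives in files
served by the filesystem cache rather than on the JVM heap, which is what
relieves the GC pressure, at the cost of some disk and slightly slower
access. For the two-nodes-per-machine option, each node would simply be
started with a smaller heap, e.g. ES_HEAP_SIZE=10g instead of 20g.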


--
Adrien Grand


Thanks for the reply. Today the problem happened again: a bad node stopped
responding and brought down the whole cluster, but this time memory is OK.
Here are some details.

  1. Again, management APIs such as _nodes and _cat are not returning;
    only the default response on port 9200 works. If I directly hit the
    master node, the default 9200 endpoint returns 200, but the other APIs
    are not working.
  2. No out-of-memory exception. We set the heap to 20GB, but usage is only
    about 15GB. (Could it be because of this? The machine has 32GB of
    memory.)
  3. I restarted a couple of high-memory nodes, and the master too, but the
    cluster still did not recover, until I found some master node logs
    pointing to a node, saying an operation could not be executed on the
    bad node.
  4. Again, the bad node's log is missing an entire time period starting a
    couple of hours ago, and in Marvel the node stopped reporting status
    around the same time too. I didn't see anything suspicious in the
    Marvel events, though. Unlike the first time, there's no obvious
    problem (I didn't see GC logs) except some index operations failing.
    I also checked the fielddata size this time; it's not big, only around
    1GB.

What can I do to pinpoint what's going on?
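
To be concrete about point 1, these are the kinds of requests involved
(the host name is a placeholder); the management calls hang while a plain
request to the root endpoint still answers:

curl -s 'http://es-master:9200/_nodes/stats?pretty'                  (hangs)
curl -s 'http://es-master:9200/_cat/nodes?v'                         (hangs)
curl -s -o /dev/null -w '%{http_code}\n' 'http://es-master:9200/'    (prints 200)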


Try looking into hot_threads?
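
For example, something along these lines (the node filter is optional;
point it at any node that still responds):

# hottest threads across all nodes, sampled every 500ms
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5&interval=500ms'

# or only the suspect node, selected by node name
curl -s 'http://localhost:9200/_nodes/Big%20Wheel/hot_threads'

If the bad node is too wedged to answer even hot_threads, running jstack
against the Elasticsearch process ID from the shell gives a similar thread
dump.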
