We are seeing the same issue here.
Our environment:
- 2 nodes
- 30GB heap allocated to ES
- ~140GB of data
- 639 indices, 10 shards per index
- ~48M documents
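For reference, this is roughly how we watch per-node heap while the cluster is running; a minimal sketch against the _nodes/stats/jvm API (the host and port below are placeholders, point it at either node):

import json
import time
from urllib.request import urlopen

# Poll per-node JVM heap usage via the nodes stats API (placeholder host/port).
NODES_STATS_URL = "http://localhost:9200/_nodes/stats/jvm"

def print_heap_report():
    stats = json.load(urlopen(NODES_STATS_URL))
    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        print("%s: heap %d%% (%d / %d bytes)" % (
            node["name"],
            mem["heap_used_percent"],
            mem["heap_used_in_bytes"],
            mem["heap_max_in_bytes"]))

while True:
    print_heap_report()
    time.sleep(60)  # sample once a minute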
After starting ES everything is fine, but after a couple of hours the heap builds up towards 96% on one node and 80% on the other. We then see very long old-generation GC pauses on the node at 96%:
TOuKgmlzaVaFVA][elasticsearch1.trend1.bottlenose.com][inet[/192.99.45.125:9300]]])
[2014-04-16 12:04:27,845][INFO ][discovery ] [elasticsearch2.trend1] trend1/I3EHG_XjSayz2OsHyZpeZA
[2014-04-16 12:04:27,850][INFO ][http ] [elasticsearch2.trend1] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/192.99.45.126:9200]}
[2014-04-16 12:04:27,851][INFO ][node ] [elasticsearch2.trend1] started
[2014-04-16 12:04:32,669][INFO ][indices.store ] [elasticsearch2.trend1] updating indices.store.throttle.max_bytes_per_sec from [20mb] to [1gb], note, type is [MERGE]
[2014-04-16 12:04:32,669][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
[2014-04-16 12:04:32,670][INFO ][indices.recovery ] [elasticsearch2.trend1] updating [indices.recovery.max_bytes_per_sec] from [200mb] to [2gb]
[2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
[2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
[2014-04-16 15:25:21,409][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][11876][106] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[1.4m], memory [28.7gb]->[22gb]/[29.9gb], all_pools {[young] [67.9mb]->[268.9mb]/[665.6mb]}{[survivor] [60.5mb]->[0b]/[83.1mb]}{[old] [28.6gb]->[21.8gb]/[29.1gb]}
[2014-04-16 16:02:32,523][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][13996][144] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[3m], memory [28.8gb]->[23.5gb]/[29.9gb], all_pools {[young] [21.8mb]->[238.2mb]/[665.6mb]}{[survivor] [82.4mb]->[0b]/[83.1mb]}{[old] [28.7gb]->[23.3gb]/[29.1gb]}
[2014-04-16 16:14:12,386][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][14603][155] duration [1.3m], collections [2]/[1.3m], total [1.3m]/[4.4m], memory [29.2gb]->[23.9gb]/[29.9gb], all_pools {[young] [289mb]->[161.3mb]/[665.6mb]}{[survivor] [58.3mb]->[0b]/[83.1mb]}{[old] [28.8gb]->[23.8gb]/[29.1gb]}
[2014-04-16 16:17:55,480][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][14745][158] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[5.7m], memory [29.7gb]->[24.1gb]/[29.9gb], all_pools {[young] [633.8mb]->[149.7mb]/[665.6mb]}{[survivor] [68.6mb]->[0b]/[83.1mb]}{[old] [29gb]->[24gb]/[29.1gb]}
[2014-04-16 16:21:17,950][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][14857][161] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[7.2m], memory [28.6gb]->[24.5gb]/[29.9gb], all_pools {[young] [27.7mb]->[154.8mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [28.5gb]->[24.3gb]/[29.1gb]}
[2014-04-16 16:24:48,776][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][14978][164] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[8.6m], memory [29.4gb]->[24.7gb]/[29.9gb], all_pools {[young] [475.5mb]->[125.1mb]/[665.6mb]}{[survivor] [68.9mb]->[0b]/[83.1mb]}{[old] [28.9gb]->[24.6gb]/[29.1gb]}
[2014-04-16 16:26:54,801][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15021][165] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[9.9m], memory [29.3gb]->[24.8gb]/[29.9gb], all_pools {[young] [391.8mb]->[151.1mb]/[665.6mb]}{[survivor] [62.4mb]->[0b]/[83.1mb]}{[old] [28.9gb]->[24.6gb]/[29.1gb]}
[2014-04-16 16:30:45,393][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15170][168] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[11.3m], memory [29.4gb]->[24.6gb]/[29.9gb], all_pools {[young] [320.3mb]->[186.7mb]/[665.6mb]}{[survivor] [75.7mb]->[0b]/[83.1mb]}{[old] [29gb]->[24.4gb]/[29.1gb]}
[2014-04-16 16:32:57,505][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15221][169] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[12.6m], memory [29.1gb]->[25.4gb]/[29.9gb], all_pools {[young] [25mb]->[194.1mb]/[665.6mb]}{[survivor] [36.3mb]->[0b]/[83.1mb]}{[old] [29gb]->[25.2gb]/[29.1gb]}
[2014-04-16 16:36:30,468][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15352][172] duration [1.3m], collections [2]/[1.3m], total [1.3m]/[14m], memory [29.2gb]->[23.7gb]/[29.9gb], all_pools {[young] [176.8mb]->[192.9mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [28.9gb]->[23.5gb]/[29.1gb]}
[2014-04-16 16:40:10,961][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15488][175] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[15.3m], memory [29.3gb]->[24.7gb]/[29.9gb], all_pools {[young] [383.2mb]->[149.4mb]/[665.6mb]}{[survivor] [23.1mb]->[0b]/[83.1mb]}{[old] [28.9gb]->[24.6gb]/[29.1gb]}
[2014-04-16 16:42:21,572][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15530][176] duration [1.3m], collections [1]/[1.4m], total [1.3m]/[16.7m], memory [29.7gb]->[25.9gb]/[29.9gb], all_pools {[young] [614.8mb]->[347mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [29gb]->[25.5gb]/[29.1gb]}
[2014-04-16 16:48:02,263][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15782][181] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[18.1m], memory [29.4gb]->[25.1gb]/[29.9gb], all_pools {[young] [399mb]->[143.4mb]/[665.6mb]}{[survivor] [51.1mb]->[0b]/[83.1mb]}{[old] [29gb]->[24.9gb]/[29.1gb]}
[2014-04-16 16:51:10,090][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15884][183] duration [1.3m], collections [1]/[1.4m], total [1.3m]/[19.5m], memory [29.4gb]->[25.6gb]/[29.9gb], all_pools {[young] [554.7mb]->[194.2mb]/[665.6mb]}{[survivor] [48.1mb]->[0b]/[83.1mb]}{[old] [28.8gb]->[25.4gb]/[29.1gb]}
[2014-04-16 16:53:19,220][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15930][184] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[20.8m], memory [29.1gb]->[25.9gb]/[29.9gb], all_pools {[young] [27.3mb]->[335.9mb]/[665.6mb]}{[survivor] [64.4mb]->[0b]/[83.1mb]}{[old] [29gb]->[25.6gb]/[29.1gb]}
These pauses then result in the other node no longer being able to reach it:
[2014-04-16 16:36:30,433][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [79996ms] ago, timed out [49996ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6139231]
[2014-04-16 16:36:30,433][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [49996ms] ago, timed out [19996ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6140170]
[2014-04-16 16:40:10,934][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [79139ms] ago, timed out [49139ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6191341]
[2014-04-16 16:40:10,934][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [49138ms] ago, timed out [19107ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6191679]
[2014-04-16 16:42:21,506][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [83223ms] ago, timed out [53223ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6206126]
[2014-04-16 16:42:21,506][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [53223ms] ago, timed out [23223ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6206535]
[2014-04-16 16:48:02,238][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [81402ms] ago, timed out [51402ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6292721]
[2014-04-16 16:48:02,239][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [51403ms] ago, timed out [21338ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6293381]
[2014-04-16 16:51:10,053][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [82555ms] ago, timed out [52510ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6331039]
[2014-04-16 16:51:10,053][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [52510ms] ago, timed out [22510ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6331356]
[2014-04-16 16:53:19,190][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [82194ms] ago, timed out [52194ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6349594]
[2014-04-16 16:53:19,190][WARN ][transport ] [elasticsearch1.trend1] Received response for a request that has timed out, sent [52194ms] ago, timed out [22194ms] ago, action [discovery/zen/fd/ping], node [[elasticsearch2.trend1][I3EHG_XjSayz2OsHyZpeZA][elasticsearch2.trend1.bottlenose.com][inet[/192.99.45.126:9300]]], id [6356292]
Any insights would be greatly appreciated,
Dominiek
On Wednesday, April 16, 2014 8:57:47 PM UTC+2, Ankush Jhalani wrote:
We have a 4-node ES cluster (2 client-only nodes, and 2 data/master nodes with 25GB of memory allocated to ES and 12 cores each), storing an index with 16 shards, ~200GB of data and 1 replica.
Recently, while running scan/scroll requests to dump data along with other faceting requests, the nodes disconnected from each other and we ended up in a split-brain condition. All requests were being run sequentially, not in parallel. From the logs we noticed that for 5-10 minutes only the young GC was running, even though old-generation and total memory usage were over 75% (which I think is the default CMSInitiatingOccupancyFraction=75). When the old GC finally ran, it took 1.7 minutes, which caused the nodes to disconnect (3 pings of 30 seconds each failed).
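To put rough numbers on that, here is a small sketch of how one can pull the old-gen occupancy out of the monitor.jvm lines below and compare it with the 75% threshold (the regex is just an assumption about the log format shown here):

import re

# Extract the old-gen pool from a monitor.jvm line: {[old] [20.8gb]->[2.4gb]/[23.5gb]}
OLD_POOL = re.compile(
    r"\{\[old\] \[([\d.]+)(mb|gb)\]->\[([\d.]+)(mb|gb)\]/\[([\d.]+)(mb|gb)\]\}")

def to_gb(value, unit):
    return float(value) / 1024.0 if unit == "mb" else float(value)

def old_gen_occupancy(line):
    """Return (before_pct, after_pct) old-gen occupancy for one GC log line, or None."""
    m = OLD_POOL.search(line)
    if m is None:
        return None
    before, after, limit = (to_gb(m.group(i), m.group(i + 1)) for i in (1, 3, 5))
    return 100.0 * before / limit, 100.0 * after / limit

sample = ("... all_pools {[young] [727.2mb]->[13.8mb]/[1.1gb]}"
          "{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [20.8gb]->[2.4gb]/[23.5gb]}")
print(old_gen_occupancy(sample))  # roughly (88.5, 10.2): well above the 75% CMS trigger before collection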
Below are some of the last traces from the log -
[2014-04-15 22:07:33,526][INFO ][monitor.jvm ] [ny1.node2] [gc][young][149283][9270] duration [782ms], collections [1]/[1s], total [782ms]/[8.2m], memory [21.5gb]->[20.8gb]/[24.8gb], all_pools {[young] [1gb]->[147mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.3gb]->[20.5gb]/[23.5gb]}
[2014-04-15 22:07:35,264][INFO ][monitor.jvm ] [ny1.node2] [gc][young][149284][9271] duration [743ms], collections [1]/[1.7s], total [743ms]/[8.2m], memory [20.8gb]->[20.8gb]/[24.8gb], all_pools {[young] [147mb]->[10.1mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.5gb]->[20.6gb]/[23.5gb]}
[2014-04-15 22:07:36,814][INFO ][monitor.jvm ] [ny1.node2] [gc][young][149285][9272] duration [786ms], collections [1]/[1.5s], total [786ms]/[8.2m], memory [20.8gb]->[20.9gb]/[24.8gb], all_pools {[young] [10.1mb]->[2.8mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.6gb]->[20.8gb]/[23.5gb]}
[2014-04-15 22:07:38,880][INFO ][monitor.jvm ] [ny1.node2] [gc][young][149287][9273] duration [835ms], collections [1]/[1s], total [835ms]/[8.2m], memory [21.5gb]->[21.1gb]/[24.8gb], all_pools {[young] [655.9mb]->[1.2mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.7gb]->[20.9gb]/[23.5gb]}
[2014-04-15 22:09:24,215][INFO ][monitor.jvm ] [ny1.node2] [gc][young][149290][9274] duration [786ms], collections [1]/[1.7m], total [786ms]/[8.3m], memory [21.7gb]->[2.4gb]/[24.8gb], all_pools {[young] [727.2mb]->[13.8mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [20.8gb]->[2.4gb]/[23.5gb]}
[2014-04-15 22:09:24,215][WARN ][monitor.jvm ] [ny1.node2] [gc][old][149290][25] duration [1.7m], collections [2]/[1.7m], total [1.7m]/[1.7m], memory [21.7gb]->[2.4gb]/[24.8gb], all_pools {[young] [727.2mb]->[13.8mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [20.8gb]->[2.4gb]/[23.5gb]}
CPU usage on the machine was quite low, so it's confusing why the old GC was held off for so long, and why it took so much time when it finally ran. We currently use Java 7 update 10; would the latest update 51 help? Would switching to G1 GC help?
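As a side note, before experimenting it may be worth reading back the JVM build and the collectors each node is actually running from the node info API; a small sketch, again with a placeholder host, and assuming the jvm section exposes gc_collectors as in the 1.x node info response:

import json
from urllib.request import urlopen

# Read back JVM version and configured GC collectors per node (placeholder host).
info = json.load(urlopen("http://localhost:9200/_nodes/jvm"))
for node in info["nodes"].values():
    jvm = node["jvm"]
    print(node["name"], jvm.get("version"), jvm.get("gc_collectors"))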
Thanks!