ES v1.1 continuous young gc pauses old gc, stops the world when old gc happens and splits cluster

We're seeing the same thing. ES 1.1.0, JDK 7u55 on Ubuntu 12.04, 5 data
nodes, 3 separate masters, all are 15GB hosts with 7.5GB Heaps, storage is
SSD. Data set is ~1.6TB according to Marvel.

Our daily indices are roughly 33GB in size, with 5 shards and 2 replicas.
I'm still investigating what happened yesterday, but I do see in Marvel a
large spike in the "Indices Current Merges" graph just before the node
dies, and a corresponding increase in JVM Heap. When Heap hits 99%
everything grinds to a halt. Restarting the node "fixes" the issue, but
this is the third or fourth time it's happened.
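
To spot a node climbing towards that 99% before it falls over, something like the following could be run against the output of the node stats API. It is only a sketch: it assumes the response has the shape nodes -> <id> -> jvm -> mem -> heap_used_percent, and the function name, the 85% threshold and the sample data are made up for illustration.

```python
def nodes_over_heap_threshold(stats, threshold_pct=85):
    """Return (node_name, heap_pct) pairs at or above threshold_pct, worst first.

    `stats` is a parsed node stats response; the nodes/jvm/mem/heap_used_percent
    layout is an assumption about that response, not something from this thread.
    """
    hot = []
    for node_id, node in stats.get("nodes", {}).items():
        heap_pct = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap_pct is not None and heap_pct >= threshold_pct:
            # Fall back to the node id when the response carries no name.
            hot.append((node.get("name", node_id), heap_pct))
    return sorted(hot, key=lambda pair: pair[1], reverse=True)


# Hand-written sample in the assumed shape (not real cluster output).
sample = {
    "nodes": {
        "a1": {"name": "data-1", "jvm": {"mem": {"heap_used_percent": 99}}},
        "b2": {"name": "data-2", "jvm": {"mem": {"heap_used_percent": 61}}},
    }
}
print(nodes_over_heap_threshold(sample))
```

Feeding it a stats snapshot every minute or so and alerting on repeated hits would at least give a warning before the restart becomes necessary.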

I'm still researching how to deal with this, but a couple of things I am
looking at are:

I would love to get some feedback on my ramblings. If I find anything more
I'll update this thread.

cheers
mike

On Thursday, June 19, 2014 4:05:54 PM UTC-4, Bruce Ritchie wrote:

Java 8 with G1GC perhaps? It'll have more overhead but perhaps it'll be
more consistent wrt pauses.

On Wednesday, June 18, 2014 2:02:24 PM UTC-4, Eric Brandes wrote:

I'd just like to chime in with a "me too". Is the answer just more
nodes? In my case this is happening every week or so.

On Monday, April 21, 2014 9:04:33 PM UTC-5, Brian Flad wrote:

My dataset currently is 100GB across a few "daily" indices (~5-6GB and 15
shards each). Data nodes are 12 CPU, 12GB RAM (6GB heap).

On Mon, Apr 21, 2014 at 6:33 PM, Mark Walkom ma...@campaignmonitor.com
wrote:

How big are your data sets? How big are your nodes?

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com

On 22 April 2014 00:32, Brian Flad bfla...@gmail.com wrote:

We're seeing the same behavior with 1.1.1, JDK 7u55, 3 master nodes (2
min master), and 5 data nodes. Interestingly, we see the repeated young GCs
only on a node or two at a time. Cluster operations (such as recovering
unassigned shards) grind to a halt. After restarting a GCing node,
everything returns to normal operation in the cluster.

Brian F

On Wed, Apr 16, 2014 at 8:00 PM, Mark Walkom ma...@campaignmonitor.com
wrote:

In both your instances, if you can, run 3 master-eligible nodes. That
reduces the likelihood of a split cluster, because a majority quorum can
still be formed when one of them drops out. Also set
discovery.zen.minimum_master_nodes to go with that.
However, you may just be reaching the limit of your nodes, in which case the
best option is to add another node (which also neatly solves your split
brain!).
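
For reference, the usual rule for discovery.zen.minimum_master_nodes is a strict majority of the master-eligible nodes, i.e. floor(n / 2) + 1. A small sketch of that arithmetic (the function name is only illustrative):

```python
def minimum_master_nodes(master_eligible):
    """Strict majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


for n in (2, 3, 5):
    print(n, "master-eligible ->", minimum_master_nodes(n))
```

With 2 master-eligible nodes the majority is 2, so losing either node leaves the cluster without a master; with 3 the majority is still 2, so one node can drop out safely.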

Ankush, it would help if you can update Java; most people recommend u25,
but we run u51 with no problems.

Regards,
Mark Walkom

On 17 April 2014 07:31, Dominiek ter Heide domin...@gmail.com wrote:

We are seeing the same issue here.

Our environment:

  • 2 nodes
  • 30GB Heap allocated to ES
  • ~140GB of data
  • 639 indices, 10 shards per index
  • ~48M documents
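
To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes the ~140GB figure covers primaries only and that shards are spread evenly over the nodes; the replica count is not stated above, so it defaults to 0 here, and the function name is just for illustration.

```python
def shard_footprint(indices, shards_per_index, data_gb, nodes, replicas=0):
    """Rough shard counts and sizes; assumes even distribution across nodes."""
    primaries = indices * shards_per_index
    total_copies = primaries * (1 + replicas)  # primaries plus replica copies
    return {
        "primary_shards": primaries,
        "shard_copies_per_node": total_copies / nodes,
        "avg_primary_shard_mb": data_gb * 1024 / primaries,
    }


# Figures from the environment listed above; replicas=0 is an assumption.
footprint = shard_footprint(indices=639, shards_per_index=10, data_gb=140, nodes=2)
for key, value in footprint.items():
    print(key, round(value, 1))
```

That works out to 6390 primary shards, about 3195 per node, averaging roughly 22MB each. Every shard is a separate Lucene index with its own fixed overhead, so a shard count this high for this little data may well account for part of the heap pressure.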

After starting ES everything is good, but after a couple of hours we see
the Heap build up towards 96% on one node and 80% on the other. We then see
the GC take very long on the 96% node:

TOuKgmlzaVaFVA][elasticsearch1.trend1.bottlenose.com][inet[/192.99.45.125:9300]]])

[2014-04-16 12:04:27,845][INFO ][discovery ] [elasticsearch2.trend1] trend1/I3EHG_XjSayz2OsHyZpeZA

[2014-04-16 12:04:27,850][INFO ][http ] [elasticsearch2.trend1] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/192.99.45.126:9200]}

[2014-04-16 12:04:27,851][INFO ][node ] [elasticsearch2.trend1] started

[2014-04-16 12:04:32,669][INFO ][indices.store ] [elasticsearch2.trend1] updating indices.store.throttle.max_bytes_per_sec from [20mb] to [1gb], note, type is [MERGE]

[2014-04-16 12:04:32,669][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]

[2014-04-16 12:04:32,670][INFO ][indices.recovery ] [elasticsearch2.trend1] updating [indices.recovery.max_bytes_per_sec] from [200mb] to [2gb]

[2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]

[2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]

[2014-04-16 15:25:21,409][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][11876][106] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[1.4m], memory [28.7gb]->[22gb]/[29.9gb], all_pools {[young] [67.9mb]->[268.9mb]/[665.6mb]}{[survivor] [60.5mb]->[0b]/[83.1mb]}{[old] [28.6gb]->[21.8gb]/[29.1gb]}

[2014-04-16 16:02:32,523][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][13996][144] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[3m], memory [28.8gb]->[23.5gb]/[29.9gb], all_pools {[young] [21.8mb]->[238.2mb]/[665.6mb]}{[survivor] [82.4mb]->[0b]/[83.1mb]}{[old] [28.7gb]->[23.3gb]/[29.1gb]}

[2014-04-16 16:14:12,386][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][14603][155] duration [1.3m], collections [2]/[1.3m], total [1.3m]/[4.4m], memory [29.2gb]->[23.9gb]/[29.9gb], all_pools {[young] [289mb]->[161.3mb]/[665.6mb]}{[survivor] [58.3mb]->[0b]/[83.1mb]}{[old] [28.8gb]->[23.8gb]/[29.1gb]}

[2014-04-16 16:17:55,480][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][14745][158] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[5.7m], memory [29.7gb]->[24.1gb]/[29.9gb], all_pools {[young] [633.8mb]->[149.7mb]/[665.6mb]}{[survivor] [68.6mb]->[0b]/[83.1mb]}{[old] [29gb]->[24gb]/[29.1gb]}

[2014-04-16 16:21:17,950][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][14857][161] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[7.2m], memory [28.6gb]->[24.5gb]/[29.9gb], all_pools {[young] [27.7mb]->[154.8mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [28.5gb]->[24.3gb]/[29.1gb]}

[2014-04-16 16:24:48,776][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][14978][164] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[8.6m], memory [29.4gb]->[24.7gb]/[29.9gb], all_pools {[young] [475.5mb]->[125.1mb]/[665.6mb]}{[survivor] [68.9mb]->[0b]/[83.1mb]}{[old] [28.9gb]->[24.6gb]/[29.1gb]}

[2014-04-16 16:26:54,801][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15021][165] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[9.9m], memory [29.3gb]->[24.8gb]/[29.9gb], all_pools {[young] [391.8mb]->[151.1mb]/[665.6mb]}{[survivor] [62.4mb]->[0b]/[83.1mb]}{[old] [28.9gb]->[24.6gb]/[29.1gb]}

[2014-04-16 16:30:45,393][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][15170][168] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[11.3m], memory [29.4gb]->[24.6gb]/[29.9gb], all_pools {[young] [320.3mb]->[186.7mb

...
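
To quantify how little each of these old-gen collections reclaims, the monitor.jvm lines can be parsed with a short script. This is a sketch based only on the line format visible in the log above; the regex, the unit tables and the function names are my own and have not been checked against other Elasticsearch versions.

```python
import re

# Matches the "[gc][old][...][...] duration [...] ... memory [a]->[b]/[c]" part
# of a monitor.jvm line; DOTALL lets it span lines wrapped by a mail client.
GC_LINE = re.compile(
    r"\[gc\]\[(?P<gen>young|old)\]\[\d+\]\[\d+\]\s+duration\s+\[(?P<pause>[^\]]+)\]"
    r".*?memory\s+\[(?P<before>[^\]]+)\]->\[(?P<after>[^\]]+)\]/\[(?P<limit>[^\]]+)\]",
    re.DOTALL,
)

SIZE_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}
TIME_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}


def parse_quantity(text, units):
    """Convert strings such as '28.7gb' or '1.1m' into a float in base units."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([a-z]+)\s*", text)
    if not match or match.group(2) not in units:
        raise ValueError("unrecognised quantity: %r" % text)
    return float(match.group(1)) * units[match.group(2)]


def parse_gc_line(line):
    """Return pause length and heap occupancy for one GC log line, or None."""
    match = GC_LINE.search(line)
    if match is None:
        return None
    after = parse_quantity(match.group("after"), SIZE_UNITS)
    limit = parse_quantity(match.group("limit"), SIZE_UNITS)
    return {
        "generation": match.group("gen"),
        "pause_seconds": parse_quantity(match.group("pause"), TIME_UNITS),
        "heap_before_bytes": parse_quantity(match.group("before"), SIZE_UNITS),
        "heap_after_bytes": after,
        "heap_after_pct": 100.0 * after / limit,
    }


# First old-gen entry from the log above, cut off before all_pools.
sample = (
    "[2014-04-16 15:25:21,409][WARN ][monitor.jvm ] [elasticsearch2.trend1] "
    "[gc][old][11876][106] duration [1.1m], collections [1]/[1.1m], "
    "total [1.1m]/[1.4m], memory [28.7gb]->[22gb]/[29.9gb]"
)
print(parse_gc_line(sample))
```

For that first entry the pause is about 66 seconds and the heap is still around 74% full afterwards, which suggests most of the old generation is live data rather than garbage.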
