Long GC leading to missing shards on 0.90.0.Beta1

Hi,

I'm testing an 8-node cluster running 0.90.0.Beta1 under load (6-7M
queries/hr, 300k index ops/hr). The cluster has 40 shards (10 indices) with
num_replicas=2. It had been running continuously for about 5 days when we ran
into the following sequence:

  • 05:02: A long (2.3 minute) GC cycle on one node (es3) seemed to cause it
    to lose its connection to the master, resulting in it electing itself master.
  • 05:05: Shards on es3 were reallocated to other nodes, and es4 almost
    immediately started reporting that it was missing a number of shards.
  • es4 continued to report missing shards for 3 hours.
  • 07:26: es3 then started missing shards.
  • 09:00: es4 stopped reporting missing shards.
  • es3 continued reporting missing shards until we restarted the node 5 hours
    later.
  • A couple of other long GC events (2-3 minutes) occurred on es2, es3, and
    es4 after that first one, which makes me think the shard reallocation is
    causing them.

elasticsearch.yml config that may be relevant:
node.max_local_storage_nodes: 1
bootstrap.mlockall: true
gateway.recover_after_nodes: 5
gateway.recover_after_time: 5m
gateway.expected_nodes: 8
discovery.zen.minimum_master_nodes: 5
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [list of hosts]
threadpool:
  search:
    type: fixed
  index:
    type: fixed
  bulk:
    type: fixed
  refresh:
    type: cached
monitor.jvm.gc.ParNew.warn: 1000ms
monitor.jvm.gc.ParNew.info: 700ms
monitor.jvm.gc.ConcurrentMarkSweep.warn: 10s
monitor.jvm.gc.ConcurrentMarkSweep.info: 5s

A couple of questions:

  • Could this be an issue with 0.90's new shard allocation?
  • Are there better config settings to avoid getting into such a stuck state
    in the future? (I've put the settings I'm considering just below this list.)
  • Any suggestions on avoiding that initial GC cycle? I'm logging GC events,
    including this first one and some subsequent ones. We have 24 GB allocated
    to the heap and another 20+ GB reserved for the OS. The queries we were
    running are all more_like_this queries with just a short phrase
    automatically taken from some text. One known issue is that we are
    generating many exceptions on the cluster right now because the client is
    sometimes sending null as the "like_text" parameter. I would hope that
    periodic exceptions wouldn't cause this problem, but maybe not.
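
For what it's worth, these are the settings I'm considering trying next. I'm
going from my (possibly wrong) reading of the 0.90 docs, so please correct me
if any of these don't exist at this version or the values are unreasonable:

# Untested guesses -- give a node more slack during a long GC before the
# cluster decides it has dropped out:
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6
# Limit how many shards get moved around at once when a node does drop out:
cluster.routing.allocation.node_concurrent_recoveries: 2
cluster.routing.allocation.cluster_concurrent_rebalance: 1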

Log files are available
at: http://dl.dropbox.com/u/56839351/2013-03-11.global.tar.gz (sorry for
the mess of search exceptions)

Thanks for any help
-Greg


One other thing I'm investigating is whether the long GC cycles are just a
symptom of the shards being reallocated, and whether a network event caused
es3 to momentarily drop out. If so, the question becomes how to properly
throttle shard reallocation so it doesn't cause large GC events.
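
If throttling is indeed the answer, this is roughly what I'd try first (again
going from my reading of the 0.90-era recovery docs, so treat these as guesses
rather than something I've verified on this version):

# Untested recovery throttling:
indices.recovery.concurrent_streams: 2
indices.recovery.max_bytes_per_sec: 20mb   # may be max_size_per_sec on older releases

That should at least cap how much recovery traffic any one node has to absorb
while shards are moving, which I'd hope reduces the heap pressure during a
reallocation storm.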
