Long GC leading to missing shards on 0.90.0.Beta1

Hi,

I'm testing an 8-node cluster running 0.90.0.Beta1 under load (6-7M
queries/hr, 300k index ops/hr). The cluster has 40 shards (10 indices) with
num_replicas=2. It had been running continuously for about 5 days when we ran
into the following sequence:

  • 05:02: A long (2.3 minute) GC cycle on one node (es3) seemed to cause it
    to lose its connection to the master, resulting in it electing itself master.
  • 05:05: Shards on es3 were reallocated to other nodes, and es4 almost
    immediately started reporting that it was missing a number of shards.
  • es4 continued to report missing shards for 3 hours.
  • 07:26: es3 then started missing shards.
  • 09:00: es4 stopped reporting missing shards.
  • es3 continued reporting missing shards until we restarted the node 5 hours
    later.
  • A couple of other long GC events (2-3 minutes) occurred on es2, es3, and
    es4 after that first one, which makes me think the shard reallocation is
    causing them.

elasticsearch.yml config that may be relevant:
node.max_local_storage_nodes: 1
bootstrap.mlockall: true
gateway.recover_after_nodes: 5
gateway.recover_after_time: 5m
gateway.expected_nodes: 8
discovery.zen.minimum_master_nodes: 5
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [list of hosts]
threadpool:
  search:
    type: fixed
  index:
    type: fixed
  bulk:
    type: fixed
  refresh:
    type: cached
monitor.jvm.gc.ParNew.warn: 1000ms
monitor.jvm.gc.ParNew.info: 700ms
monitor.jvm.gc.ConcurrentMarkSweep.warn: 10s
monitor.jvm.gc.ConcurrentMarkSweep.info: 5s

A couple of questions:

  • Could this be an issue with 0.90's new shard allocation?
  • Are there better config settings to avoid getting into such a stuck state
    in the future? (I've put the settings I'm considering just below this list.)
  • Any suggestions on avoiding that initial GC cycle? I'm logging GC events,
    including this first one and some subsequent ones. We have 24 GB allocated
    to the heap and another 20+ GB reserved for the OS. The queries we were
    running are all more_like_this queries with just a short phrase
    automatically taken from some text. One known issue is that we are
    generating many exceptions on the cluster right now because the client is
    sometimes sending null as the "like_text" parameter. I would hope that
    periodic exceptions wouldn't cause this problem, but maybe not.
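
For what it's worth, these are the settings I'm considering trying next. I'm
going from my (possibly wrong) reading of the 0.90 docs, so please correct me
if any of these don't exist at this version or the values are unreasonable:

# Untested guesses -- give a node more slack during a long GC before the
# cluster decides it has dropped out:
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6
# Limit how many shards get moved around at once when a node does drop out:
cluster.routing.allocation.node_concurrent_recoveries: 2
cluster.routing.allocation.cluster_concurrent_rebalance: 1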

Log files are available
at: http://dl.dropbox.com/u/56839351/2013-03-11.global.tar.gz (sorry for
the mess of search exceptions)

Thanks for any help
-Greg


One other thing I'm investigating is whether the long GC cycles are just a
symptom of the shards being reallocated, and whether a network event caused
es3 to momentarily drop out. If so, the question becomes how to properly
throttle shard reallocation so it doesn't cause large GC events.
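
If throttling is indeed the answer, this is roughly what I'd try first (again
going from my reading of the 0.90-era recovery docs, so treat these as guesses
rather than something I've verified on this version):

# Untested recovery throttling:
indices.recovery.concurrent_streams: 2
indices.recovery.max_bytes_per_sec: 20mb   # may be max_size_per_sec on older releases

That should at least cap how much recovery traffic any one node has to absorb
while shards are moving, which I'd hope reduces the heap pressure during a
reallocation storm.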
