First aid: try increasing
discovery.zen.ping.timeout: 60s (default: 3s)
and the Zen fault detection settings
discovery.zen.fd.ping_timeout: 60s (default: 30s)
discovery.zen.fd.ping_interval: 60s (default: 1s)
so that communication with the master node gets enough time to survive
certain GC stalls. This is just a workaround - I don't know how long
your GC stalls last or how long a node cannot respond. It is important
to analyze these numbers.
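One way to get these numbers (just a sketch, assuming a Sun/Oracle
HotSpot JVM and that you pass extra options to the start script, e.g.
via ES_JAVA_OPTS; the log path is only an example) is to enable GC
logging and read off how long the stop-the-world phases really are:

    ES_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc.log"

The "Total time for which application threads were stopped" lines show
how long a node was really unresponsive - that is the number the fd
timeouts would have to cover.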
The default zen ping timeout values are selected very carefully. They
assume all nodes in the cluster are alive and can respond. If you
increase the timeouts, you allow nodes to stay unresponsive for longer,
and that will influence the whole cluster - long response times may be
the consequence. It's just "sugar coating" the real problem, so
increasing the timeouts is not a proper solution.
Some hints to get closer to the cause:
-
check your code for the reason why it creates heap "spikes" that force
the GC to step in and run into JVM stall situations. In some situations
CMS GC can be tuned to avoid these edge cases (see the sketch after
this list), but it does not always work out.
-
if you must accept the "spikes" and the large heaps and CMS GC can't be
improved, try another GC algorithm that is optimized for large heaps
and short stall times (G1 GC). Note that G1 is not the default GC in
Java 7 and is not considered stable yet; it also takes more CPU and
decreases overall performance. It does not prevent the spikes, but it
lets the JVM respond within smaller time frames, so the JVM stays more
reactive (see the sketch after this list).
-
if all GC improvement strategies fail, consider a smaller heap per
node - for example, more nodes with less heap each - so the "spikes"
do not hurt so much (see the sketch below).
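As a rough sketch of what these three hints could look like in
practice - assuming you start Elasticsearch with the standard shell
scripts, which pick up ES_JAVA_OPTS and ES_HEAP_SIZE; the concrete
values are only illustrations, not recommendations:

    # CMS knobs to play with: start concurrent collection earlier, so it
    # is less likely to fall back into a long stop-the-world full GC
    # (the stock elasticsearch.in.sh already sets values like these)
    ES_JAVA_OPTS="-XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly \
      -XX:CMSInitiatingOccupancyFraction=75"

    # or, with Java 7, switch to G1 with a pause time target (remove the
    # CMS flags the stock elasticsearch.in.sh sets first, otherwise the
    # JVM refuses to start because of conflicting collectors)
    ES_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"

    # smaller heap per node (and more nodes instead), e.g. 8g instead of 24g
    ES_HEAP_SIZE=8g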
Jörg
On 17.03.13 13:54, asher frenkel wrote:
Hi,
I have a 6-node cluster running ES 0.20.5. The cluster currently has
around 35M docs spread over 12 shards with 1 replica.
Each node has 48GB RAM with a 24GB heap.
At random times I am experiencing spikes in heap usage followed by
long GCs.
After the nodes finish the long GC, the cluster gets into a split
brain situation with weird states where one of the cluster nodes is a
member of both sides of the split.
The minimum_master_nodes option does not help in this case since the
node exists in two of the different cluster states.
I'd appreciate any suggestions you have to prevent these issues,
especially the split brain, since it causes corruption of our index.
Thanks
Asher