We are running a 4-node Elasticsearch cluster with one client node and three data nodes. The data nodes are also configured as master-eligible nodes.
Elasticsearch version is 2.3.5
Lucene version is 5.5.0
Java is OpenJDK version 1.8.0_141
Below are the JAVA_OPTS values Elasticsearch is running with:
-Xms10g -Xmx10g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC
Whenever a GC is triggered, the data nodes leave the cluster. Below are the GC logs:
[2017-09-28 11:52:52,186][WARN ][monitor.jvm ] [10.0.1.85] [gc][old][504680][614] duration [39.8m], collections [84]/[40.9m], total [39.8m]/[5h], memory [9.9gb]->[9.9gb]/[9.9gb], all_pools {[young] [532.5mb]->[532.4mb]/[532.5mb]}{[survivor] [66.4mb]->[66.4mb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-09-28 11:52:53,186][WARN ][monitor.jvm ] [10.0.1.85] [gc][old][504681][615] duration [29.5s], collections [1]/[30.5s], total [29.5s]/[5h], memory [9.9gb]->[6.1gb]/[9.9gb], all_pools {[young] [532.4mb]->[47.1mb]/[532.5mb]}{[survivor] [66.4mb]->[66.4mb]/[66.5mb]}{[old] [9.3gb]->[6gb]/[9.3gb]}
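To make the two GC lines above easier to compare, here is a small Python sketch that pulls out the pause duration and the old-generation pool figures. It is only an illustration and not part of our setup: the function name `summarize` and the regular expressions are my own assumptions about the log layout, based solely on the two lines shown here.

```python
import re

# The two old-generation GC entries from the log above (timestamp and node prefix dropped).
LINES = [
    "[gc][old][504680][614] duration [39.8m], collections [84]/[40.9m], "
    "total [39.8m]/[5h], memory [9.9gb]->[9.9gb]/[9.9gb], all_pools "
    "{[young] [532.5mb]->[532.4mb]/[532.5mb]}"
    "{[survivor] [66.4mb]->[66.4mb]/[66.5mb]}"
    "{[old] [9.3gb]->[9.3gb]/[9.3gb]}",
    "[gc][old][504681][615] duration [29.5s], collections [1]/[30.5s], "
    "total [29.5s]/[5h], memory [9.9gb]->[6.1gb]/[9.9gb], all_pools "
    "{[young] [532.4mb]->[47.1mb]/[532.5mb]}"
    "{[survivor] [66.4mb]->[66.4mb]/[66.5mb]}"
    "{[old] [9.3gb]->[6gb]/[9.3gb]}",
]

# Assumed layout: "duration [<number><unit>]" and "{[old] [before]->[after]/[capacity]}".
DURATION = re.compile(r"duration \[([\d.]+)(ms|s|m|h)\]")
OLD_POOL = re.compile(
    r"\{\[old\] \[([\d.]+)(mb|gb)\]->\[([\d.]+)(mb|gb)\]/\[([\d.]+)(mb|gb)\]\}"
)

SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
MEGABYTES = {"mb": 1.0, "gb": 1024.0}


def summarize(line):
    """Return pause length and old-gen usage for one monitor.jvm GC line."""
    d = DURATION.search(line)
    p = OLD_POOL.search(line)
    if d is None or p is None:
        raise ValueError("line does not match the assumed GC log layout")
    before = float(p.group(1)) * MEGABYTES[p.group(2)]
    after = float(p.group(3)) * MEGABYTES[p.group(4)]
    capacity = float(p.group(5)) * MEGABYTES[p.group(6)]
    return {
        "duration_s": float(d.group(1)) * SECONDS[d.group(2)],
        "old_before_pct": 100.0 * before / capacity,
        "old_after_pct": 100.0 * after / capacity,
        "freed_mb": before - after,
    }


if __name__ == "__main__":
    for line in LINES:
        s = summarize(line)
        print(
            "pause %.1fs, old gen %.1f%% -> %.1f%%, freed %.1f MB"
            % (s["duration_s"], s["old_before_pct"], s["old_after_pct"], s["freed_mb"])
        )
```

Read this way, the first entry spends 39.8 minutes in old-generation collections without freeing anything, and the second frees about 3.3 GB, leaving the old generation roughly 65% full, which is still close to the configured CMSInitiatingOccupancyFraction of 75.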
During this time the cluster does not process queries, and queries fail with:
[2017-09-28 12:34:49,232][DEBUG][action.admin.cluster.health] [10.0.1.85] no known master node, scheduling a retry
[2017-09-28 12:35:19,233][DEBUG][action.admin.cluster.health] [10.0.1.85] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2017-09-28 12:35:19,234][WARN ][rest.suppressed ] path: /_cluster/health, params: {}
MasterNotDiscoveredException[null]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:226)
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:804)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The data nodes are also unable to discover the master node. Below are the logs:
[2017-09-28 12:33:48,614][WARN ][discovery.zen.ping.unicast] [10.0.1.85] failed to send ping to [{#zen_unicast_3#}{10.0.1.89}{10.0.1.89:9300}]
ReceiveTimeoutTransportException[[][10.0.1.89:9300][internal:discovery/zen/unicast] request_id [648779] timed out after [3750ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:679)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
We would really appreciate suggestions on how we can optimise this.