[2018-08-06T16:26:24,830][INFO ][o.e.m.j.JvmGcMonitorService] [...] [gc][866954] overhead, spent [292ms] collecting in the last [1s]
[2018-08-06T16:26:27,831][INFO ][o.e.m.j.JvmGcMonitorService] [...] [gc][866957] overhead, spent [305ms] collecting in the last [1s]
[2018-08-06T16:26:30,955][INFO ][o.e.m.j.JvmGcMonitorService] [...] [gc][866960] overhead, spent [469ms] collecting in the last [1.1s]
[2018-08-06T16:26:34,956][INFO ][o.e.m.j.JvmGcMonitorService] [...] [gc][866964] overhead, spent [251ms] collecting in the last [1s]
[2018-08-06T16:27:59,397][INFO ][o.e.d.z.ZenDiscovery ] [...] master_left [{...-master}{HVzJwsDrQieIEMIe-c1KSg}{aH617e42Q6u5OPWBOv1B_Q}{...}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-08-06T16:27:59,397][WARN ][o.e.d.z.ZenDiscovery ] [...] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{...}{qMUnADcLRA2nCpdqxF61ZQ}{8RG1XNiCTDOGNnVJ_VT23w}{...}
{...}{S79zJu2uRciv4NuTyDKObQ}{rJHFMpUpQdeoj90-Z-Ewcw}{...}
{...}{4EwNJclTTBSYJl1bSLCaCA}{FZ38bVYnRoOXMaTP08nBaA}{...}, local
{...-master}{HVzJwsDrQieIEMIe-c1KSg}{aH617e42Q6u5OPWBOv1B_Q}{...}, master
{...}{fhsKrrlISL-k_pzLn6fKrg}{FUuWfW17TEqEXAZftHPLTQ}{...}
{...-master}{ex16_BD9T9GOaFz3hVv4cw}{NyOFn5vzSvCZ6xXYllHbqA}{...}
{...}{Oy__dkFsStOqvpj5odSfpQ}{bDS_fVH_R3ezoDGmLaIdWA}{...}
{...}{aT8vSnbmSo6z8Ezuy9zvaQ}{xzyFq-cAROe2rXcGjB4qXg}{...}
{...}{esIsbvmlRg-aZxbMdFoZvQ}{ZFr_G0rsS_iHqpkmX9vAAg}{...}
{...}{ABsIgUwvS7WT2-3Oo69BvA}{80DQYjqmSZON-ZvNT3fdUw}{...}
{...-master}{JGI5LsWiQtKuULxPjGw0zQ}{bZrMT64WSbCNBqTOLPMPGA}{...}
{...}{EHHOEIe4SYyn5YTyvlwUxQ}{TpQeXLgAT9q5cMPP5Lx8gw}{...}
The above is a log excerpt from one of the data nodes: its pings to the master timed out, so it decided the master had left, even though the master was actually online. At the same time, the master's pings to that data node also timed out, so the master decided the data node had left (see the master's logs below).
[2018-08-06T16:27:58,352][INFO ][o.e.c.r.a.AllocationService] [...-master] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{...}{4EwNJclTTBSYJl1bSLCaCA}{FZ38bVYnRoOXMaTP08nBaA}{...} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2018-08-06T16:27:58,352][INFO ][o.e.c.s.MasterService ] [...-master] zen-disco-node-failed({...}{4EwNJclTTBSYJl1bSLCaCA}{FZ38bVYnRoOXMaTP08nBaA}{...}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{...}{4EwNJclTTBSYJl1bSLCaCA}{FZ38bVYnRoOXMaTP08nBaA}{...} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{...}{4EwNJclTTBSYJl1bSLCaCA}{FZ38bVYnRoOXMaTP08nBaA}{...},}
This situation has occurred several times, and there were always long GC pauses shortly before the ping timeouts. I don't understand how GC overhead can cause three consecutive pings to time out when each ping has a maximum timeout of 30s. Is there any configuration or workaround to avoid these ping timeouts?
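For reference, these appear to be the relevant Zen discovery fault-detection settings in elasticsearch.yml; the values below are the 6.x defaults, which match the "tried [3] times, each with maximum [30s] timeout" wording in the logs. Would raising ping_timeout or ping_retries be a reasonable workaround, or would that just hide the underlying GC problem?

discovery.zen.fd.ping_interval: 1s   # how often a node pings the master (and the master pings the nodes); default
discovery.zen.fd.ping_timeout: 30s   # how long to wait for each ping response; default
discovery.zen.fd.ping_retries: 3     # consecutive failures before the other node is considered gone; default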
The JVM configuration:
-Xms8g
-Xmx8g
-XX:+UseG1GC
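To capture the actual stop-the-world pause times, rather than just the overhead summaries above, GC logging could be enabled. A minimal sketch, assuming JDK 8 (the log path is just an example):

-Xloggc:/var/log/elasticsearch/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime   # logs the total time application threads were stopped, including safepoint pauses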