GC leads to ping timeouts

[2018-08-06T16:26:24,830][INFO ][o.e.m.j.JvmGcMonitorService] [...] [gc][866954] overhead, spent [292ms] collecting in the last [1s]
[2018-08-06T16:26:27,831][INFO ][o.e.m.j.JvmGcMonitorService] [...] [gc][866957] overhead, spent [305ms] collecting in the last [1s]
[2018-08-06T16:26:30,955][INFO ][o.e.m.j.JvmGcMonitorService] [...] [gc][866960] overhead, spent [469ms] collecting in the last [1.1s]
[2018-08-06T16:26:34,956][INFO ][o.e.m.j.JvmGcMonitorService] [...] [gc][866964] overhead, spent [251ms] collecting in the last [1s]
[2018-08-06T16:27:59,397][INFO ][o.e.d.z.ZenDiscovery     ] [...] master_left [{...-master}{HVzJwsDrQieIEMIe-c1KSg}{aH617e42Q6u5OPWBOv1B_Q}{...}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2018-08-06T16:27:59,397][WARN ][o.e.d.z.ZenDiscovery     ] [...] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: nodes:
  {...}{qMUnADcLRA2nCpdqxF61ZQ}{8RG1XNiCTDOGNnVJ_VT23w}{...}
  {...}{S79zJu2uRciv4NuTyDKObQ}{rJHFMpUpQdeoj90-Z-Ewcw}{...}
  {...}{4EwNJclTTBSYJl1bSLCaCA}{FZ38bVYnRoOXMaTP08nBaA}{...}, local
  {...-master}{HVzJwsDrQieIEMIe-c1KSg}{aH617e42Q6u5OPWBOv1B_Q}{...}, master
  {...}{fhsKrrlISL-k_pzLn6fKrg}{FUuWfW17TEqEXAZftHPLTQ}{...}
  {...-master}{ex16_BD9T9GOaFz3hVv4cw}{NyOFn5vzSvCZ6xXYllHbqA}{...}
  {...}{Oy__dkFsStOqvpj5odSfpQ}{bDS_fVH_R3ezoDGmLaIdWA}{...}
  {...}{aT8vSnbmSo6z8Ezuy9zvaQ}{xzyFq-cAROe2rXcGjB4qXg}{...}
  {...}{esIsbvmlRg-aZxbMdFoZvQ}{ZFr_G0rsS_iHqpkmX9vAAg}{...}
  {...}{ABsIgUwvS7WT2-3Oo69BvA}{80DQYjqmSZON-ZvNT3fdUw}{...}
  {...-master}{JGI5LsWiQtKuULxPjGw0zQ}{bZrMT64WSbCNBqTOLPMPGA}{...}
  {...}{EHHOEIe4SYyn5YTyvlwUxQ}{TpQeXLgAT9q5cMPP5Lx8gw}{...}

The log above is from a data node: its pings to the master timed out, so it considered the master to have left, even though the master was actually online. At the same time, the master's pings to the data node timed out, so it considered the data node to have left (see the master's logs below).

[2018-08-06T16:27:58,352][INFO ][o.e.c.r.a.AllocationService] [...-master] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{...}{4EwNJclTTBSYJl1bSLCaCA}{FZ38bVYnRoOXMaTP08nBaA}{...} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2018-08-06T16:27:58,352][INFO ][o.e.c.s.MasterService    ] [...-master] zen-disco-node-failed({...}{4EwNJclTTBSYJl1bSLCaCA}{FZ38bVYnRoOXMaTP08nBaA}{...}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{...}{4EwNJclTTBSYJl1bSLCaCA}{FZ38bVYnRoOXMaTP08nBaA}{...} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{...}{4EwNJclTTBSYJl1bSLCaCA}{FZ38bVYnRoOXMaTP08nBaA}{...},}

This situation has occurred several times, and there was always some long GC activity before the ping timeouts. I don't understand why GC can cause the ping to time out three times in a row, and whether there is any configuration or workaround to avoid these ping timeouts.
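
For reference, the "[3] times" and "[30s] timeout" in the log messages come from the Zen discovery fault-detection settings. The relevant elasticsearch.yml entries (I believe these are the 6.x defaults; changing them would only be a workaround, not a fix):

discovery.zen.fd.ping_interval: 1s   # how often nodes ping each other
discovery.zen.fd.ping_timeout: 30s   # how long each ping may take before it counts as failed
discovery.zen.fd.ping_retries: 3     # consecutive failures before the node is considered gone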

The JVM conf:

-Xms8g
-Xmx8g
-XX:+UseG1GC
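
To measure how long the collector actually stops the application, I could also enable GC logging in jvm.options. A sketch with JDK 8 style flags (the log path and rotation values are just examples):

-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/elasticsearch/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=8
-XX:GCLogFileSize=64m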

Please don't post images of text; they are hard to read and not searchable.

Instead, paste the text and format it with the </> icon. Check the preview window.

OK, I've edited the description. Thanks.

Your nodes are under memory pressure. You need to check this out and fix it.

What is the output of:

GET _cat/health?v
GET _cat/nodes?v
GET _cat/indices?v
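
Per-node heap usage and GC counters can also be checked with the following (assuming a 6.x cluster):

GET _nodes/stats/jvm?human
GET _cat/nodes?v&h=name,heap.percent,heap.max,ram.percent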

I will check the monitoring for memory pressure. But why can a GC pause lead to the ping timing out three times? Are there other factors that could affect the pings?
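
For reference, with the values shown in the log messages ("tried [3] times, each with maximum [30s] timeout"), the worst case before a node is declared gone works out to roughly:

3 retries × 30s per ping ≈ 90s of missed responses

which is far longer than the pauses the GC overhead lines above report.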

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.