Why does a node disconnect after three big GCs?

version: 6.2.2

[2019-01-21T12:02:02,098][INFO ][o.e.m.j.JvmGcMonitorService] [node-142] [gc][407049] overhead, spent [427ms] collecting in the last [1s]
[2019-01-21T12:02:03,483][INFO ][o.e.m.j.JvmGcMonitorService] [node-142] [gc][407050] overhead, spent [439ms] collecting in the last [1.3s]
[2019-01-21T12:02:05,484][INFO ][o.e.m.j.JvmGcMonitorService] [node-142] [gc][407052] overhead, spent [354ms] collecting in the last [1s]
...(other RemoteTransportException)
[2019-01-21T12:08:18,972][INFO ][o.e.c.r.a.AllocationService] [node-142] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[celery_log][0]] ...]).
[2019-01-21T12:08:19,003][WARN ][o.e.d.z.ZenDiscovery ] [node-142] zen-disco-failed-to-publish, current nodes: nodes:
{node-143}{gIvN8KNiQsmY2Lsy20Expg}{TUXpFaUnRTKAa5Asdf19gA}{192.168.30.143}{192.168.30.143:2810}
{node-141}{3kWs-rFaR_yZVZOpRp1BFA}{nFj2LpDWS6qNIeEUKzpBLA}{192.168.30.141}{192.168.30.141:2810}
{node-142}{aYMFcfn3RvKDVfgncxC8cQ}{yodrtCrKSaWcVI49NE9dCg}{192.168.30.142}{192.168.30.142:2810}, local, master

In my config (elasticsearch.yml):
discovery.zen.fd.ping_interval: 10s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 3

With ping_timeout=60s and ping_retries=3, it looks like something goes wrong around the third GC event (12:02:05), and three minutes or more later (12:08:19) the zen-disco-failed-to-publish message is logged.
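To spell out how I read those settings (just my rough back-of-the-envelope estimate, ignoring the 10s ping interval): a node should only be dropped after all retries have timed out, i.e. about

    ping_retries x ping_timeout = 3 x 60s = 180s (about 3 minutes)

which seemed consistent with the delay of a few minutes between the GC messages and the failure log.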

This happens regularly, once or twice per day on average.

GC type: G1
JVM version: 1.8.0_121
ES version: 6.2.2

How many shards and indices do you have?
How many nodes, indices, GB of data?

Thanks for your question.
This cluster has 3 nodes, 33 indices, 144 shards, and 197 GB of data,
with 299,881,279 docs (the number keeps changing).
The figures above come from Cerebro, the monitoring tool for this cluster.

There is only one big index (larger than 40 GB); it has 9 primary shards and 9 replica shards, so 6 of its shards per node.

We do not recommend nor support G1 GC with Java 8. I would therefore recommend you change back to the default CMS and see if that makes any difference. From version 6.5 onwards we do, however, support G1 GC with Java 11.
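In practice, changing back means editing jvm.options so the collector flags look roughly like this (a sketch of what the stock 6.x jvm.options ships with; verify the exact values against the file bundled with your installation):

## comment out or remove any G1 flags that were added, e.g.
# -XX:+UseG1GC

## default CMS settings
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly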

OK, thank you for your reply!

If there is a lot of data in the ES cluster, try a 31 GB JVM heap and use G1GC with a recent Java 8 update release.
In my tests with large data volumes in ES (100 TB+), G1GC performs better.

Which test are you referring to?

Run with G1GC or CMS for a longer period and compare the GC behaviour and server status (CPU, I/O, indexing speed, ...).

I would recommend reading @danielmitterdorfer's response to this thread about what and how we test before we declare something supported.

OK, I will look at this; maybe my test was not good.

As Daniel pointed out, there are users that are successfully using it, but that does not necessarily mean it will work for all use cases, which is what we need to cover when declaring something supported. It is also probably more about stability than performance, which is harder to test.

Can you give us other recommendations? In fact, we used CMS before and changed to G1 only because of this problem.

What is the full output of the cluster stats API? What does your Elasticsearch config look like? Do you have any non-default settings? Can you describe the use-case?
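For example, something along these lines (Console syntax, same style as the GET _cluster/state shown later in this thread; the query parameters are optional):

GET _cluster/stats?human&pretty
GET _cluster/settings

plus the contents of your elasticsearch.yml and jvm.options.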

Hi,

what specific evidence is leading you to the conclusion that those two events (long garbage collection pauses and ping timeouts) are related, or, more precisely, that the GC pauses cause the ping timeouts? From what you present here I don't see how they are related:

  • The GC pauses are not longer than your ping timeout.
  • The GC pauses occur several minutes before you get the message that Zen discovery has failed to publish the cluster state.
  • The log output is actually not showing GC pause times but rather the overhead per second (which might be due to multiple pauses). I suggest you always look at the JVM's GC log file to analyze GC issues instead of the Elasticsearch log file (see the sketch after this list). While Elasticsearch can provide a high-level hint about the GC overhead, all the specifics are actually in the JVM GC log.
  • As Christian has pointed out G1 on Java 8 is not supported (but again: based on the data you provide I don't see any evidence that GC is your problem).
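On JDK 8 the GC log is controlled by flags in jvm.options; the bundled file already contains a block along these lines, possibly commented out (treat this as a sketch and adapt the log path to your setup):

## JDK 8 GC logging (the -Xloggc path below is just an example)
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/elasticsearch/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=32
-XX:GCLogFileSize=64m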

Daniel


Hi, Daniel
thanks for your reply.

Before a disconnection there are always several GC log lines (as you say, the GC overhead info). On the master node, the disconnection message is that Zen discovery has failed to publish; on a non-master node, the message is "failed to send join request to master" (I will paste some log excerpts at the end).
That's why I say the two events are related: it seems impossible to me that two things always happen together (and even with roughly the same interval between them) without being related at all.
But I don't know which one is the cause and which the effect, or whether both are effects of some other cause.

Thanks for your guidance; I'll look at the JVM's GC log file and gather more information.
I have to say, this question has been bothering us for a long time, and I sincerely hoped to get a professional answer (such as yours and your team's replies). Thank you very much again.

Finally, here are some non-master logs from a disconnection.
[2019-01-21T05:29:19,687][INFO ][o.e.m.j.JvmGcMonitorService] [node-143] [gc][384105] overhead, spent [397ms] collecting in the last [1.3s]
[2019-01-21T05:29:26,725][INFO ][o.e.m.j.JvmGcMonitorService] [node-143] [gc][384112] overhead, spent [431ms] collecting in the last [1s]
[2019-01-21T05:29:33,728][INFO ][o.e.m.j.JvmGcMonitorService] [node-143] [gc][384119] overhead, spent [474ms] collecting in the last [1s]
...[some RemoteTransportException]
[2019-01-21T05:32:12,262][INFO ][o.e.d.z.ZenDiscovery ] [node-143] master_left [{node-142}{aYMFcfn3RvKDVfgncxC8cQ}{yodrtCrKSaWcVI49NE9dCg}{192.168.30.142}{192.168.30.142:2810}], reason [failed to ping, tried [3] times, each with maximum [1m] timeout]
[2019-01-21T05:32:12,263][WARN ][o.e.d.z.ZenDiscovery ] [node-143] master left (reason = failed to ping, tried [3] times, each with maximum [1m] timeout), current nodes: nodes:
{node-143}{gIvN8KNiQsmY2Lsy20Expg}{TUXpFaUnRTKAa5Asdf19gA}{192.168.30.143}{192.168.30.143:2810}, local
{node-142}{aYMFcfn3RvKDVfgncxC8cQ}{yodrtCrKSaWcVI49NE9dCg}{192.168.30.142}{192.168.30.142:2810}, master
{node-141}{3kWs-rFaR_yZVZOpRp1BFA}{nFj2LpDWS6qNIeEUKzpBLA}{192.168.30.141}{192.168.30.141:2810}
[2019-01-21T05:32:25,353][INFO ][o.e.d.z.ZenDiscovery ] [node-143] failed to send join request to master [{node-142}{aYMFcfn3RvKDVfgncxC8cQ}{yodrtCrKSaWcVI49NE9dCg}{192.168.30.142}{192.168.30.142:2810}], reason [RemoteTransportException[[node-142][192.168.30.142:2810][internal:discovery/zen/join]]; nested: ConnectTransportException[[node-143][192.168.30.143:2810] general node connection failure]; nested: IllegalStateException[handshake failed]; nested: TransportException[connection reset]; ]
[repeat the "failed to send join request to master" info]

GET _cluster/state
{
  "cluster_name": "bi-security-online-es",
  "compressed_size_in_bytes": 15084,
  "version": 3429,
  "state_uuid": "YLGyo4zKTGeUua2GApxK4w",
  "master_node": "3kWs-rFaR_yZVZOpRp1BFA",
  "blocks": {},
  "nodes": {
    "aYMFcfn3RvKDVfgncxC8cQ": {
      "name": "node-142",
      "ephemeral_id": "yodrtCrKSaWcVI49NE9dCg",
      "transport_address": "192.168.30.142:2810",
      "attributes": {}
    },
    "3kWs-rFaR_yZVZOpRp1BFA": {
      "name": "node-141",
      "ephemeral_id": "nFj2LpDWS6qNIeEUKzpBLA",
      "transport_address": "192.168.30.141:2810",
      "attributes": {}
    },
    "gIvN8KNiQsmY2Lsy20Expg": {
      "name": "node-143",
      "ephemeral_id": "TUXpFaUnRTKAa5Asdf19gA",
      "transport_address": "192.168.30.143:2810",
      "attributes": {}
    }
  },
  "metadata": {
    "cluster_uuid": "Np0OxNGLRHOToNDLcq0h6Q",
    "templates": {
      "logstash": {
        "index_patterns": [
          "logstash-*"
        ],
        "order": 0,
        "settings": {
          "index": {
            "refresh_interval": "5s"
          }
        },
        "mappings": {
          "default": {
            "dynamic_templates": [
              {
                "message_field": {
                  "path_match": "message",
                  "mapping": {
                    "norms": false,
                    "type": "text"
                  },
                  "match_mapping_type": "string"
                }
              },
              ...
}..}

There are only a few non-default settings:
the GC is G1
indices.fielddata.cache.size: 20%
discovery.zen.fd.ping_timeout: 60s

I have to say, those settings were all changed because of this disconnection problem, but as you can see, it hasn't helped.

As for the use-case: the clients are Python (client version 6.3.0, connecting over the HTTP port) and the Java API (not the RESTful API; it uses the TCP transport port, v6.2.2).
There is one plugin in my cluster, Search Guard. (Judging from the logs, it has nothing to do with this.)

Like your last point, I also think GC is not my problem.
Sometimes I wonder whether it is a bug in 6.2.2... (In fact, I still trust your team; unless you tell me to, I'd rather not upgrade the version.)

The fact that it does not log anything does not mean it could not be the cause. If you can, I would recommend spinning up an identical test cluster and trying to replicate the issue there. Once you have done so, remove Search Guard to see whether the issue persists.
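If it helps, the plugin CLI is a quick way to see exactly what is installed on each node and to remove it on the test cluster (the exact Search Guard plugin name depends on the version you installed, so check the output of list first):

bin/elasticsearch-plugin list
bin/elasticsearch-plugin remove <search-guard-plugin-name>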