Connection problems in Elastic Stack cluster

Hello All!
I ran into the following problem with my Elastic Stack: one of the four nodes sometimes disconnects from the cluster, with log entries like these:

[2019-06-26T12:46:08,746][INFO ][o.e.m.j.JvmGcMonitorService] [node-dbtest] [gc][2614458] overhead, spent [256ms] collecting in the last [1s]
[2019-06-26T13:10:26,555][INFO ][o.e.m.j.JvmGcMonitorService] [node-dbtest] [gc][2615899] overhead, spent [266ms] collecting in the last [1s]
[2019-06-26T13:21:14,319][INFO ][o.e.m.j.JvmGcMonitorService] [node-dbtest] [gc][2616536] overhead, spent [267ms] collecting in the last [1s]
[2019-06-26T13:49:06,026][INFO ][o.e.d.z.ZenDiscovery     ] [node-dbtest] master_left [{node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2019-06-26T13:49:06,026][WARN ][o.e.d.z.ZenDiscovery     ] [node-dbtest] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: nodes: 
       {node-drk2}{-gjqiqV1RB6xLfTRiIbecQ}{5BVCqKtSTySjp0VkJ2UP5A}{drk2-test}{10.1.14.43:9300}
       {node-drk3}{tNRUTVLCRdKckGLHltH40w}{mM9J0toATKKFI81T_gBF8A}{drk3-test}{10.1.6.252:9300}
       {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}, master
       {node-dbtest}{mqHEnv05Tka37zx0Pu157Q}{KT5FedtaQzC4LeG1Z5shQg}{DBTEST}{10.20.6.158:9300}, local

[2019-06-26T13:49:10,107][INFO ][o.e.c.s.ClusterApplierService] [node-dbtest] detected_master {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}, reason: apply cluster state (from master [master {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300} committed version [204050]])
[2019-06-26T13:50:15,606][WARN ][o.e.t.TransportService   ] [node-dbtest] Received response for a request that has timed out, sent [64497ms] ago, timed out [34495ms] ago, action [internal:discovery/zen/fd/master_ping], node [{node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}], id [6529865]
[2019-06-26T14:46:07,086][INFO ][o.e.d.z.ZenDiscovery     ] [node-dbtest] master_left [{node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}], reason [transport disconnected]
[2019-06-26T14:46:07,089][WARN ][o.e.d.z.ZenDiscovery     ] [node-dbtest] master left (reason = transport disconnected), current nodes: nodes: 
       {node-drk2}{-gjqiqV1RB6xLfTRiIbecQ}{5BVCqKtSTySjp0VkJ2UP5A}{drk2-test}{10.1.14.43:9300}
       {node-drk3}{tNRUTVLCRdKckGLHltH40w}{mM9J0toATKKFI81T_gBF8A}{drk3-test}{10.1.6.252:9300}
       {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}, master
       {node-dbtest}{mqHEnv05Tka37zx0Pu157Q}{KT5FedtaQzC4LeG1Z5shQg}{DBTEST}{10.20.6.158:9300}, local

[2019-06-26T14:46:11,023][INFO ][o.e.c.s.ClusterApplierService] [node-dbtest] detected_master {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}, reason: apply cluster state (from master [master {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300} committed version [204747]])
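
For context, the "failed to ping, tried [3] times, each with maximum [30s] timeout" message comes from Zen discovery's fault detection. The timeout and retry count it refers to are the following elasticsearch.yml settings (shown with their defaults, which match what my logs report; this is only a sketch of where the numbers come from, not something I have changed):

discovery.zen.fd.ping_interval: 1s   # how often master and nodes ping each other
discovery.zen.fd.ping_timeout: 30s   # wait per ping - the [30s] in the log above
discovery.zen.fd.ping_retries: 3     # failed pings tolerated - the "tried [3] times"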

Here are the logs from the master node:

[2019-06-26T13:27:17,742][INFO ][o.e.m.j.JvmGcMonitorService] [node-drk3] [gc][9651] overhead, spent [324ms] collecting in the last [1s]
[2019-06-26T13:48:27,787][INFO ][o.e.c.s.ClusterApplierService] [node-drk3] removed {{node-dbtest}{mqHEnv05Tka37zx0Pu157Q}{KT5FedtaQzC4LeG1Z5shQg}{DBTEST}{10.20.6.158:9300},}, reason: apply cluster state (from master [master {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300} committed version [204046]])
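
So the master drops node-dbtest from the cluster state and the node rejoins shortly afterwards. Cluster membership and health at that moment can be checked with the cat/cluster APIs; a minimal sketch, assuming the default HTTP port 9200 on one of the nodes:

curl -s 'http://localhost:9200/_cat/nodes?v'            # which nodes the cluster currently sees
curl -s 'http://localhost:9200/_cluster/health?pretty'  # status and number_of_nodes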

In addition, the Filebeat instances also report connection problems, even though I pinged all the servers and everything was fine. Here are the logs from one of the Filebeats (the others log the same):

2019-06-26T14:53:22.169+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 10.1.6.252:61388->10.1.15.205:5041: i/o timeout
2019-06-26T14:53:22.170+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 10.1.6.252:61388->10.1.15.205:5041: i/o timeout
2019-06-26T14:53:22.277+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2019-06-26T14:53:24.084+0300	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2019-06-26T14:53:24.084+0300	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://10.1.15.205:5041))
2019-06-26T14:53:24.085+0300	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://10.1.15.205:5041)) established
2019-06-26T14:53:45.267+0300	INFO	[monitoring]	log/log.go:144	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":36750},"total":{"ticks":4026812,"time":{"ms":328},"value":4026812},"user":{"ticks":3990062,"time":{"ms":328}}},"handles":{"open":232},"info":{"ephemeral_id":"ca9df009-53c9-4b85-bd0d-72e34cff38db","uptime":{"ms":443671553}},"memstats":{"gc_next":24047120,"memory_alloc":12436536,"memory_total":71600323128,"rss":98304}},"filebeat":{"harvester":{"open_files":5,"running":4}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"batches":3,"failed":6144,"total":6144},"read":{"errors":1},"write":{"bytes":346957}},"pipeline":{"clients":1,"events":{"active":4117,"retry":8192}}},"registrar":{"states":{"current":33}}}}}
2019-06-26T14:53:54.178+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 10.1.6.252:61391->10.1.15.205:5041: i/o timeout
2019-06-26T14:53:54.178+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 10.1.6.252:61391->10.1.15.205:5041: i/o timeout
2019-06-26T14:53:54.271+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2019-06-26T14:53:55.660+0300	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2019-06-26T14:53:55.660+0300	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://10.1.15.205:5041))
2019-06-26T14:53:55.665+0300	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://10.1.15.205:5041)) established
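
Pinging only exercises ICMP, so a closer approximation of what Filebeat actually does is a plain TCP check against the beats port (5041 on 10.1.15.205, as in the logs above). A minimal sketch, assuming nc or curl is available on the Filebeat host:

nc -vz 10.1.15.205 5041            # reports whether the TCP port accepts connections
curl -v telnet://10.1.15.205:5041  # alternative without netcat; Ctrl-C once it prints "Connected"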

The Logstash logs don't look informative to me, but I can attach them later if necessary.
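If more detail would help, I can raise the beats input logger to DEBUG through the Logstash logging API and attach that output instead (a sketch, assuming the default API port 9600 on the Logstash host and that logstash.inputs.beats is the logger of interest):

curl -XPUT 'http://localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d '{"logger.logstash.inputs.beats": "DEBUG"}'
curl -XPUT 'http://localhost:9600/_node/logging/reset?pretty'   # put the levels back afterwards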

Could you please tell me how I can improve the situation and get everything working?

It turned out that the only thing I had to do was update the Logstash beats input plugin:
bin/logstash-plugin update logstash-input-beats
After that everything worked.
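
For anyone landing here later: the installed plugin version can be checked before and after the update with the plugin manager (--verbose prints versions; the name argument should filter the list):

bin/logstash-plugin list --verbose logstash-input-beats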
