Connection problems in Elastic Stack cluster

Hello All!
I ran into the following problem with my Elastic Stack: one of the four nodes sometimes disconnects from the cluster, with log entries like these:

[2019-06-26T12:46:08,746][INFO ][o.e.m.j.JvmGcMonitorService] [node-dbtest] [gc][2614458] overhead, spent [256ms] collecting in the last [1s]
[2019-06-26T13:10:26,555][INFO ][o.e.m.j.JvmGcMonitorService] [node-dbtest] [gc][2615899] overhead, spent [266ms] collecting in the last [1s]
[2019-06-26T13:21:14,319][INFO ][o.e.m.j.JvmGcMonitorService] [node-dbtest] [gc][2616536] overhead, spent [267ms] collecting in the last [1s]
[2019-06-26T13:49:06,026][INFO ][o.e.d.z.ZenDiscovery     ] [node-dbtest] master_left [{node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2019-06-26T13:49:06,026][WARN ][o.e.d.z.ZenDiscovery     ] [node-dbtest] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: nodes: 
       {node-drk2}{-gjqiqV1RB6xLfTRiIbecQ}{5BVCqKtSTySjp0VkJ2UP5A}{drk2-test}{10.1.14.43:9300}
       {node-drk3}{tNRUTVLCRdKckGLHltH40w}{mM9J0toATKKFI81T_gBF8A}{drk3-test}{10.1.6.252:9300}
       {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}, master
       {node-dbtest}{mqHEnv05Tka37zx0Pu157Q}{KT5FedtaQzC4LeG1Z5shQg}{DBTEST}{10.20.6.158:9300}, local

[2019-06-26T13:49:10,107][INFO ][o.e.c.s.ClusterApplierService] [node-dbtest] detected_master {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}, reason: apply cluster state (from master [master {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300} committed version [204050]])
[2019-06-26T13:50:15,606][WARN ][o.e.t.TransportService   ] [node-dbtest] Received response for a request that has timed out, sent [64497ms] ago, timed out [34495ms] ago, action [internal:discovery/zen/fd/master_ping], node [{node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}], id [6529865]
[2019-06-26T14:46:07,086][INFO ][o.e.d.z.ZenDiscovery     ] [node-dbtest] master_left [{node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}], reason [transport disconnected]
[2019-06-26T14:46:07,089][WARN ][o.e.d.z.ZenDiscovery     ] [node-dbtest] master left (reason = transport disconnected), current nodes: nodes: 
       {node-drk2}{-gjqiqV1RB6xLfTRiIbecQ}{5BVCqKtSTySjp0VkJ2UP5A}{drk2-test}{10.1.14.43:9300}
       {node-drk3}{tNRUTVLCRdKckGLHltH40w}{mM9J0toATKKFI81T_gBF8A}{drk3-test}{10.1.6.252:9300}
       {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}, master
       {node-dbtest}{mqHEnv05Tka37zx0Pu157Q}{KT5FedtaQzC4LeG1Z5shQg}{DBTEST}{10.20.6.158:9300}, local

[2019-06-26T14:46:11,023][INFO ][o.e.c.s.ClusterApplierService] [node-dbtest] detected_master {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300}, reason: apply cluster state (from master [master {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300} committed version [204747]])
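
For context, the "failed to ping, tried [3] times, each with maximum [30s] timeout" message comes from Zen discovery's fault detection. The timeout and retry count it refers to are the following elasticsearch.yml settings (shown with their defaults, which match what my logs report; this is only a sketch of where the numbers come from, not something I have changed):

discovery.zen.fd.ping_interval: 1s   # how often master and nodes ping each other
discovery.zen.fd.ping_timeout: 30s   # wait per ping - the [30s] in the log above
discovery.zen.fd.ping_retries: 3     # failed pings tolerated - the "tried [3] times"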

Here are the logs from the master node:

[2019-06-26T13:27:17,742][INFO ][o.e.m.j.JvmGcMonitorService] [node-drk3] [gc][9651] overhead, spent [324ms] collecting in the last [1s]
[2019-06-26T13:48:27,787][INFO ][o.e.c.s.ClusterApplierService] [node-drk3] removed {{node-dbtest}{mqHEnv05Tka37zx0Pu157Q}{KT5FedtaQzC4LeG1Z5shQg}{DBTEST}{10.20.6.158:9300},}, reason: apply cluster state (from master [master {node-nsite}{OmAfELRoQESx0Sgr1vlKWA}{HxZ5H0hNTbmWvMjoRaLqaw}{NSite}{10.1.15.205:9300} committed version [204046]])
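
So the master drops node-dbtest from the cluster state and the node rejoins shortly afterwards. Cluster membership and health at that moment can be checked with the cat/cluster APIs; a minimal sketch, assuming the default HTTP port 9200 on one of the nodes:

curl -s 'http://localhost:9200/_cat/nodes?v'            # which nodes the cluster currently sees
curl -s 'http://localhost:9200/_cluster/health?pretty'  # status and number_of_nodes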

In addition, the Filebeat instances also report connection problems, even though I pinged all the servers and everything was fine. Here are the logs from one of the Filebeats (the others log the same):

2019-06-26T14:53:22.169+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 10.1.6.252:61388->10.1.15.205:5041: i/o timeout
2019-06-26T14:53:22.170+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 10.1.6.252:61388->10.1.15.205:5041: i/o timeout
2019-06-26T14:53:22.277+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2019-06-26T14:53:24.084+0300	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2019-06-26T14:53:24.084+0300	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://10.1.15.205:5041))
2019-06-26T14:53:24.085+0300	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://10.1.15.205:5041)) established
2019-06-26T14:53:45.267+0300	INFO	[monitoring]	log/log.go:144	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":36750},"total":{"ticks":4026812,"time":{"ms":328},"value":4026812},"user":{"ticks":3990062,"time":{"ms":328}}},"handles":{"open":232},"info":{"ephemeral_id":"ca9df009-53c9-4b85-bd0d-72e34cff38db","uptime":{"ms":443671553}},"memstats":{"gc_next":24047120,"memory_alloc":12436536,"memory_total":71600323128,"rss":98304}},"filebeat":{"harvester":{"open_files":5,"running":4}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"batches":3,"failed":6144,"total":6144},"read":{"errors":1},"write":{"bytes":346957}},"pipeline":{"clients":1,"events":{"active":4117,"retry":8192}}},"registrar":{"states":{"current":33}}}}}
2019-06-26T14:53:54.178+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 10.1.6.252:61391->10.1.15.205:5041: i/o timeout
2019-06-26T14:53:54.178+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 10.1.6.252:61391->10.1.15.205:5041: i/o timeout
2019-06-26T14:53:54.271+0300	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2019-06-26T14:53:55.660+0300	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2019-06-26T14:53:55.660+0300	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://10.1.15.205:5041))
2019-06-26T14:53:55.665+0300	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://10.1.15.205:5041)) established
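
Pinging only exercises ICMP, so a closer approximation of what Filebeat actually does is a plain TCP check against the beats port (5041 on 10.1.15.205, as in the logs above). A minimal sketch, assuming nc or curl is available on the Filebeat host:

nc -vz 10.1.15.205 5041            # reports whether the TCP port accepts connections
curl -v telnet://10.1.15.205:5041  # alternative without netcat; Ctrl-C once it prints "Connected"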

The Logstash logs don't look informative to me, but I can attach them later if necessary.
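If more detail would help, I can raise the beats input logger to DEBUG through the Logstash logging API and attach that output instead (a sketch, assuming the default API port 9600 on the Logstash host and that logstash.inputs.beats is the logger of interest):

curl -XPUT 'http://localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d '{"logger.logstash.inputs.beats": "DEBUG"}'
curl -XPUT 'http://localhost:9600/_node/logging/reset?pretty'   # put the levels back afterwards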

Could you please tell me how I can improve the situation and get everything working?

It turned out that the only thing I had to do was update the Logstash beats input plugin:
bin/logstash-plugin update logstash-input-beats
After that everything worked.
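
For anyone landing here later: the installed plugin version can be checked before and after the update with the plugin manager (--verbose prints versions; the name argument should filter the list):

bin/logstash-plugin list --verbose logstash-input-beats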
