I am using the Elastic Stack to harvest logs from a Kubernetes cluster. The Elastic Stack is installed in the same cluster and I'm using Filebeat's Kubernetes autodiscover. The stack consists of an Elasticsearch cluster with 5 data-only nodes, 3 dedicated master nodes and 2 client (coordinating) nodes; 2 Logstash instances using multiple pipelines; and 16 Filebeat instances, one on each Kubernetes node (DaemonSet). The Kubernetes cluster generates a fair amount of logs, peaking at around 200,000 log entries per 15 minutes (roughly 222 events/sec).
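For context, the relevant part of my Filebeat configuration is roughly the sketch below. It is simplified: the autodiscover provider, the docker input type and the logstash-1.logstash:5044 endpoint match what shows up in the logs further down, but the template details, the first Logstash hostname and the loadbalance setting are illustrative.

    filebeat.autodiscover:
      providers:
        - type: kubernetes
          templates:
            - config:
                - type: docker
                  containers.ids:
                    - "${data.kubernetes.container.id}"

    output.logstash:
      hosts: ["logstash-0.logstash:5044", "logstash-1.logstash:5044"]  # first host illustrative
      loadbalance: true  # assumption on my part in this sketch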
When I start the Elastic Stack for the first time, logs get ingested fine and quickly. Then, after an hour or so, the rates in all of the components drop: the Filebeat events rate, the Logstash events received and emitted rates, and the Elasticsearch indexing rate. There are errors in the Filebeat logs too:
2019-04-11T16:28:34.768Z ERROR [autodiscover] cfgfile/list.go:96 Error creating runner from config: Can only start an input when all related states are finished: {Id:4200643-66306 Finished:false Fileinfo:0xc000d1bba0 Source:/var/lib/docker/containers/7858e93927986b4630903a04614e646889b3f32ebe38b84550c73247a0da0c38/7858e93927986b4630903a04614e646889b3f32ebe38b84550c73247a0da0c38-json.log Offset:0 Timestamp:2019-04-11 16:26:05.646027461 +0000 UTC m=+1337.750836655 TTL:-1ns Type:docker Meta:map[] FileStateOS:4200643-66306}
and sometimes
2019-04-11T15:32:52.622Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 100.102.132.75:57572->100.110.196.158:5044: i/o timeout
2019-04-11T15:32:52.622Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 100.102.132.75:57572->100.110.196.158:5044: i/o timeout
2019-04-11T15:32:52.636Z ERROR logstash/async.go:256 Failed to publish events caused by: client is not connected
2019-04-11T15:32:53.966Z ERROR [autodiscover] cfgfile/list.go:96 Error creating runner from config: Can only start an input when all related states are finished: {Id:3156173-66306 Finished:false Fileinfo:0xc000c74ea0 Source:/var/lib/docker/containers/2780263029cdda7e1fe19d34c77501207d70559d45501b0c57ed14930ea6c212/2780263029cdda7e1fe19d34c77501207d70559d45501b0c57ed14930ea6c212-json.log Offset:0 Timestamp:2019-04-11 15:32:02.946308974 +0000 UTC m=+544.111177032 TTL:-1ns Type:docker Meta:map[] FileStateOS:3156173-66306}
2019-04-11T15:32:54.083Z ERROR pipeline/output.go:121 Failed to publish events: client is not connected
2019-04-11T15:32:54.083Z INFO pipeline/output.go:95 Connecting to backoff(async(tcp://logstash-1.logstash:5044))
2019-04-11T15:32:54.092Z INFO pipeline/output.go:105 Connection to backoff(async(tcp://logstash-1.logstash:5044)) established
but adjusting output.logstash.bulk_max_size in the Filebeat config appears to resolve these network-related errors. The autodiscover errors continue.
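Concretely, that adjustment is something like the following in filebeat.yml; the value shown is purely illustrative, the point is only where the setting lives.

    output.logstash:
      bulk_max_size: 1024  # illustrative value; the default for the Logstash output is 2048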
There are also a few of these:
2019-04-11T16:26:25.595Z ERROR kubernetes/watcher.go:266 kubernetes: Watching API error context canceled, ignoring event and moving to most recent resource version
but I've seen these in perfectly functioning Elastic Stacks in other Kubernetes clusters too, so these are probably irrelevant.
There are no errors in the Logstash or Elasticsearch logs. The Logstash instances report around 10% CPU usage at the beginning, while ingestion is fine; then that drops too, to around 1-2%.
If I use output.elasticsearch in Filebeat instead, pointing to a load balancer (a Kubernetes service) targeting the Elasticsearch client nodes, none of these errors occur, nor does the drop in ingestion. Logs keep ending up in Elasticsearch fine for days on end.
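That alternative setup is just the stock Elasticsearch output pointed at the service in front of the coordinating nodes, roughly like this (the service name is illustrative):

    output.elasticsearch:
      hosts: ["http://elasticsearch-client:9200"]  # Kubernetes service targeting the client nodes; name illustrative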
All of the components of the Elastic Stack are currently at version 7.0.0, but the problem also occurred with version 6.
Any ideas where I should look to tune the Elastic Stack so it keeps up with my logs? I've tried lots of suggestions brought up in other threads on the Internet (and on this forum), like tweaking pipeline.workers in Logstash, output.logstash.worker in Filebeat, or pipeline.batch.size in Logstash, with no apparent success.
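For completeness, these are the knobs I've been turning, with illustrative values (the actual numbers I tried varied):

    # logstash.yml (or per pipeline in pipelines.yml)
    pipeline.workers: 8        # illustrative; default is the number of CPU cores
    pipeline.batch.size: 250   # illustrative; default is 125

    # filebeat.yml
    output.logstash:
      worker: 2                # illustrative; default is 1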