Slow ingestion using Filebeat's Logstash output with Kubernetes autodiscover

I am using the Elastic Stack for harvesting the logs of a Kubernetes cluster. The Elastic Stack is installed in the same cluster and I'm using Filebeat's Kubernetes autodiscover. The stack consists of an Elasticsearch cluster with 5 data-only nodes, 3 dedicated master nodes and 2 client (coordinating) nodes; 2 Logstash instances using multiple pipelines; and 16 Filebeat instances, one on each Kubernetes node (a daemonset). The Kubernetes cluster generates a fair amount of logs, peaking at around 200,000 log entries per 15 minutes (roughly 222 events/sec, to use a more conventional rate unit).
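For reference, the Filebeat side is essentially the stock Kubernetes autodiscover setup with the docker input and the Logstash output. A stripped-down sketch of the relevant parts (hostnames simplified; logstash-0/logstash-1 stand in for the two instances):

filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - config:
            - type: docker
              containers.ids:
                - "${data.kubernetes.container.id}"

output.logstash:
  # the second hostname and the loadbalance flag are illustrative stand-ins
  hosts: ["logstash-0.logstash:5044", "logstash-1.logstash:5044"]
  loadbalance: true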

When I start the Elastic Stack for the first time, logs get ingested fine and quickly. Then, after an hour or so, the rates in all of the components drop: the Filebeat events rate, the Logstash events received and emitted rates, and the Elasticsearch indexing rate. There are also errors in the Filebeat logs:


2019-04-11T16:28:34.768Z ERROR [autodiscover] cfgfile/list.go:96 Error creating runner from config: Can only start an input when all related states are finished: {Id:4200643-66306 Finished:false Fileinfo:0xc000d1bba0 Source:/var/lib/docker/containers/7858e93927986b4630903a04614e646889b3f32ebe38b84550c73247a0da0c38/7858e93927986b4630903a04614e646889b3f32ebe38b84550c73247a0da0c38-json.log Offset:0 Timestamp:2019-04-11 16:26:05.646027461 +0000 UTC m=+1337.750836655 TTL:-1ns Type:docker Meta:map[] FileStateOS:4200643-66306}

and sometimes


2019-04-11T15:32:52.622Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 100.102.132.75:57572->100.110.196.158:5044: i/o timeout

2019-04-11T15:32:52.622Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 100.102.132.75:57572->100.110.196.158:5044: i/o timeout

2019-04-11T15:32:52.636Z ERROR logstash/async.go:256 Failed to publish events caused by: client is not connected

2019-04-11T15:32:53.966Z ERROR [autodiscover] cfgfile/list.go:96 Error creating runner from config: Can only start an input when all related states are finished: {Id:3156173-66306 Finished:false Fileinfo:0xc000c74ea0 Source:/var/lib/docker/containers/2780263029cdda7e1fe19d34c77501207d70559d45501b0c57ed14930ea6c212/2780263029cdda7e1fe19d34c77501207d70559d45501b0c57ed14930ea6c212-json.log Offset:0 Timestamp:2019-04-11 15:32:02.946308974 +0000 UTC m=+544.111177032 TTL:-1ns Type:docker Meta:map[] FileStateOS:3156173-66306}

2019-04-11T15:32:54.083Z ERROR pipeline/output.go:121 Failed to publish events: client is not connected

2019-04-11T15:32:54.083Z INFO pipeline/output.go:95 Connecting to backoff(async(tcp://logstash-1.logstash:5044))

2019-04-11T15:32:54.092Z INFO pipeline/output.go:105 Connection to backoff(async(tcp://logstash-1.logstash:5044)) established

but fiddling with output.logstash.bulk_max_size in the Filebeat config appears to solve these network-related errors. The autodiscover errors continue.
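For completeness, the tweak that made the i/o timeouts go away is along these lines; the exact value is something I arrived at by trial and error, so take it as illustrative rather than a recommendation:

output.logstash:
  hosts: ["logstash-0.logstash:5044", "logstash-1.logstash:5044"]
  # illustrative: lowered from the default of 2048 while experimenting
  bulk_max_size: 1024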

There are also a few of these:


2019-04-11T16:26:25.595Z ERROR kubernetes/watcher.go:266 kubernetes: Watching API error context canceled, ignoring event and moving to most recent resource version

but I've seen these in perfectly functioning Elastic Stacks in other Kubernetes clusters too, so these are probably irrelevant.

There are no errors in the Logstash or Elasticsearch logs. The Logstash instances report around 10% CPU usage in the beginning, while ingestion is fine; then that too drops, to around 1-2%.

If I use output.elasticsearch in Filebeat instead, pointing at a load balancer (a Kubernetes service) targeting the Elasticsearch client nodes, none of these errors occur, nor does the drop in ingestion. Logs keep ending up in Elasticsearch just fine for days on end.
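For comparison, that working setup is nothing more than the plain Elasticsearch output pointed at the coordinating-node service (the service name and namespace below are illustrative, not my exact ones):

output.elasticsearch:
  # illustrative service name for the client/coordinating nodes
  hosts: ["http://elasticsearch-client.elastic-system.svc:9200"]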

All of the components of the Elastic Stack are currently at version 7.0.0, but the problem also occurred with version 6.

Any ideas where I should look to tune the Elastic Stack so that it keeps up with my logs? I've tried lots of suggestions brought up in other threads on the Internet (and on this forum), such as tweaking pipeline.workers and pipeline.batch.size in Logstash, or output.logstash.worker in Filebeat, with no apparent success. The kinds of knobs I have been experimenting with are sketched below.
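To be concrete, these are the settings I have been varying, with example values rather than my exact ones:

# logstash.yml (also tried as per-pipeline settings in pipelines.yml)
pipeline.workers: 8        # illustrative; default is the number of CPU cores
pipeline.batch.size: 500   # illustrative; default is 125

# filebeat.yml
output.logstash:
  worker: 4                # illustrative; default is 1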

Update: the bottleneck turned out to be the disk IOPS of the Logstash persistent queue.
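For anyone landing here with the same symptoms, the relevant settings are the persistent queue options in logstash.yml. A minimal sketch (the size and path are illustrative); the point is that whatever disk backs path.queue has to sustain the ingest write load:

# logstash.yml
queue.type: persisted
queue.max_bytes: 4gb                          # illustrative; default is 1024mb
path.queue: /usr/share/logstash/data/queue    # must sit on storage with enough IOPS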
