Logstash delay

Hi,

I have a problem with logs being ingested into Elasticsearch with a delay. The delay is not there from the start but builds up over time. Right now it is 1 day and 10 hours.
We checked the source device sending the logs (hob-fz-0001, 10.18.198.12): CPU is normal and interface bandwidth usage is OK (max observed 80 Mbps in, 105 Mbps out, on a physical 1 Gbps interface).

We checked the physical interfaces of the Logstash VMware hosts: OK, no visible bandwidth issues; max usage seen was 150 Mbps in, 210 Mbps out.

Captures seem to suggest that Logstash is throttling the TCP connection from the source log server.

Our setup is an ELK stack with 2 Logstash machines, 3 Elasticsearch nodes and 1 Kibana. We use virtual machines on 2 different virtual hosts.
The Logstash config is completely default, nothing changed.
These are our pipelines:

- pipeline.id: zscaler
  path.config: "/logstash-7.11.2/config/syslog_zscaler.conf"
- pipeline.id: main
  path.config: "/logstash-7.11.2/config/syslog.conf"

And the config for this specific source is the following:

filter {
	if [type] == "FortiAnalyzer" {
		if [ad.vd] == "GUEST" {
			mutate { add_field => { "[@metadata][drop]" => "drop" } }
		}
	}
}
output {
	# Forti Analyzer
	if [type] == "FortiAnalyzer" {
		if [deviceAction] == "accept" and [ad.vd] != "GUEST" {
			microsoft-logstash-output-azure-loganalytics {
				workspace_id => 
				workspace_key => 
				custom_log_table_name => "Logstash_Fortinet"
				plugin_flush_interval => 5
			}
		} 
		elasticsearch {
			action => "index"
			hosts => ["10.18.193.68:9200","10.18.193.69:9200"]
			index => "fortianalyzer-%{+YYYY.MM.dd}"
			user => elastic
			password => "${ES_PWD}"
		}
	}
}

We use a separate input file:

input {
	tcp{
		port => 1504
		host => "0.0.0.0"
		codec => cef { delimiter => "\r\n" }
	}
	tcp{
		port => 1505
		host => "0.0.0.0"
		codec => cef { delimiter => "\r\n" }
	}
	udp{
		port => 514
		host => "10.38.193.50"
		codec => cef {  }
	}
	tcp{
		port => 514
		host => "0.0.0.0"
		codec => cef { 
			delimiter => "\r\n"
		}
	}
	tcp{
		port => 1516
		host => "0.0.0.0"
		codec => cef { 
			delimiter => 'tz="+0000"'
		}
		type => "FortiAnalyzer"
	}
	tcp{
		port => 1518
		host => "0.0.0.0"
		codec => cef { delimiter => "\n" }
		# delimiter => "\r\n" 
		type => "CheckPoint"
	}
}

In the screenshot you can see the delay.

The problem starts at the Logstash machine, so it does not happen between Logstash and the Elasticsearch nodes. I tried changing the Logstash settings to this:
pipeline.batch.delay: 100
pipeline.batch.size: 250
pipeline.workers: 4
queue.checkpoint.writes: 4096

But this had no effect on the delay. We have 10 different sources and this is the only one with the problem. Any ideas on how to solve it?

Kind regards,
Tom

It is a little confusing. Which pipeline are you having issues with?

You said that you have 10 sources, you shared 2 pipelines in pipelines.yml and 7 inputs in an input file, and you said that one source is having issues, but it is not clear how you are ingesting your data and what is not working as expected.

Is it the data in your fortianalyzer-* index that is delayed? What is the event rate? How many shards does this index have? Does it have replicas? What is the refresh_interval?

It could be that Elasticsearch cannot index the data fast enough, in which case the resulting backpressure will make Logstash throttle the input.
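One way to check the backpressure theory is to look at the per-pipeline counters that the Logstash monitoring API exposes (GET /_node/stats/pipelines, port 9600 by default). The sketch below is only an illustration: the `summarize` helper and the sample payload are hypothetical, the numbers are made up and are not measurements from this thread, and the base URL assumes the API runs on its default address. A large and growing `queue_push_duration_in_millis` means inputs spend a lot of time waiting to hand events to the pipeline, and the per-output time per event hints at which output is the slow one.

```python
import json
import urllib.request


def fetch_pipeline_stats(base_url="http://localhost:9600"):
    """Fetch per-pipeline stats from the Logstash monitoring API (assumed default address)."""
    with urllib.request.urlopen(f"{base_url}/_node/stats/pipelines") as resp:
        return json.load(resp)


def summarize(stats, pipeline_id="main"):
    """Condense one pipeline's counters into a few backpressure indicators."""
    pipeline = stats["pipelines"][pipeline_id]
    events = pipeline["events"]
    per_output = []
    for output in pipeline.get("plugins", {}).get("outputs", []):
        out_events = output.get("events", {})
        count = out_events.get("out", 0)
        millis = out_events.get("duration_in_millis", 0)
        # Average time spent in this output per event it emitted.
        per_output.append((output.get("name", "?"), millis / count if count else 0.0))
    return {
        # Events accepted by inputs but not yet emitted by outputs.
        "in_flight": events["in"] - events["out"],
        # Time inputs spent blocked while pushing events into the queue.
        "queue_push_ms": events.get("queue_push_duration_in_millis", 0),
        # Slowest output first.
        "ms_per_event_by_output": sorted(per_output, key=lambda item: item[1], reverse=True),
    }


# Made-up sample shaped like the API response, for illustration only.
sample = {
    "pipelines": {
        "main": {
            "events": {"in": 1000, "out": 900, "queue_push_duration_in_millis": 50000},
            "plugins": {
                "outputs": [
                    {"name": "elasticsearch", "events": {"out": 900, "duration_in_millis": 1800}},
                    {"name": "microsoft-logstash-output-azure-loganalytics",
                     "events": {"out": 300, "duration_in_millis": 3000}},
                ]
            },
        }
    }
}

summary = summarize(sample)
print(json.dumps(summary, indent=2))
```

Against a live node, `summarize(fetch_pipeline_stats())` would be called twice a few minutes apart and the two results compared, since the counters are cumulative since startup.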

You can try a few things to see if the ingest rate improves. First, increase pipeline.batch.size: you set it to 250, try at least 1000.

Also, if you have replicas, try removing them to see if indexing speeds up, and check index.refresh_interval: the default is 1s, try increasing it to 30s.
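As a rough sketch of the replica and refresh_interval change, the snippet below builds the request for the Elasticsearch update index settings API (PUT <index>/_settings). The helper names are hypothetical, the base URL is an assumption taken from the hosts list in the output config (plain HTTP assumed), and authentication is left out, so the request would still need the elastic credentials before it could be sent.

```python
import json
import urllib.request


def relaxed_index_settings(refresh_interval="30s", replicas=0):
    """Body for PUT <index>/_settings with a longer refresh interval and fewer replicas."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


def build_settings_request(base_url, index_pattern, settings):
    """Build (but do not send) the PUT request for the update index settings API."""
    return urllib.request.Request(
        f"{base_url}/{index_pattern}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Base URL assumed from the output config; authentication is deliberately omitted.
req = build_settings_request(
    "http://10.18.193.68:9200", "fortianalyzer-*", relaxed_index_settings()
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req)  # uncomment to apply once credentials are added
```

This only changes indices that already exist. Because the index name is date-based, each new daily index takes its settings from whatever index template applies, so a lasting change would have to go there. Running with 0 replicas also removes redundancy for the duration of the test.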

Yes, we have 10 sources, coming through a few inputs. Some come from a CEF converter and share the same input, which is why there are fewer inputs than sources. Zscaler has a separate pipeline from the others, so the FortiAnalyzer logs (the ones with the problem) come in through the main pipeline.
So in the main pipeline we have multiple sources coming in, being filtered and then output to Sentinel (some) and ELK (all). All sources except the FortiAnalyzer logs are working correctly.

Yes, it is the fortianalyzer-* index that has the problem. The event rate is around 214,000/min or 13 million per day. The index has 1 primary shard and 1 replica shard. The refresh_interval is 1 second.
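For reference when sizing batch size and workers, both rate figures quoted above can be converted to events per second. This is plain arithmetic with hypothetical helper names, and it assumes nothing beyond 1,440 minutes per day; the printed values make it easy to check the two figures against each other.

```python
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60


def per_second_from_per_minute(events_per_minute):
    """Convert an events-per-minute rate to events per second."""
    return events_per_minute / 60


def per_day_from_per_minute(events_per_minute):
    """Convert an events-per-minute rate to events per day."""
    return events_per_minute * MINUTES_PER_DAY


def per_second_from_per_day(events_per_day):
    """Convert an events-per-day rate to events per second."""
    return events_per_day / SECONDS_PER_DAY


print(f"214,000/min   = {per_second_from_per_minute(214_000):,.0f} events/s")
print(f"214,000/min   = {per_day_from_per_minute(214_000):,} events/day")
print(f"13,000,000/day = {per_second_from_per_day(13_000_000):,.0f} events/s")
```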

I will try your suggestions and report back after I have the information. Thank you!