Full Logstash queue

Hello there,

Ever since I started running our Elastic Stack, I have had a problem with persistent queues: from time to time they fill up for no apparent reason.

Elasticsearch setup:
3 master nodes
12 data nodes (32 CPU, 32 GB RAM)
Average index rate: 80K/s
Data volume: 20 TB

My Logstash setup:
4 Logstash nodes (8 CPU, 16 GB RAM)

logstash.yml

path.data: /var/lib/logstash
path.logs: /var/log/logstash
queue.drain: true
config.reload.automatic: true
xpack.monitoring.elasticsearch.hosts: [ "XXX" ]
xpack.monitoring.enabled: true
xpack.monitoring.elasticsearch.username: "logstash_system"
xpack.monitoring.elasticsearch.password: "XXXX"

pipelines.yml

- pipeline.id: signpost-pipeline
  path.config: "/etc/logstash/conf.d/signpost-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 1gb
- pipeline.id: php-pipeline
  path.config: "/etc/logstash/conf.d/php-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 10gb
- pipeline.id: json-pipeline
  path.config: "/etc/logstash/conf.d/json-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 20gb

signpost-pipeline.conf

input {
	beats {
		port => 5045
		id => 'signpost-pipeline'
		client_inactivity_timeout => 120
	}
}

filter {}

output
{
	if 'json' in [tags] {
		pipeline { send_to => "json-pipeline" }
	} else if 'php' in [tags] {
		pipeline { send_to => "php-pipeline" }
	}
}

json-pipeline.conf

input
{
	pipeline { address => "json-pipeline" }
}

filter
{
	json {
		source => 'message'
		skip_on_invalid_json => true
	}
}

output
{
	elasticsearch {
		hosts => [ "XXX" ]
		user => "logstash_internal"
		password => "XXX"
		index => '%{[env][service]}'
		manage_template => false
		action => "create"
	}
}

The PHP pipeline is just a legacy pipeline handling a few documents per day and causes no trouble.

And here come the questions:

  1. Is there any way to explore the queues to work out which service is overwhelming them, or to debug them better in general? (See the probe sketched right after this list for the kind of thing I mean.)
  2. Persistent queues obviously cost some throughput because of the disk writes. Is there a throughput difference between a 1 GB queue and a 20 GB queue? Could I improve performance with Kafka in front, for example?
  3. Is the elasticsearch output, the way I use it, single-threaded, and can it be parallelized to utilize more threads/cores?
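
To illustrate what I mean by "exploring the queues" in question 1: a probe against the Logstash node stats API (assuming the API is listening on its default port 9600) returns per-pipeline queue stats, something like:

curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'

That gives me queue depth and size on disk per pipeline, and per-plugin event counters if the plugins have ids, but nothing per service, which is why I am asking whether there is a better way.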

In general, my main problem is that the Logstash queues are almost always full, yet both Logstash and the Elasticsearch data nodes are about 50% idle, I/O is similarly underutilized, there are 0 pending tasks, the thread pool queues are almost empty, and I am just trying to find the bottleneck. Logstash delivers messages to Elasticsearch, but unbearably slowly.
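
To make question 3 concrete: if the elasticsearch output is in fact driven by the pipeline workers (each worker flushing its batch as one bulk request), then the relevant knobs would be the per-pipeline worker and batch settings rather than the output plugin itself. A sketch of what I would try in pipelines.yml (the numbers are placeholders, not values I have tested):

- pipeline.id: json-pipeline
  path.config: "/etc/logstash/conf.d/json-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 20gb
  pipeline.workers: 8        # defaults to the number of CPU cores on the host
  pipeline.batch.size: 500   # events per worker batch, i.e. per bulk request (default 125)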

I appreciate any help I get here.
