Hello there,
I'm running our Elastic Stack and I have a problem with persistent queues, which from time to time fill up for an unknown reason.
Elasticsearch setup:
3 master nodes
12 data nodes, 32 CPU, 32 GB RAM
avg index rate 80K/s
Data volume: 20TB
My logstash setup:
4 logstash nodes, 8 CPU, 16 GB RAM
logstash.yml
path.data: /var/lib/logstash
path.logs: /var/log/logstash
queue.drain: true
config.reload.automatic: true
xpack.monitoring.elasticsearch.hosts: [ "XXX" ]
xpack.monitoring.enabled: true
xpack.monitoring.elasticsearch.username: "logstash_system"
xpack.monitoring.elasticsearch.password: "XXXX"
pipelines.yml
- pipeline.id: signpost-pipeline
  path.config: "/etc/logstash/conf.d/signpost-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 1gb
- pipeline.id: php-pipeline
  path.config: "/etc/logstash/conf.d/php-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 10gb
- pipeline.id: json-pipeline
  path.config: "/etc/logstash/conf.d/json-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 20gb
signpost-pipeline.conf
input {
	beats {
		port => 5045
		id => 'signpost-pipeline'
		client_inactivity_timeout => 120
	}
}
filter {}
output
{
	if 'json' in [tags] {
		pipeline { send_to => "json-pipeline" }
	} else if 'php' in [tags] {
		pipeline { send_to => "php-pipeline" }
	}
}
json-pipeline.conf
input
{
	pipeline { address => "json-pipeline" }
}
filter
{
	json {
		source => 'message'
		skip_on_invalid_json => true
	}
}
output
{
	elasticsearch {
		hosts => [ "XXX" ]
		user => "logstash_internal"
		password => "XXX"
		index => '%{[env][service]}'
		manage_template => false
		action => "create"
	}
}
The PHP pipeline is just a legacy pipeline with a few documents per day, so it causes no trouble.
And here are my questions:
- Is there any way to inspect the queues to debug which service is overwhelming them, or for better debugging in general? (What I check today is in the sketch after this list.)
- Persistent queues definitely decrease throughput because of the disk writes. Is there a throughput difference between a 1 GB queue and a 20 GB queue? Can I increase performance with Kafka, for example?
- Is the elasticsearch output, as I use it, single-threaded, and can it be parallelized to utilize more threads/cores?
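Regarding the first question, this is what I already look at: the Logstash node stats API on each node (a minimal sketch, assuming the default API host/port; the exact nesting of the queue fields may differ between versions):

curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'

There I compare queue_size_in_bytes against max_queue_size_in_bytes per pipeline, but that only tells me which pipeline is full, not which service is filling it.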
 
In general, my main problem is that the Logstash queues are almost full while both the Logstash and Elasticsearch data nodes are about 50% idle (same for IO), there are 0 pending tasks and the thread pool queues are almost empty, and I'm just trying to find the bottleneck (the Elasticsearch-side checks I run are below). Logstash delivers messages to Elasticsearch, but it is unbearably slow.
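For completeness, this is roughly how I checked the Elasticsearch side (a sketch, run against one of the data nodes on the default port, auth omitted; "XXX" is the redacted host as above):

curl -s 'http://XXX:9200/_cat/thread_pool?v'
curl -s 'http://XXX:9200/_cat/pending_tasks?v'

Both the thread pool queues and the pending tasks stay close to zero while the Logstash queues keep growing.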
I appreciate any help I get here.