Hello there,
I'm running our Elastic Stack and I have a problem with persistent queues, which from time to time fill up for no apparent reason.
Elasticsearch setup:
- 3 master nodes
- 12 data nodes, 32 CPU / 32 GB RAM each
- average index rate: 80K events/s
- data volume: 20 TB
My Logstash setup:
- 4 Logstash nodes, 8 CPU / 16 GB RAM each
logstash.yml (I noticed while posting that `xpack.monitoring.enabled` was set twice; the duplicate is removed here):

```yaml
path.data: /var/lib/logstash
path.logs: /var/log/logstash
queue.drain: true
config.reload.automatic: true
xpack.monitoring.enabled: true
xpack.monitoring.elasticsearch.hosts: [ "XXX" ]
xpack.monitoring.elasticsearch.username: "logstash_system"
xpack.monitoring.elasticsearch.password: "XXXX"
```
pipelines.yml

```yaml
- pipeline.id: signpost-pipeline
  path.config: "/etc/logstash/conf.d/signpost-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 1gb
- pipeline.id: php-pipeline
  path.config: "/etc/logstash/conf.d/php-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 10gb
- pipeline.id: json-pipeline
  path.config: "/etc/logstash/conf.d/json-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 20gb
```
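For reference, I have left the per-pipeline concurrency settings at their defaults. As I understand the docs, these are the knobs that control how many worker threads and how large a batch each pipeline uses; the values below are illustrative, not what I actually run:

```yaml
- pipeline.id: json-pipeline
  path.config: "/etc/logstash/conf.d/json-pipeline.conf"
  queue.type: persisted
  queue.max_bytes: 20gb
  pipeline.workers: 8       # defaults to the number of CPU cores
  pipeline.batch.size: 500  # events fetched per worker batch; default is 125
```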
signpost-pipeline.conf (the output block was missing its closing brace when I first pasted it; fixed below)

```
input {
  beats {
    port => 5045
    id => 'signpost-pipeline'
    client_inactivity_timeout => 120
  }
}

filter {}

output {
  if 'json' in [tags] {
    pipeline { send_to => "json-pipeline" }
  } else if 'php' in [tags] {
    pipeline { send_to => "php-pipeline" }
  }
}
```
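One thing I noticed while writing this up: events that carry neither the `json` nor the `php` tag match no branch and are silently dropped by this output. If that is ever not intended, a catch-all branch would look something like this (the fallback pipeline address is hypothetical and would need its own entry in pipelines.yml):

```
output {
  if 'json' in [tags] {
    pipeline { send_to => "json-pipeline" }
  } else if 'php' in [tags] {
    pipeline { send_to => "php-pipeline" }
  } else {
    # hypothetical catch-all for untagged events
    pipeline { send_to => "fallback-pipeline" }
  }
}
```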
json-pipeline.conf (the `user` and `password` values need quoting; bare words are not valid config strings)

```
input {
  pipeline { address => "json-pipeline" }
}

filter {
  json {
    source => 'message'
    skip_on_invalid_json => true
  }
}

output {
  elasticsearch {
    hosts => [ "XXX" ]
    user => "logstash_internal"
    password => "XXX"
    index => '%{[env][service]}'
    manage_template => false
    action => "create"
  }
}
```
The PHP pipeline is a legacy pipeline that handles just a few documents per day and causes no trouble.
And here come the questions:
- Is there any way to explore the queues, to debug which service is overwhelming them, or to get better visibility in general?
- Persistent queues certainly decrease throughput because of the disk writes. Is there a throughput difference between a 1 GB queue and a 20 GB queue? Could I increase performance by putting Kafka in front, for example?
- Is the elasticsearch output, as I use it, single-threaded, and can it be multiplied to utilize more threads/cores?
In general, my main problem is that the Logstash queues are almost full while both Logstash and the Elasticsearch data nodes are about 50% idle, IO is equally unremarkable, there are 0 pending tasks, and the thread-pool queues are almost empty, so I'm just trying to find the bottleneck. Logstash does deliver messages to Elasticsearch, but unbearably slowly.
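In case it helps anyone reproduce my debugging so far: this is roughly how I watch queue fill levels via Logstash's node stats API (`GET http://localhost:9600/_node/stats/pipelines`). The payload in the sketch is a trimmed, made-up sample of the response shape, and the exact field names may differ between Logstash versions, so check the docs for yours:

```python
"""Summarize persistent-queue fill levels from Logstash's node stats API.

The stats endpoint returns, per pipeline, a "queue" object that (in the
versions I've checked) includes queue_size_in_bytes and
max_queue_size_in_bytes for persisted queues.
"""
import json
import urllib.request


def queue_usage(stats):
    """Map pipeline id -> fill ratio for pipelines with a persisted queue."""
    usage = {}
    for pid, pipeline in stats.get("pipelines", {}).items():
        q = pipeline.get("queue", {})
        if q.get("type") != "persisted":
            continue
        capacity = q.get("max_queue_size_in_bytes") or 1  # avoid div by zero
        usage[pid] = q.get("queue_size_in_bytes", 0) / capacity
    return usage


if __name__ == "__main__":
    # On a Logstash host, fetch live stats instead of the sample below:
    # stats = json.load(urllib.request.urlopen(
    #     "http://localhost:9600/_node/stats/pipelines"))
    stats = {  # trimmed, made-up sample of the response shape
        "pipelines": {
            "json-pipeline": {"queue": {"type": "persisted",
                                        "queue_size_in_bytes": 19 * 2**30,
                                        "max_queue_size_in_bytes": 20 * 2**30}},
            "signpost-pipeline": {"queue": {"type": "persisted",
                                            "queue_size_in_bytes": 0,
                                            "max_queue_size_in_bytes": 2**30}},
        }
    }
    # Print the fullest queues first
    for pid, ratio in sorted(queue_usage(stats).items(), key=lambda kv: -kv[1]):
        print(f"{pid}: {ratio:.0%} full")
```

This at least tells me which pipeline's queue is filling, even if not which upstream service is responsible.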
I appreciate any help I get here.