I am using the latest versions of Filebeat, Logstash, and Elasticsearch on Ubuntu 18.04 machines.
I have:
2 Filebeat VMs, each with 8 CPU cores and 16 GB memory
3 Logstash VMs, each with 24 CPU cores and 64 GB memory (31 GB heap)
3 Elasticsearch VMs, each with 16 CPU cores and 64 GB memory (31 GB heap)
filebeat.yml (same on both machines; they share a physical SSD mounted at '/mnt/data', but each has its own allocated space and partition)
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /mnt/data/*.csv  # This directory contains 502 CSVs with a total of 780 million (780,000,000) lines and 4 columns (field_0;field_1;field_2;field_3), all of which are integers. The directory never changes; it is historical data.
    tail_files: false

queue.mem:
  events: 262144
  flush.min_events: 32768
  flush.timeout: 5s

output.logstash:
  hosts:
    - "ls-01-nathan"
    - "ls-02-nathan"
    - "ls-03-nathan"
  bulk_max_size: 32768
  loadbalance: true
  pipelining: 8
  worker: 40

http.enabled: true

monitoring.elasticsearch:
  hosts: ["es-01-nathan", "es-02-nathan", "es-03-nathan"]
logstash.yml
(I removed all the comments and commented-out default lines)
pipeline.id: ls-01-pipeline
pipeline.workers: 48
pipeline.batch.size: 131072
pipeline.batch.delay: 50
queue.type: memory
log.level: info
xpack.monitoring.enabled: true
xpack.monitoring.elasticsearch.hosts: ["http://es-01-nathan:9200", "http://es-02-nathan:9200", "http://es-03-nathan:9200"]
xpack.monitoring.elasticsearch.sniffing: true
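For what it's worth, per-plugin timings should also be visible through Logstash's node stats API (port 9600 by default), which should show whether the filters or the Elasticsearch output are eating the time; a sketch, run on one of the Logstash hosts:

curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'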
logstash-config.conf
input {
  beats {
    port => 5044
  }
}

filter {
  csv {
    columns => ["field_0", "field_1", "field_2", "field_3"]
    separator => ";"
  }
  mutate {
    remove_field => ["field_0", "message", "host", "@timestamp", "@version"]
    split => { "[log][file][path]" => "/" }
    split => { "[log][file][path][-1]" => "_" }
    copy => { "[log][file][path][-1][0]" => "timestamp" }
    convert => {
      "field_1" => "integer"
      "field_2" => "integer"
      "field_3" => "integer"
    }
  }
  date {
    match => [ "timestamp", "yyyyMMddHHmm" ]
    target => "timestamp"
  }
  mutate {
    remove_field => [
      "log",
      "agent",
      "tags",
      "ecs",
      "input"
    ]
  }
}

output {
  elasticsearch {
    hosts => ["es-01-nathan:9200", "es-02-nathan-4u:9200", "es-03-nathan-4u:9200"]
    index => "index"
  }
}
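To check whether the Elasticsearch output (rather than the csv/mutate filters) is what caps the pipeline, the output could temporarily be swapped for a dots codec, so each event prints a single character on stdout and the rate can be read with pv; a sketch of that test output block (not my real config):

output {
  stdout { codec => dots }
}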
When I change the Filebeat output to output.console and check the throughput (pv -Warl), I get around 85k events/s. When I send the same output to Logstash with load balancing enabled, I only get around 40k events/s.
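For reference, the 85k/s figure comes from piping Filebeat's console output through pv and counting lines per second; roughly this (a sketch, exact paths differ on my machines):

filebeat -e -c /etc/filebeat/filebeat.yml | pv -Warl > /dev/null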
I have tried increasing the workers and bulk_max_size, but 40k/s is the maximum I can get. I need to get it higher.