I am using Filebeat to ship a huge volume of logs from 2 log servers to Logstash machines. My POC has a 24-hour window to process all the logs for one 24-hour period, 2 days in arrears.
Based on activity observations, Filebeat appears to be the bottleneck in my stack.
Filebeat 5.1.1 -> Logstash 5.1.1 -> ElasticSearch 5.1.2
I started with one machine for each component, then expanded both Logstash and Elasticsearch into 4-node clusters. CPU utilisation on all the machines dropped immediately; however, the time it took to process all the logs did not improve.
I started to play around with these settings in Filebeat to get more performance:
spool_size:
worker:
bulk_max_size:
Current Filebeat config is:
filebeat:
  config_dir: /etc/filebeat/conf.d
  spool_size: 16384

output:
  logstash:
    # The Logstash hosts
    hosts: ["10.195.36.92:5044", "10.75.10.145:5044", "10.77.149.42:5044", "10.75.26.217:5044"]
    worker: 20
    index: cloud-production
    loadbalance: true
    bulk_max_size: 4096
I actually raised these values to 65k and 16k, but that caused a huge number of timeouts and connection resets in the Filebeat log:
2017-01-30T14:08:14Z ERR Failed to publish events caused by: read tcp 10.75.142.89:38554->10.195.36.92:5044: i/o timeout
2017-01-30T14:08:14Z INFO Error publishing events (retrying): read tcp 10.75.142.89:38554->10.195.36.92:5044: i/o timeout
2017-01-30T14:08:36Z INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=7316381 libbeat.logstash.published_and_acked_events=18817
2017-01-30T14:09:02Z INFO Error publishing events (retrying): write tcp 10.75.142.89:54657->10.75.10.145:5044: write: connection reset by peer
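For reference, this is roughly what the raised config looked like (assuming 65k meant spool_size: 65536 and 16k meant bulk_max_size: 16384; worker stayed at 20):

filebeat:
  config_dir: /etc/filebeat/conf.d
  spool_size: 65536

output:
  logstash:
    hosts: ["10.195.36.92:5044", "10.75.10.145:5044", "10.77.149.42:5044", "10.75.26.217:5044"]
    worker: 20
    index: cloud-production
    loadbalance: true
    bulk_max_size: 16384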
I cannot see any errors on the Logstash machines or the Elasticsearch machines.
I can ping and telnet to all nodes without any issues. The i/o timeout and connection reset by peer errors happen against all the nodes in the list. The number of errors seems to decrease when I decrease the spool size and bulk size settings. How do I correlate these settings to resources? Is it memory or network related?
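To show how I am trying to reason about it, here is my own rough estimate of the in-flight payload, using the metrics line above (7316381 bytes written for 18817 acked events is roughly 389 bytes per event on the wire); these are my numbers, not measured limits:

# rough estimate of payload per Filebeat instance
# bytes per event:        7316381 / 18817           ≈ 389
# one batch:              4096 events * 389 bytes   ≈ 1.6 MB
# in flight, 20 workers:  20 * 1.6 MB               ≈ 32 MB
# spooler buffer:         16384 events * 389 bytes  ≈ 6.4 MB
# with bulk_max_size 16384, one batch alone         ≈ 6.4 MB per worker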
(UPDATE)
I have reverted some settings on Filebeat, but I am still getting a huge number of failures in the logs:
2017-01-30T16:54:43Z INFO Error publishing events (retrying): write tcp 10.75.142.89:40819->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:54:45Z INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_but_not_acked_events=2526 libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=3555
2017-01-30T16:55:13Z ERR Failed to publish events caused by: read tcp 10.75.142.89:56921->10.75.10.145:5044: i/o timeout
2017-01-30T16:55:13Z INFO Error publishing events (retrying): read tcp 10.75.142.89:56921->10.75.10.145:5044: i/o timeout
2017-01-30T16:55:13Z ERR Failed to publish events caused by: write tcp 10.75.142.89:40818->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:13Z INFO Error publishing events (retrying): write tcp 10.75.142.89:40818->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:15Z INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.published_but_not_acked_events=2526 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_errors=1 libbeat.logstash.publish.write_bytes=3563
2017-01-30T16:55:43Z ERR Failed to publish events caused by: read tcp 10.75.142.89:56919->10.75.10.145:5044: i/o timeout
2017-01-30T16:55:43Z INFO Error publishing events (retrying): read tcp 10.75.142.89:56919->10.75.10.145:5044: i/o timeout
2017-01-30T16:55:43Z ERR Failed to publish events caused by: write tcp 10.75.142.89:40816->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:43Z INFO Error publishing events (retrying): write tcp 10.75.142.89:40816->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:43Z ERR Failed to publish events caused by: write tcp 10.75.142.89:40815->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:43Z INFO Error publishing events (retrying): write tcp 10.75.142.89:40815->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:45Z INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.write_errors=2 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=3547 libbeat.logstash.published_but_not_acked_events=3789
I have forced Logstash to use IPv4, as suggested in other posts:
-Djava.net.preferIPv4Stack=true
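For completeness, the flag sits in the Logstash JVM options (a sketch of the relevant excerpt, assuming the packaged /etc/logstash/jvm.options; the heap sizes shown are just placeholders):

# /etc/logstash/jvm.options (excerpt)
-Xms2g    # placeholder heap values, not the point of this question
-Xmx2g
-Djava.net.preferIPv4Stack=true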
I have also reduced some values. Here are both the Filebeat and Logstash settings:
Filebeat
################### Filebeat Configuration Example #########################
############################# Filebeat ######################################
filebeat:
  config_dir: /etc/filebeat/conf.d

output:
  logstash:
    # The Logstash hosts
    hosts: ["10.195.36.92:5044", "10.75.10.145:5044", "10.77.149.42:5044", "10.75.26.217:5044"]
    worker: 10
    index: cloud-production
    loadbalance: true
    bulk_max_size: 4096

############################# Shipper #########################################

shipper:

logging:
  # To enable logging to files, to_files option has to be set to true
  files:
    # automatically rotated
    rotateeverybytes: 10485760 # = 10MB
Logstash
path.data: /var/lib/logstash
pipeline.workers: 8
pipeline.output.workers: 6
pipeline.batch.size: 4000
path.config: /etc/logstash/conf.d
queue.type: persisted
path.queue: /mnt/logstash/queue
queue.page_capacity: 1024mb
queue.max_events: 0
path.logs: /var/log/logstash
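And for reference, the Logstash side listens with a plain beats input on 5044 (a minimal sketch of that part of the pipeline config; the filters and the Elasticsearch output are omitted, and the timeout is only shown as a comment because I have not touched it):

input {
  beats {
    port => 5044
    # client_inactivity_timeout => 60   # left at what I believe is the default (60s)
  }
}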