Filebeat configuration - i/o errors and connection resets


(Arthur Francis) #1

I am using filebeat to send huge numbers of logs from 2 log servers to logstash machines. My POC basically has a 24 hour window to process all the logs for one 24 hour period, 2 days in arrears.

Filebeat seems to be the bottleneck in my stack as based on activity observations.

Filebeat 5.1.1 -> Logstash 5.1.1 -> ElasticSearch 5.1.2

Initially with one machine for each component, I expanded both logstash and elasticsearch into 4 node clusers and the CPU utilisation on all the machines dropped immediately however the time it took to process all the logs did not improve.

I started to play around with these settings in Filebeat to get more performance:
spool_size:
worker:
bulk_max_size:

Current Filebeat config is:

filebeat:
  config_dir: /etc/filebeat/conf.d
  spool_size: 16384
output:

  logstash:
    # The Logstash hosts
    hosts: ["10.195.36.92:5044", "10.75.10.145:5044", "10.77.149.42:5044", "10.75.26.217:5044"]
    worker: 20
    index: cloud-production
    loadbalance: true
    bulk_max_size: 4096

I actually raised these values to 65k and 16k but that caused a huge number of time outs and connection resets in the Filebeat Log

2017-01-30T14:08:14Z ERR Failed to publish events caused by: read tcp 10.75.142.89:38554->10.195.36.92:5044: i/o timeout
2017-01-30T14:08:14Z INFO Error publishing events (retrying): read tcp 10.75.142.89:38554->10.195.36.92:5044: i/o timeout
2017-01-30T14:08:36Z INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=7316381 libbeat.logstash.published_and_acked_events=18817 
2017-01-30T14:09:02Z INFO Error publishing events (retrying): write tcp 10.75.142.89:54657->10.75.10.145:5044: write: connection reset by peer

I cannot see any errors on the logstash machines or the elasticsearch machines.

I can ping and telnet to all nodes without any issues. the i/o timeout and connection reset by peer seems to happen for all the nodes in the list. The number of errors seems to decrease when I decrease the pool size and the bulk size settings. How do I correlate these settings to resources, is it memory or network related?

(UPDATE)

I have reverted some settings on filebeat but I am still getting a huge number of failures in the logs.

2017-01-30T16:54:43Z INFO Error publishing events (retrying): write tcp 10.75.142.89:40819->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:54:45Z INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_but_not_acked_events=2526 libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=3555
2017-01-30T16:55:13Z ERR Failed to publish events caused by: read tcp 10.75.142.89:56921->10.75.10.145:5044: i/o timeout
2017-01-30T16:55:13Z INFO Error publishing events (retrying): read tcp 10.75.142.89:56921->10.75.10.145:5044: i/o timeout
2017-01-30T16:55:13Z ERR Failed to publish events caused by: write tcp 10.75.142.89:40818->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:13Z INFO Error publishing events (retrying): write tcp 10.75.142.89:40818->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:15Z INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.published_but_not_acked_events=2526 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_errors=1 libbeat.logstash.publish.write_bytes=3563
2017-01-30T16:55:43Z ERR Failed to publish events caused by: read tcp 10.75.142.89:56919->10.75.10.145:5044: i/o timeout
2017-01-30T16:55:43Z INFO Error publishing events (retrying): read tcp 10.75.142.89:56919->10.75.10.145:5044: i/o timeout
2017-01-30T16:55:43Z ERR Failed to publish events caused by: write tcp 10.75.142.89:40816->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:43Z INFO Error publishing events (retrying): write tcp 10.75.142.89:40816->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:43Z ERR Failed to publish events caused by: write tcp 10.75.142.89:40815->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:43Z INFO Error publishing events (retrying): write tcp 10.75.142.89:40815->10.195.36.92:5044: write: connection reset by peer
2017-01-30T16:55:45Z INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.write_errors=2 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=3547 libbeat.logstash.published_but_not_acked_events=3789

I have forced logstash to use ipv4 as suggested in other posts.
-Djava.net.preferIPv4Stack=true

I have also reduced some values, here are both filebeat and logstash settings:

Fileabeat
################### Filebeat Configuration Example #########################

############################# Filebeat ######################################
filebeat:
  config_dir: /etc/filebeat/conf.d
output:

  logstash:
    # The Logstash hosts
    hosts: ["10.195.36.92:5044", "10.75.10.145:5044", "10.77.149.42:5044", "10.75.26.217:5044"]
    worker: 10
    index: cloud-production
    loadbalance: true
    bulk_max_size: 4096

############################# Shipper #########################################

shipper:

logging:

  # To enable logging to files, to_files option has to be set to true
  files:
    # automatically rotated
    rotateeverybytes: 10485760 # = 10MB

Logstash

path.data: /var/lib/logstash
pipeline.workers: 8
pipeline.output.workers: 6
pipeline.batch.size: 4000
path.config: /etc/logstash/conf.d
queue.type: persisted
path.queue: /mnt/logstash/queue
queue.page_capacity: 1024mb
queue.max_events: 0
path.logs: /var/log/logstash

(ruflin) #2

Do you have a load balancer or something like that in the middle? Could you have a look at the Logstash logs and check if you see there something interesting. How many events per second do you expect to have?


(Arthur Francis) #3

I placed logstash and filebeat on debug and here is the output

Logstash

[2017-01-31T14:44:31,737][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0xd6375312, L:/10.75.26.217:5044 ! R:/10.81.2.165:38722] INACTIVE
[2017-01-31T14:44:31,737][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0x2b9696bd, L:/10.75.26.217:5044 ! R:/10.75.142.89:41520] INACTIVE
[2017-01-31T14:44:31,737][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0xd6375312, L:/10.75.26.217:5044 ! R:/10.81.2.165:38722] UNREGISTERED
[2017-01-31T14:44:31,737][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0x2b9696bd, L:/10.75.26.217:5044 ! R:/10.75.142.89:41520] UNREGISTERED
[2017-01-31T14:44:31,737][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0xa570f54b, L:/10.75.26.217:5044 ! R:/10.81.2.165:38714] INACTIVE
[2017-01-31T14:44:31,737][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0xa570f54b, L:/10.75.26.217:5044 ! R:/10.81.2.165:38714] UNREGISTERED
[2017-01-31T14:44:31,737][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0xcf7587af, L:/10.75.26.217:5044 ! R:/10.81.2.165:38718] INACTIVE
[2017-01-31T14:44:31,737][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0xcf7587af, L:/10.75.26.217:5044 ! R:/10.81.2.165:38718] UNREGISTERED
[2017-01-31T14:44:31,737][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0x7d097dc9, L:/10.75.26.217:5044 ! R:/10.81.2.165:38717] INACTIVE
[2017-01-31T14:44:31,738][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0x7d097dc9, L:/10.75.26.217:5044 ! R:/10.81.2.165:38717] UNREGISTERED
[2017-01-31T14:44:31,739][DEBUG][io.netty.handler.logging.LoggingHandler] [id: 0x7d097dc9, L:/10.75.26.217:5044 ! R:/10.81.2.165:38717] CLOSE`

Filebeat

2017/01/31 13:31:33.823894 client.go:194: DBG  handle error: read tcp 10.81.2.165:38221->10.75.26.217:5044: i/o timeout
2017/01/31 13:31:33.823920 sync.go:78: DBG  0 events out of 1920 events sent to logstash. Continue sending
2017/01/31 13:31:33.823934 sync.go:58: DBG  close connection
2017/01/31 13:31:33.823945 client.go:110: DBG  closing
2017/01/31 13:31:33.823995 sync.go:85: ERR Failed to publish events caused by: read tcp 10.81.2.165:38221->10.75.26.217:5044: i/o timeout
2017/01/31 13:31:33.824014 sync_worker.go:167: INFO Error publishing events (retrying): read tcp 10.81.2.165:38221->10.75.26.217:5044: i/o timeout
2017/01/31 13:31:33.824029 context.go:93: DBG  forwards msg with attempts=-1
2017/01/31 13:31:33.824052 context.go:98: DBG  message forwarded
2017/01/31 13:31:33.824070 context.go:136: DBG  events from retries queue
2017/01/31 13:31:33.824083 sync.go:107: DBG  Try to publish 1920 events to logstash with window size 1
2017/01/31 13:31:33.844976 prospector.go:155: DBG  Run prospector
2017/01/31 13:31:33.844991 prospector_log.go:62: DBG  Start next scan

(ruflin) #4

Do you have anything like a load balancer or similar between FB and LS?


(Steffen Siering) #5

which logstash-input-beats plugin version have you installed? Logstash 5.2 has been released recently, it should use a newer plugin version out of the box.


(Arthur Francis) #6

@ruflin No loadbalancers, only direct connection

@steffens Upgraded the entire stack to 5.2 same effect

UPDATE

I am currently investigating the problem with amazon. It could be network related as I ran some perftests between 2 machines and 3 out of 6 sent 0 data

also mtr shows this

mtr -c 50 --show-ips -w -r --tcp -P 22 10.81.2.165
Start: Thu Feb  2 10:09:36 2017
HOST: elk-poc-ls4                                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ip-10-0-0-0.eu-west-1.compute.internal (10.0.0.0)       98.0%    50    0.1   0.1   0.1   0.1   0.0
  2.|-- ip-10-0-0-0.eu-west-1.compute.internal (10.0.0.0)       72.0%    50  999.4 857.0   0.1 3003. 770.9
  3.|-- ip-10-0-0-0.eu-west-1.compute.internal (10.0.0.0)       26.0%    50    0.2   0.1   0.1   0.2   0.0
  4.|-- ip-10-0-0-0.eu-west-1.compute.internal (10.0.0.0)       98.0%    50    0.1   0.1   0.1   0.1   0.0
  5.|-- ip-10-0-0-0.eu-west-1.compute.internal (10.0.0.0)       98.0%    50    0.1   0.1   0.1   0.1   0.0
  6.|-- ip-10-0-0-0.eu-west-1.compute.internal (10.0.0.0)       98.0%    50    0.1   0.1   0.1   0.1   0.0
  7.|-- ip-10-81-2-165.eu-west-1.compute.internal (10.81.2.165)  2.0%    50  282.4 280.8 143.0 286.7  23.6

That does not look too healthy to me :frowning:


(Steffen Siering) #7

Uh, these numbers :frowning:

Kind of reminds me of this (Article from 2015).


(system) #8

This topic was automatically closed after 21 days. New replies are no longer allowed.