The load balancing algorithm followed by filebeat

George_Cherian · December 5, 2017, 2:51pm

Hi

I have two Logstash instances running on identically configured VMs, with Filebeat forwarding data to both of them. However, one VM is indexing 22 MB of data while the other is indexing only 9 MB.

The filebeat configs are as follows:

#=========================== Filebeat prospectors =============================

filebeat.prospectors:

# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.

- input_type: log
  enabled: true
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/prod_logs/XS/basatlxs01/*.txt
    #- c:\programdata\elasticsearch\logs\*

  # Exclude lines. A list of regular expressions to match. It drops the lines that are
  # matching any regular expression from the list.
  #exclude_lines: ["^DBG"]

  # Include lines. A list of regular expressions to match. It exports the lines that are
  # matching any regular expression from the list.
  #include_lines: ["^ERR", "^WARN"]

  # Exclude files. A list of regular expressions to match. Filebeat drops the files that
  # are matching any regular expression from the list. By default, no files are dropped.
  #exclude_files: [".gz$"]

  # Optional additional fields. These fields can be freely picked
  # to add additional information to the crawled log files for filtering
  #fields:
  #  level: debug
  #  review: 1
  ### Multiline options

  # Multiline can be used for log messages spanning multiple lines. This is common
  # for Java stack traces or C line continuations.

  # The regexp pattern that has to be matched. This pattern matches lines starting with three
  # groups of digits separated by single characters (the "." is unescaped, so it matches any character).
  multiline.pattern: ^([0-9]+.[0-9]+.[0-9]+)

  # Defines if the pattern set under pattern should be negated or not. Default is false.
  multiline.negate: true

  # Match can be set to "after" or "before". It defines whether lines should be appended to a pattern
  # that was (not) matched before or after, or as long as a pattern is not matched, based on negate.
  # Note: "after" is the equivalent of "previous" and "before" is the equivalent of "next" in Logstash.
  multiline.match: after

  clean_*: true
  # Files whose modification date is older than clean_inactive have their state removed from the registry.
  # By default this is disabled.
  clean_inactive: 0

  # Immediately removes the state for files which can no longer be found on disk
  clean_removed: true

  filebeat.spool_size: 2048


#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: [ "96.118.58.155:5044", "96.118.51.139:5044"]
  loadbalance: true
  index: filebeat
  bulk_max_size: 1024

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"

On the Logstash side, the filters and configs are the same on both instances.

Please provide some input on how Filebeat load balancing works so that I can design my nodes accordingly.

Thanks
George

steffens · December 5, 2017, 3:45pm

Filebeat uses a shared work queue. Each Logstash endpoint has one worker that reads a batch of events from the work queue and forwards it. If one worker spends longer waiting on its output, it will process fewer events in total. If one output fails (e.g. an I/O error), it returns its events to the retry queue and backs off for some time. In that case the second worker picks up the events from the retry queue and keeps processing from both the retry queue and the shared work queue until the other output worker is available again.

The Logstash output uses slow start (sending an ever-growing subset of events) when sending to the output. With 5.6.4 and 6.0, slow start will be disabled, reducing the number of network round trips and the latency. This might also affect differences in output worker throughput.

Check filebeat logs for I/O errors.

Batch sizes can be another reason. If the batch sizes are < 2048, one output will see bigger batches and the other one smaller ones. You can try increasing the flush timeout in this case (the spooler flushes once its buffer is full).
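
For Filebeat 5.x, raising the spooler flush timeout might look roughly like the sketch below. It assumes the top-level `filebeat.spool_size` and `filebeat.idle_timeout` options from the 5.x reference configuration are the settings meant here. The 10s value is purely illustrative, so check the option names and defaults against the docs for the installed version.

```yaml
# Sketch for Filebeat 5.x only (the spooler was removed in 6.0).
# Assumption: these are top-level settings in filebeat.yml and must not be
# indented under a prospector.
filebeat.spool_size: 2048      # number of events buffered before the spooler flushes
filebeat.idle_timeout: 10s     # illustrative value: flush after this long even if the buffer is not full
```

A longer timeout gives the spooler more time to fill up before flushing, at the cost of higher latency for events on quiet inputs.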

You can enable debug logging for the logstash output in Filebeat (-d 'logstash') to get an idea of the events being published and the batch sizes.

Feel free to open an enhancement request for fairer event distribution when load balancing. With changes to the publisher pipeline, we can consider some different (configurable) load balancing + retry strategies.

George_Cherian · December 6, 2017, 6:58am

I tried to increase the flush timeout in /etc/filebeat/filebeat.yml as shown below:

# to be sent to the outputs.
flush.min_events: 2048

# Maximum duration after which events are available to the outputs,
# if the number of events stored in the queue is < min_flush_events.
flush.timeout: 3s

But when I start the Filebeat service, the log still shows 1s.

2017/12/06 06:54:50.003712 async.go:63: INFO Flush Interval set to: 1s

Please correct me if anything else needs to be changed.

steffens · December 6, 2017, 2:01pm

Which Beats version are you using? The settings you tried exist only in 6.0 and need to be placed in the queue.mem namespace.
The Flush Interval log message looks like 5.x. I think this message can be ignored here, as you need to configure the spooler's flush_interval (see the 5.x docs; the spooler has been removed in 6.0).
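
On 6.0, placing those settings in the queue.mem namespace might look like the sketch below. The `flush.min_events` and `flush.timeout` values are the ones tried above. The `events` capacity of 4096 is an assumed value added for illustration, chosen so that it is not smaller than `flush.min_events`.

```yaml
# Sketch for Filebeat 6.0 only: memory queue settings live under queue.mem.
queue.mem:
  events: 4096             # assumed queue capacity, not taken from the thread
  flush.min_events: 2048   # minimum number of events before a batch is handed to the outputs
  flush.timeout: 3s        # hand over whatever is queued after this long, even below min_events
```

On 5.x these keys are not read, which would be consistent with the log still reporting a 1s flush interval.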
