Performance tuning

Hi, running on 5.6.2

My topology is as follows.

3 Filebeat agents on 3 servers.
1 Logstash instance.
3 Elastic data nodes.

Looking at the Kibana monitoring, Logstash shows it's receiving about 1,500 events per second and emitting 1,500 events per second, which suggests that Logstash is not the bottleneck.

But no matter how many logs I write, I can't seem to get the 3 Filebeat agents to send more than 1,500 events per second to Logstash.

My application writes 4 log entries, about 2,500 bytes total, per "business" request.

What parameters can I try to tweak on Filebeat?

What do your configs look like?

Here you go:

filebeat.prospectors:
- input_type: log
  paths:
    - /var/lib/mesos/slave/slaves/*/frameworks/*/executors/*/runs/latest/stdout
    - /var/lib/mesos/slave/slaves/*/frameworks/*/executors/*/runs/latest/stderr
  fields:
    source_type: "framework"
  fields_under_root: true
  tail_files: false
  harvester_buffer_size: 32768

- input_type: log
  paths:
    - /var/log/mesos/*.log
    - /var/log/dcos/dcos.log
  fields:
    source_type: "dcos"
  fields_under_root: true
  tail_files: true

output.logstash:
  hosts: ["logstash.marathon.l4lb.thisdcos.directory:5043"]

What's the depth of those paths? i.e. how many subdirectories are there for each of the wildcards?

1 on the first
3 on the second (can also be more, but currently 3)
1 to N on the 3rd (currently 7)

This is based on Mesos/Marathon's handling of containers.

First, try to get an idea of how fast Filebeat can actually read events. For this, run Filebeat with console output only, e.g. in a terminal via

rm -f /path/to/registry/file; filebeat -E output.logstash.enabled=false -E output.console.enabled=true | pv -Warl >/dev/null

Also test with the most recent 6.0 release. We've made some improvements that should increase throughput.

By default the Go runtime creates as many OS worker threads as there are CPUs available. With hyper-threading and resource contention with other processes on the system, I've found this is not always optimal; sometimes reducing the number of OS threads actually helps. Try running with -E max_procs=N to find a good number of OS threads.
If throughput Filebeat->Logstash is much lower than Filebeat->console, the loss in event rate is due to backpressure from the network and/or Logstash.
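For example, the same console benchmark with a reduced thread count might look like this (2 is just an illustrative starting point; adjust N to your machine):

rm -f /path/to/registry/file; filebeat -E max_procs=2 -E output.logstash.enabled=false -E output.console.enabled=true | pv -Warl >/dev/null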

Unfortunately I cannot try 6.0. I'm running Elastic 5.6.2 on DC/OS.

On a default install (CentOS), where is the registry, and what happens if I delete it (I'm on a production system)?

Ok just to be sure I did the right thing...

/usr/share/filebeat/bin/filebeat -c /etc/filebeat/filebeat.yml -E output.logstash.enabled=false -E output.console.enabled=true | pv -Warl >/dev/null

So what this means is: use my config, but override the Logstash output?

PV results...

When idle:
[0.00 /s] [7.89 /s]

When perf testing:
[8.15k/s] [2.98k/s]
[0.00 /s] [4.89k/s]
[8.79k/s] [3.50k/s]
[11.4k/s] [1.67k/s]
[12.4k/s] [2.93k/s]
[10.7k/s] [3.19k/s]

Also network is 1G.

Also, CPU and network utilization on Logstash is low. There are no events queued.

These rates vary too much because you are benchmarking on live logs. That is, the results vary because the rate at which log lines are being written varies quite a lot.

The registry file keeps track of the last lines processed. For benchmarking, you therefore want a copy of your logs and a separate registry file, which you can safely delete between runs. Then you will also get more reproducible rates.
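A minimal sketch of such a setup (the /tmp/bench paths and the bench.yml file name are just placeholders for illustration):

# bench.yml -- reads a copy of a real log and uses its own registry file
filebeat.registry_file: /tmp/bench/registry
filebeat.prospectors:
- input_type: log
  paths:
    - /tmp/bench/app.log   # copy of one of your production logs
output.console:
  enabled: true

Then delete the registry between runs for repeatable numbers:

rm -f /tmp/bench/registry; filebeat -c bench.yml | pv -Warl >/dev/null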

Actually, it's the same log. I stop and restart Filebeat on the same log on the same machine, no new logs. I've seen it rise to 15K.

Did you delete the registry file between restarts? This high variance in rates looks very fishy to me.
Potential reasons for such high variance I can think of:

  • processing live logs, with the log producer writing logs at a varying rate
  • some resource overload scenario, like:
    • some kind of network share or network-based storage, with the network or storage being under pressure
    • memory pressure + too much swapping
    • not enough CPU available for Beats due to a container or VM environment with far too many active VMs
    • too many processes doing disk I/O, putting the disk and file cache under pressure => run with time and check that I/O ops stay low while benchmarking (see the sketch after this list). Low I/O ops indicate the file under test is mostly cached
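A quick sketch of that last check (assuming iostat from the sysstat package is installed, and bench.yml being the hypothetical benchmark config sketched earlier):

# in one terminal: watch disk utilization while the benchmark runs
iostat -x 5

# in another terminal: time the benchmark run itself
rm -f /tmp/bench/registry; time filebeat -c bench.yml | pv -Warl >/dev/null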

Yes, I deleted the registry on each restart. I'm thinking it could be the block storage on OpenStack.

On prod I will have 10G network and possibly local ephemeral disks.

Hi @steffens @warkolm

I noticed one thing....

The logs for...

- input_type: log
  paths:
    - /var/log/mesos/*.log
    - /var/log/dcos/dcos.log
  fields:
    source_type: "dcos"  

These logs process very fast. I shut off Filebeat for a bit and then started it up again, and Logstash went all the way up to 3K events per second.

So maybe it is the wildcard path?

Huh? Do you have a few thousand files in that directory?

Harvesting/pushing files is done concurrently with the routine that checks for new files. If you have only 1 or 2 CPUs assigned to Filebeat, this might affect throughput.

Hi, sorry for the confusion. If you look at my original config in the posts above, I have 2 harvesters:

1st harvester: for the containers; this has the nested paths that @warkolm asked about.
2nd harvester: DC/OS Mesos logs, a single log file.

I stopped Filebeat and let the logs accumulate a bit; most of them were the DC/OS Mesos logs, as the container applications were not really active at that time. When I restarted Filebeat and looked at the Logstash metrics, it was able to process 3K logs per second.

So I was wondering if the slowdown for the containers is due to the nested paths that @warkolm mentioned. DC/OS and Mesos create new folders every time we launch a new container or restart an existing one.

So far I don't think it's Logstash or Elasticsearch, as the events received and emitted by Logstash are on par with each other and match.

Unless you have a few hundred paths/sub-paths (also accounting for files in sub-directories), I don't think scanning for files should affect throughput much.
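A quick way to sanity-check that is to count how many files the globs actually match, e.g. with the paths from the config posted above:

ls -d /var/lib/mesos/slave/slaves/*/frameworks/*/executors/*/runs/latest/stdout \
      /var/lib/mesos/slave/slaves/*/frameworks/*/executors/*/runs/latest/stderr 2>/dev/null | wc -l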

One problem with the tuning efforts is the inconsistency in read throughput, even when testing with console output. Anyway, you might consider increasing the spool size significantly, so you can buffer more events during read-throughput peaks. On the output side, set loadbalance: true and increase the number of workers (e.g. worker: 3). A batch of events in the spooler is split into N = spool size / bulk_max_size sub-batches, which are then load-balanced onto the configured workers; that is, N should be a multiple of the number of workers. Only after all sub-batches have been processed will the next set of batches be forwarded to the outputs (lock-step load balancing). Filebeat 6.0 has better support for async publishing and load balancing.
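As a rough sketch of those 5.x settings (the numbers are only an example, chosen so that spool_size / bulk_max_size = 6 sub-batches is a multiple of 3 workers; they are not recommended values):

filebeat.spool_size: 12288   # buffer more events during read-throughput peaks

output.logstash:
  hosts: ["logstash.marathon.l4lb.thisdcos.directory:5043"]
  loadbalance: true          # spread sub-batches across all workers
  worker: 3                  # connections per configured host
  bulk_max_size: 2048        # 12288 / 2048 = 6 sub-batches, a multiple of 3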

Ok I will try... Thanks
