Performance tuning

Hi, running on 5.6.2

My topology is as follows.

3 Filebeat agents on 3 servers.
1 Logstash instance.
3 Elastic data nodes.

Looking at the Kibana monitoring, Logstash shows it's receiving about 1,500 events per second and emitting 1,500 events per second, which suggests that Logstash is not the bottleneck.

But no matter how many logs I write, I can't seem to get the 3 Filebeat agents to send more than 1,500 events per second to Logstash.

My application writes 4 log entries, about 2,500 bytes total, per "business" request.

What parameters can I try to tweak on Filebeat?

What do your configs look like?

Here you go:

filebeat.prospectors:
- input_type: log
  paths:
    - /var/lib/mesos/slave/slaves/*/frameworks/*/executors/*/runs/latest/stdout
    - /var/lib/mesos/slave/slaves/*/frameworks/*/executors/*/runs/latest/stderr
  fields:
    source_type: "framework"
  fields_under_root: true
  tail_files: false
  harvester_buffer_size: 32768

- input_type: log
  paths:
    - /var/log/mesos/*.log
    - /var/log/dcos/dcos.log
  fields:
    source_type: "dcos"
  fields_under_root: true
  tail_files: true

output.logstash:
  hosts: ["logstash.marathon.l4lb.thisdcos.directory:5043"]

What's the depth of those paths? i.e. how many subdirectories are there for each of the wildcards?

1 on the first
3 on the second (can also be more, but currently 3)
1 to N on the 3rd (currently 7)

This is based on Mesos/Marathon's handling of containers.

First, try to get an idea of how fast Filebeat can actually read events. For this, run Filebeat with console output only, e.g. in a terminal via

rm -f /path/to/registry/file; filebeat -E output.logstash.enabled=false -E output.console.enabled=true | pv -Warl >/dev/null

Also test with the most recent 6.0 release. We've made some improvements that should increase throughput.

By default the Go runtime creates as many OS worker threads as there are CPUs available. With hyper-threading and resource contention with other processes on the system, I've found this is not always optimal; sometimes reducing the number of OS threads actually helps. Try running with -E max_procs=N to find a good number of OS threads.
If throughput Filebeat->Logstash is much lower than Filebeat->console, the loss in event rate is due to backpressure from the network and/or Logstash.
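For example, the same console benchmark with a reduced thread count might look like this (2 is just an illustrative starting point; adjust N to your machine):

rm -f /path/to/registry/file; filebeat -E max_procs=2 -E output.logstash.enabled=false -E output.console.enabled=true | pv -Warl >/dev/null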

Unfortunately I cannot try 6.0. I'm running Elastic 5.6.2 on DC/OS.

On a default install (CentOS), where is the registry, and what happens if I delete it (I'm on a production system)?

Ok just to be sure I did the right thing...

/usr/share/filebeat/bin/filebeat -c /etc/filebeat/filebeat.yml -E output.logstash.enabled=false -E output.console.enabled=true | pv -Warl >/dev/null

So what this means is: use my config, but override the Logstash output?

PV results...

When idle:
[0.00 /s] [7.89 /s]

When perf testing:
[8.15k/s] [2.98k/s]
[0.00 /s] [4.89k/s]
[8.79k/s] [3.50k/s]
[11.4k/s] [1.67k/s]
[12.4k/s] [2.93k/s]
[10.7k/s] [3.19k/s]

Also network is 1G.

Also, CPU and network utilization on Logstash is low. There are no events queued.

These rates vary too much because you are benchmarking on live logs. That is, the results vary because the rate at which log lines are being written varies quite a lot.

The registry file keeps track of the last lines processed. For benchmarking, you therefore want a copy of your logs and a separate registry file, which you can safely delete between runs. Then you will also get more reproducible rates.
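A minimal sketch of such a setup (the /tmp/bench paths and the bench.yml file name are just placeholders for illustration):

# bench.yml -- reads a copy of a real log and uses its own registry file
filebeat.registry_file: /tmp/bench/registry
filebeat.prospectors:
- input_type: log
  paths:
    - /tmp/bench/app.log   # copy of one of your production logs
output.console:
  enabled: true

Then delete the registry between runs for repeatable numbers:

rm -f /tmp/bench/registry; filebeat -c bench.yml | pv -Warl >/dev/null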

Actually, it's the same log. I stop and restart Filebeat on the same log on the same machine, no new logs. I've seen it rise to 15K.

Did you delete the registry file between restarts? This high variance in rates looks very fishy to me.
Potential reasons for such high variance I can think of:

  • processing live logs, with the log producer writing logs at a varying rate
  • some resource overload scenario, like:
    • some kind of network share or network-based storage, with the network or storage being under pressure
    • memory pressure + too much swapping
    • not enough CPU available for Beats due to a container or VM environment with far too many active VMs
    • too many processes doing disk I/O, putting the disk and file cache under pressure => run with time and check that I/O ops stay low while benchmarking (see the sketch after this list). Low I/O ops indicate the file under test is mostly cached
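A quick sketch of that last check (assuming iostat from the sysstat package is installed, and bench.yml being the hypothetical benchmark config sketched earlier):

# in one terminal: watch disk utilization while the benchmark runs
iostat -x 5

# in another terminal: time the benchmark run itself
rm -f /tmp/bench/registry; time filebeat -c bench.yml | pv -Warl >/dev/null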

Yes, I deleted the registry on each restart. I'm thinking it could be the block storage on OpenStack.

On prod I will have 10G network and possibly local ephemeral disks.

Hi @steffens @warkolm

I noticed one thing....

The logs for...

- input_type: log
  paths:
    - /var/log/mesos/*.log
    - /var/log/dcos/dcos.log
  fields:
    source_type: "dcos"  

These logs process very fast. I shut off Filebeat for a bit and then started it up again, and Logstash went all the way up to 3K events per second.

So maybe it is the wildcard path?

Huh? Do you have a few thousand files in that directory?

Harvesting/pushing files is done concurrently with the routine that checks for new files. If you have only 1 or 2 CPUs assigned to Filebeat, this might affect throughput.

Hi, sorry for the confusion. If you look at my original config in the posts above, I have 2 harvesters:

1st harvester: for the containers; this has the nested paths that @warkolm asked about.
2nd harvester: DC/OS Mesos logs, a single log file.

I stopped Filebeat and let the logs accumulate a bit; most of them were the DC/OS Mesos logs, as the container applications were not really active at that time. When I restarted Filebeat and looked at the Logstash metrics, it was able to process 3K logs per second.

So I was wondering if the slowdown for the containers is due to the nested paths that @warkolm mentioned. DC/OS and Mesos create new folders every time we launch a new container or restart an existing one.

So far I don't think it's Logstash or Elasticsearch, as the events received and emitted by Logstash are on par with each other and match.

Unless you have a few hundred paths/sub-paths (also accounting for files in sub-directories), I don't think scanning for files should affect throughput much.
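A quick way to sanity-check that is to count how many files the globs actually match, e.g. with the paths from the config posted above:

ls -d /var/lib/mesos/slave/slaves/*/frameworks/*/executors/*/runs/latest/stdout \
      /var/lib/mesos/slave/slaves/*/frameworks/*/executors/*/runs/latest/stderr 2>/dev/null | wc -l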

One problem with the tuning efforts is the inconsistency in read throughput, even when testing with console output. Anyway, you might consider increasing the spool size significantly, so you can buffer more events during read-throughput peaks. On the output side, set loadbalance: true and increase the number of workers (e.g. worker: 3). A batch of events in the spooler is split into N = spool size / bulk_max_size sub-batches, which are then load-balanced onto the configured workers; that is, N should be a multiple of the number of workers. Only after all sub-batches have been processed will the next set of batches be forwarded to the outputs (lock-step load balancing). Filebeat 6.0 has better support for async publishing and load balancing.
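As a rough sketch of those 5.x settings (the numbers are only an example, chosen so that spool_size / bulk_max_size = 6 sub-batches is a multiple of 3 workers; they are not recommended values):

filebeat.spool_size: 12288   # buffer more events during read-throughput peaks

output.logstash:
  hosts: ["logstash.marathon.l4lb.thisdcos.directory:5043"]
  loadbalance: true          # spread sub-batches across all workers
  worker: 3                  # connections per configured host
  bulk_max_size: 2048        # 12288 / 2048 = 6 sub-batches, a multiple of 3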

Ok I will try... Thanks
