Filebeat 6.2 throughput and general performance

Hello!

I'm having some difficulty maximizing performance and I'd like some advice.
I'm using Filebeat (v6.2.2) to send output directly to Elasticsearch (v6.2.2), on an Ubuntu Linux machine with 16 cores and 32 GB of RAM. All the configuration files are shown below.
The goal is to store around 42,000 log files from 3 different servers:

  • First Server (the heaviest): 12,200 logs, each containing between 100k and 300k lines, 750 GB in total
  • Second Server: 14,900 logs, each containing between 30k and 150k lines, 250 GB in total
  • Third Server: 14,900 logs, each containing between 30k and 150k lines, 250 GB in total

At the moment, the throughput starts at around 15,000 events per second (eps) but falls to 5,000 eps. As an example, I tried to send 480 logs (around 19 million events) and it took 45 minutes, which works out to roughly 7,000 eps on average.

I've tried to summarize my doubts in the following questions:

  1. Does the number of harvesters started affect performance? Is it better to harvest batches of 500 logs instead of 4,000? The main difference would be the number of lines processed at the same time (19 million events versus 500 million).
  2. What is the rule of thumb for setting the bulk_max_size option in filebeat.yml? After a couple of tries, a smaller value (such as 5000) seems better than a large one, even though I expected a larger value to mean more lines processed at the same time (and with logs of 50k lines, not splitting each log into too many batches looked like a good idea).
  3. As you can see, I have set up 1000 workers: I realize it's an absurd number, but with a low number it was very slow. Since I'm working with a single node in a single cluster, what is the ideal number of workers?
  4. Related to the third question, should I use more than one node? I'm planning to use a single index with 200 shards, and using more than one index is not an option.
  5. Would modifying the queue.mem settings provide any improvement? (See the sketch after this list.)
  6. In your opinion, what is the best way to measure throughput? I have mainly been using the Filebeat logs and direct experience (load a certain number of events and see how long it takes).
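
To make questions 2, 3, 5 and 6 more concrete, this is roughly the kind of tuning I have in mind in filebeat.yml. The values are only placeholders I would test, not settings I know to be right:

queue.mem:
  events: 65536               # size of the in-memory event buffer
  flush.min_events: 5000      # kept in the same ballpark as bulk_max_size
  flush.timeout: 5s           # flush even if min_events has not been reached

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "my_index"
  worker: 4                   # trying a much smaller value than 1000
  bulk_max_size: 5000

logging.metrics.enabled: true # periodic metrics in the Filebeat log,
logging.metrics.period: 30s   # which I use to estimate events per second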

Thank you in advance, any help will be much appreciated!

elasticsearch.yml

bootstrap.memory_lock: true
indices.memory.index_buffer_size: 50%
indices.memory.min_index_buffer_size: 192mb

jvm.options

-Xms15g
-Xmx15g

filebeat.yml

filebeat.prospectors:

- type: log
  enabled: true
  paths:
     - /path/to/logs/*.json
  json.keys_under_root: true

#-------------------------- Elasticsearch output ------------------------------
output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "my_index"
  loadbalance: true
  worker: 1000
  bulk_max_size: 5000
  compression_level: 0

Settings for my_index

  "settings": {
    "number_of_shards": 20,
    "number_of_replicas": 0,
    "codec": "best_compression",
    "refresh_interval": "30s",
    "translog.sync_interval": "1m",
    "translog.flush_threshold_size": "1gb",
    "translog.durability" : "async",
    "merge.scheduler.max_thread_count": "1"}

For testing, I'm creating an index with only 20 shards, but as I said before I'll need many more; would this be a problem for indexing speed? max_thread_count is set to 1 since my index is on spinning disks.

If you have a single Elasticsearch node with spinning disks, disk I/O and iowait may be the bottleneck. Have you measured these while indexing?

Yes, I did, and it never reaches its maximum throughput.
I measured it with dstat.

What does iostat give you?

Using iostat -mdc vbb 3, iowait had some peaks up to 44% (but not many, I'd say around 20), and most of the time it was below 5%.

Harvesters started: 16 (each file has at least 50,000 events)
Total number of events: 3,098,556
Disk usage once uploaded to ES, with codec = best_compression: 850 MB
Time: 4 minutes

From what I can tell, it's not a disk I/O problem. Could this be a problem with heap usage, or with some other configuration in filebeat.yml?

Filebeat config (I changed the number of workers since it doesn't actually change performance)

workers: 3
bulk_max_size: 100000

Elasticsearch config same as before.

Can you show the output of iostat -x?

Why would you set this so large? Increased bulk size does not necessarily mean improved performance, and this seems far beyond what I would expect to be optimal.

Initially I thought it was best to avoid splitting a single log into too many batches: for example, if a log has 300k events, it would be divided into 3 batches rather than 30.
Anyway, it's now set to 10000 (number of workers = 16).
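
For clarity, the output section of filebeat.yml now looks roughly like this (reconstructed from the values above; the rest of the file is unchanged):

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "my_index"
  worker: 16
  bulk_max_size: 10000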

Could you please also address this doubt of mine? For Filebeat, is it better to process 4,000 harvesters at the same time or to start 4 new harvesters every minute (thus taking at least 1,000 minutes to process all the logs)?
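
In case it makes the question clearer: instead of copying the files over in batches, I suppose the harvesters could also be throttled inside Filebeat itself with something like the following (harvester_limit caps the number of harvesters running in parallel for a prospector; the values here are only examples):

filebeat.prospectors:
- type: log
  enabled: true
  paths:
    - /path/to/logs/*.json
  json.keys_under_root: true
  harvester_limit: 16       # read at most 16 files in parallel
  scan_frequency: 60s       # look for new files once a minute (default 10s)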

At the moment I'm sending data from 4 new logs every minute. This is what iostat -x says (the ES node's data is on vdb, and there's some activity on vdd since files are being copied from vdd to vdb):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2,27    0,00    0,29    0,58    0,00   96,85
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdb               0,00     3,18    7,56    5,25   585,49  2118,46   422,26     2,10  163,89   10,96  384,36   3,39   4,35
vdd               0,00     0,03    4,79    0,04   808,54     0,31   334,85     0,02    3,57    3,47   14,93   2,12   1,02
