Filebeat extremely slow startup for large number of files

We use Filebeat to tail hundreds of directories, each containing up to a few thousand log files. Whenever we restart Filebeat, it often takes days before it begins sending logs to Logstash again.

I have configured scan_frequency and registry.flush and set clean_removed to help reduce the size of the registry file, but restarts still take several days.

I've been following the suggestions in the following link and am still seeing no improvement.

https://www.elastic.co/guide/en/beats/filebeat/current/configuration-general-options.html#_registry_flush

My Filebeat config:

    filebeat.inputs:

    # Inputs
    - type: log
      enabled: true
      close_inactive: 1s
      scan_frequency: 120
      harvester_limit: 512
      exclude_lines: ["#"]
      clean_removed: true

      paths:
        - "/datadrive/*/data/*/*.txt"
        - "/datadrive/*/*/data/*/*.txt"

    registry.flush: 600s
    logging.level: info

    output.logstash:
      hosts: [<logstash hosts here>]
      loadbalance: true

During this time, it looks like a significant amount of data is being written to disk. The OS disk is sda; the log files are on sdc:

    user@LinuxVM:~$ sudo iostat -dx
    Linux 5.3.0-1032-azure (ST0003) 	07/10/20 	_x86_64_	(16 CPU)

        Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
        loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     1.00     0.00   1.00   0.00
        sda              0.32   58.20      9.33  18598.55     0.10    39.91  22.93  40.68    0.29    7.86   0.42    28.72   319.58   1.06   6.17
        sdb             10.63    1.06   1084.36      8.53     0.22     1.07   2.06  50.22    3.86    2.24   0.01   101.97     8.04   5.04   5.89
        sdc              0.00    0.00      0.17      0.00     0.00     0.00   0.00   0.00    0.16    0.00   0.00    42.34     4.00   0.41   0.00
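
To put the sda write rate in perspective, here is a rough back-of-the-envelope conversion. It assumes the wkB/s figure is representative of the startup window, which may not hold: `iostat` run without an interval reports averages since boot. The function name is only for illustration.

```python
def write_volume_gib_per_day(wkb_per_s: float) -> float:
    """Convert an iostat wkB/s figure into GiB written per day."""
    seconds_per_day = 24 * 60 * 60
    kib_per_gib = 1024 * 1024
    return wkb_per_s * seconds_per_day / kib_per_gib


# wkB/s for sda, copied from the iostat output above.
sda_wkb_per_s = 18598.55

print(
    f"{sda_wkb_per_s / 1024:.1f} MiB/s, "
    f"about {write_volume_gib_per_day(sda_wkb_per_s):.0f} GiB/day"
)
```

Running `iostat -dx 5` instead would show per-interval rates, which is a fairer picture of what happens while Filebeat is starting.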

CPU and RAM do not appear to be the bottleneck. I'm only seeing about 10% CPU usage across all 16 cores of the machine, and only 1GB of RAM in use out of the 32GB available.

In our current setup, we have about 200,000 files in these directories.
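
For anyone wanting to reproduce the file count, this is a minimal sketch of how it can be checked against the two `paths` patterns from the config. It is only an approximation: Python's `glob` skips dot-files by default and may differ in other small ways from the matching Filebeat does, and the helper name is made up for this example.

```python
import glob

# Glob patterns copied from the `paths` section of the config above.
PATTERNS = [
    "/datadrive/*/data/*/*.txt",
    "/datadrive/*/*/data/*/*.txt",
]


def count_matching_files(patterns):
    """Return the number of distinct paths matched by the given glob patterns."""
    matched = set()
    for pattern in patterns:
        # A set avoids double-counting a path matched by more than one pattern.
        matched.update(glob.glob(pattern))
    return len(matched)


if __name__ == "__main__":
    print(count_matching_files(PATTERNS))
```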

The Filebeat log files don't show anything interesting, just the regular "Non-zero metrics" entries at 30s intervals:

    {
        "monitoring": {
            "metrics": {
                "beat": {
                    "cpu": {
                        "system": {
                            "ticks": 9090,
                            "time": {
                                "ms": 1892
                            }
                        },
                        "total": {
                            "ticks": 74380,
                            "time": {
                                "ms": 15239
                            },
                            "value": 74380
                        },
                        "user": {
                            "ticks": 65290,
                            "time": {
                                "ms": 13347
                            }
                        }
                    },
                    "handles": {
                        "limit": {
                            "hard": 4096,
                            "soft": 1024
                        },
                        "open": 8
                    },
                    "info": {
                        "ephemeral_id": "143e23e1-0673-4e0a-b3b2-19e79cbebb85",
                        "uptime": {
                            "ms": 150027
                        }
                    },
                    "memstats": {
                        "gc_next": 208400000,
                        "memory_alloc": 106170432,
                        "memory_total": 22410059200
                    },
                    "runtime": {
                        "goroutines": 26
                    }
                },
                "filebeat": {
                    "events": {
                        "added": 99,
                        "done": 99
                    },
                    "harvester": {
                        "open_files": 0,
                        "running": 0
                    }
                },
                "libbeat": {
                    "config": {
                        "module": {
                            "running": 0
                        }
                    },
                    "pipeline": {
                        "clients": 1,
                        "events": {
                            "active": 1,
                            "filtered": 99,
                            "total": 99
                        }
                    }
                },
                "registrar": {
                    "states": {
                        "current": 78143,
                        "update": 99
                    },
                    "writes": {
                        "success": 99,
                        "total": 99
                    }
                },
                "system": {
                    "load": {
                        "1": 1.85,
                        "15": 0.85,
                        "5": 0.85,
                        "norm": {
                            "1": 0.1156,
                            "15": 0.0531,
                            "5": 0.0531
                        }
                    }
                }
            }
        }
    }
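
The fields that seem most relevant in that snapshot can be pulled out with a short script. This is only a sketch: the `summarize` helper and the choice of fields are my own, and `sample` is a trimmed copy of the snapshot above rather than the full log entry.

```python
import json


def summarize(snapshot: dict) -> dict:
    """Pick out the registry and harvester figures from a metrics snapshot."""
    metrics = snapshot["monitoring"]["metrics"]
    return {
        "registry_states": metrics["registrar"]["states"]["current"],
        "harvesters_running": metrics["filebeat"]["harvester"]["running"],
        "open_files": metrics["filebeat"]["harvester"]["open_files"],
        "events_filtered": metrics["libbeat"]["pipeline"]["events"]["filtered"],
        "uptime_s": metrics["beat"]["info"]["uptime"]["ms"] / 1000,
    }


# Trimmed copy of the snapshot above, keeping only the fields used here.
sample = json.loads("""
{
  "monitoring": {
    "metrics": {
      "beat": {"info": {"uptime": {"ms": 150027}}},
      "filebeat": {"harvester": {"open_files": 0, "running": 0}},
      "libbeat": {"pipeline": {"events": {"active": 1, "filtered": 99, "total": 99}}},
      "registrar": {"states": {"current": 78143, "update": 99}}
    }
  }
}
""")

print(summarize(sample))
```

In this snapshot, roughly 150 seconds after startup, the registry holds 78143 states while no harvesters are running and no files are open.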

Any help would be much appreciated as we've been tackling this for several weeks.
