Heartbeat scheduled jobs already active

Hi,

I'm running Beats version 5.4.2 and have a recurring issue that persists across restarts of both Heartbeat and the host system.

Heartbeat works for a while (there's no obvious pattern to how long) and then stops starting new scheduled jobs.

In the logs I see a message like this for each monitored IP address:

2019-05-16T01:01:00+10:00 INFO Scheduled job 'icmp-ip@x.x.x.x' already active.

Heartbeat is monitoring over 2000 IP addresses, and this setup has worked for over a year with no changes to the system.

Here is my heartbeat.yml config file:

heartbeat.monitors:

- type: icmp
  name: group1
  schedule: '0 1-59/3 * * * * *'
  watch.poll_file:
    path: /data/beats/heartbeat/monitors/group1.json
    interval: 5s

- type: icmp
  name: group2
  schedule: '0 2-59/3 * * * * *'
  watch.poll_file:
    path: /data/beats/heartbeat/monitors/group2.json
    interval: 5s

- type: icmp
  name: group3
  schedule: '0 3-59/3 * * * * *'
  watch.poll_file:
    path: /data/beats/heartbeat/monitors/group3.json
    interval: 5s

#================================ Outputs =====================================

#------------------------------- File output -----------------------------------
output.file:
  # Boolean flag to enable or disable the output module.
  enabled: true

  # Path to the directory where to save the generated files. The option is
  # mandatory.
  path: "/path/to/event/file"

  # Name of the generated files. The default is `heartbeat` and it generates
  # files: `heartbeat`, `heartbeat.1`, `heartbeat.2`, etc.
  filename: heartbeat

  # Maximum size in kilobytes of each file. When this size is reached, and on
  # every heartbeat restart, the files are rotated. The default value is 10240
  # kB.
  rotate_every_kb: 10000

  # Maximum number of files under path. When this number of files is reached,
  # the oldest file is deleted and the rest are shifted from last to first. The
  # default is 7 files.
  number_of_files: 7


#================================ Logging =====================================

#logging.level: debug
logging.to_files: true
logging.files:
  path: /path/to/log/destination
  name: heartbeat.log
  keepfiles: 7
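
Assuming Heartbeat's seven-field cron format, where the first field is seconds and the second is minutes, the schedule '0 1-59/3 * * * * *' fires at second 0 of minutes 1, 4, 7, ..., 58, and the other two groups are offset by one and two minutes. Each group therefore starts a new run roughly every 3 minutes (group3 also has a 6-minute gap from :57 to :03). The hypothetical helper expandRangeStep below expands the minute fields to make the stagger explicit; it is a sketch that handles only the "start-end/step" form, not Heartbeat's cron parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandRangeStep expands a single cron field written as "start-end/step"
// (for example "1-59/3") into the values it matches. It handles only this
// one form; it is an illustration, not a general cron parser.
func expandRangeStep(field string) ([]int, error) {
	rangePart, stepPart, ok := strings.Cut(field, "/")
	if !ok {
		return nil, fmt.Errorf("field %q has no /step", field)
	}
	startStr, endStr, ok := strings.Cut(rangePart, "-")
	if !ok {
		return nil, fmt.Errorf("field %q has no start-end range", field)
	}
	start, err := strconv.Atoi(startStr)
	if err != nil {
		return nil, err
	}
	end, err := strconv.Atoi(endStr)
	if err != nil {
		return nil, err
	}
	step, err := strconv.Atoi(stepPart)
	if err != nil {
		return nil, err
	}
	if step <= 0 || start > end {
		return nil, fmt.Errorf("field %q has an invalid range or step", field)
	}
	var values []int
	for v := start; v <= end; v += step {
		values = append(values, v)
	}
	return values, nil
}

func main() {
	// Minute fields taken from the three monitor schedules in heartbeat.yml.
	groups := []struct{ name, minutes string }{
		{"group1", "1-59/3"},
		{"group2", "2-59/3"},
		{"group3", "3-59/3"},
	}
	for _, g := range groups {
		mins, err := expandRangeStep(g.minutes)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d runs per hour at minutes %v\n", g.name, len(mins), mins)
	}
}
```

Under this reading, any single run of a group has about 3 minutes to finish before that group's next tick arrives.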

It's a fairly simple setup, but a large number of IP addresses are being monitored. I actually had to increase the socket read/write buffer memory limits on the host for all the pings to go through:

net.core.rmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_default = 8388608
net.core.wmem_max = 16777216

The host system is CentOS 7.

Please let me know if any more details are needed.

This can be mitigated by lowering the timeout. Try setting the timeout option on each monitor to a smaller value (maybe 2s); the default is quite long and can tie up resources.

The real fix is to schedule each check relative to when the previous check finished, rather than on a strict schedule. That wouldn't help in your case, though, since you're using cron expressions; it would apply only to the @every syntax.

I've opened two issues to address these concerns.