Count of record drops on the hour

Hello,

I am running a single-node ELK cluster, including Filebeat. My main pipeline is Filebeat > Logstash > Elasticsearch, and I'm not running any log mutations or alterations in Logstash. I am seeing periodic drops in the count of records, approximately every 1 hour 15 minutes. My server has high throughput, but it has plenty of memory and CPU power to cope. Are there any processes this could be linked to?

Any help or pointers diagnosing this?

Thanks.

Welcome to the forum, @HuwT. I hope I or others can help you! Your thread is currently also the most interesting puzzle.

Lots of CPU and RAM are great, but often it's all about the IO. Please add your disk setup too.

But generally, a bit more detailed info would help. And a few more numbers. Does a "single node ELK cluster including filebeat" mean everything is running on the same node: Filebeat, Logstash, Elasticsearch and Kibana? All separate processes, or via Docker or similar, or virtual machines, or ...? What is generating your docs, where is it, and how are they (supposed to be) getting to Elasticsearch?

Your filebeat and logstash configs might be helpful. Obfuscate as needed.

The flatlining at zero on your "Count of records" is indeed strange. It's based on @timestamp. Have you validated there really are docs with timestamps in those flatlining periods? If so, and they are not reaching Elasticsearch, I would be surprised if one or more of Filebeat/Logstash/Elasticsearch were not telling you so somewhere in their logs. That Elasticsearch has such low CPU load for similar periods suggests the problem is before data reaches ES; you can confirm by looking at the ingest rate (another screen in Kibana). The Logstash heap flatlining is also maybe an indication that it's not doing much for those periods. So I'd start troubleshooting at Filebeat.
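To validate whether the source really produced events during the flat periods, something like the following could help. It is only a minimal sketch: it assumes you have already parsed the event timestamps from the sending side (for example from a device's own log export) into Python datetime objects, and the function name and 5-minute bucket size are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

def find_empty_buckets(timestamps, start, end, bucket=timedelta(minutes=5)):
    """Return the start time of every bucket in [start, end) with no events."""
    n_buckets = int((end - start) / bucket)
    counts = [0] * n_buckets
    for ts in timestamps:
        if start <= ts < end:
            idx = int((ts - start) / bucket)
            if idx < n_buckets:  # ignore a partial trailing bucket
                counts[idx] += 1
    return [start + i * bucket for i, c in enumerate(counts) if c == 0]

# Example with made-up timestamps: events at minutes 1, 2, 7 and 21.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [t0 + timedelta(minutes=m) for m in (1, 2, 7, 21)]
gaps = find_empty_buckets(events, t0, t0 + timedelta(minutes=25))
```

If the source-side timestamps have no empty buckets where Kibana shows a flatline, the events existed and were lost somewhere on the way in.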

Wild 1-in-100 guess: someone once posted a similar report and it boiled down to log rotation. The log files were being rotated in a way the rest of the solution was not configured to support. The every-75-minutes thing does suggest something local.
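To check how regular the cadence really is before hunting for a matching cron job or rotation schedule, a rough sketch like this might do. The gap start times are assumed to be read off the Kibana chart by hand, and the function names and 5-minute tolerance are hypothetical.

```python
from datetime import datetime

def gap_intervals_minutes(gap_starts):
    """Minutes between consecutive gap start times."""
    ordered = sorted(gap_starts)
    return [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]

def looks_periodic(intervals, expected_minutes, tolerance_minutes=5):
    """True if every interval is within tolerance of the expected period."""
    return bool(intervals) and all(
        abs(i - expected_minutes) <= tolerance_minutes for i in intervals
    )

# Example with made-up gap start times.
starts = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 1, 15),
    datetime(2024, 1, 1, 2, 31),
]
intervals = gap_intervals_minutes(starts)
periodic = looks_periodic(intervals, expected_minutes=75)
```

A tight, fixed period points towards a scheduled job on the box; a drifting one points more towards something load-dependent.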

Hello Kevin,

Thanks for taking the time to reply to my thread!

I will address each of your questions one by one:

Elasticsearch is currently writing to an 80TB hard disk setup, while the system OS and services (Logstash and Filebeat) are running on SSD.

My ELK is a single virtualised system on Ubuntu Linux. Elasticsearch, Kibana, Logstash and Filebeat are all running as services on the same node. The system is collecting logs from throughout our environment, including many network devices like firewalls and proxies, which are all sending their logs to the ELK node over the network via syslog.

Filebeat.yml

filebeat.inputs:

- type: filestream
  id: my-filestream-id
  enabled: false
  paths:
    - /var/log/*.log

- type: syslog
  format: auto
  protocol.udp:
    host: "0.0.0.0:###"

filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

setup.template.settings:
  index.number_of_shards: 1

setup.kibana:
  host: "localhost:5601"
  username: ###
  password: ###

output.logstash:
  hosts: ["localhost:5044"]
  worker: 4
  bulk_max_size: 4096

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~

monitoring.enabled: true
monitoring.elasticsearch.hosts: ["http://localhost:9200"]
monitoring.elasticsearch.username: "xxx"
monitoring.elasticsearch.password: "xxx"

setup.ilm.overwrite: true
setup.template.overwrite: true

http.enabled: true
http.host: "localhost"
http.port: 5044

queue.mem:
  events: 32768
  flush.min_events: 4096
  flush.timeout: 250ms


panw filebeat module:

- module: panw
  panos:
    enabled: true

    # Set which input to use between syslog (default) or file.
    #var.input:
    var.input: syslog
    var.syslog_host: x.x.x.x
    var.syslog_port: xxxx

    #var.paths:
    var.paths: {"/var/log/panw.log"}

Logstash:

input {
  beats {
    port => 5044
    id => "filebeat_input"
  }
}

output {
  # Route SNMP events to a dedicated index (and optional ingest pipeline)
  if [event][module] == "snmp" {
    stdout { codec => rubydebug }
    elasticsearch {
      hosts => ["http://localhost:9200"]
      user => "xxx"
      password => "xxx"
      manage_template => false
      index => "snmp-traps-%{+YYYY.MM.dd}"
      # optional: if you created an ingest pipeline for SNMP parsing
      # pipeline => "snmp-ingest-pipeline"
    }

  # Existing behavior for events that have a beat pipeline
  } else if [@metadata][pipeline] {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      user => "xxx"
      password => "xxx"
      manage_template => false
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
      pipeline => "%{[@metadata][pipeline]}"
    }

  # Fallback for other events
  } else {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      user => "xxx"
      password => "xxx"
      manage_template => false
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    }
  }
}

output {
  syslog {
    host => "x.x.x.x"
    port => xxx
    protocol => tcp
  }
}

This would certainly point to something upstream of Elasticsearch.

I believe the issue can be put down to Filebeat, as the Winlogbeat logs I am also collecting from the system are not having the same issue, and they go straight to Logstash.

I will continue looking into Filebeat; however, the log files in /var/log/filebeat are rather light.
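Since http.enabled is already set in filebeat.yml, the Filebeat stats endpoint (GET /stats on the host and port configured under http.*) may say more than the log files do. The sketch below only compares two saved snapshots of that JSON; the counter paths are an assumption on my part and may differ between Filebeat versions, and the helper name is made up.

```python
def counter_deltas(before, after, paths):
    """Difference of selected dotted-path counters between two stats snapshots."""
    def lookup(doc, path):
        node = doc
        for key in path.split("."):
            node = node[key]
        return node
    return {p: lookup(after, p) - lookup(before, p) for p in paths}

# Assumed counter names; check them against your own /stats output.
PATHS = [
    "libbeat.pipeline.events.published",
    "libbeat.output.events.acked",
    "libbeat.output.events.failed",
]

# Example with made-up snapshots taken some time apart.
snap_a = {"libbeat": {"pipeline": {"events": {"published": 1000}},
                      "output": {"events": {"acked": 990, "failed": 0}}}}
snap_b = {"libbeat": {"pipeline": {"events": {"published": 1000}},
                      "output": {"events": {"acked": 1000, "failed": 0}}}}
deltas = counter_deltas(snap_a, snap_b, PATHS)
```

If the published counter stops moving during a flatline, nothing is arriving at Filebeat; if published keeps growing but acked does not, the hold-up is between Filebeat and Logstash.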

If you mean some spinning disks then that’s not ideal. I don’t think I’ll ever use a spinning disk again. I’d check your IO performance with tools like iostat.
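If iostat is not installed, a rough equivalent of its %util figure can be worked out from two readings of /proc/diskstats on Linux. This is only a sketch: the function names are made up, and it assumes the standard layout where the 10th statistics field is the milliseconds the device spent doing I/O.

```python
def parse_io_ticks(diskstats_text):
    """Map device name -> milliseconds spent doing I/O (10th stat field)."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 14:  # major, minor, name + at least 11 stat fields
            ticks[fields[2]] = int(fields[12])
    return ticks

def utilisation_percent(before_text, after_text, interval_seconds):
    """Approximate per-device busy percentage between two snapshots."""
    before = parse_io_ticks(before_text)
    after = parse_io_ticks(after_text)
    interval_ms = interval_seconds * 1000
    return {
        dev: 100.0 * (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }

# Example with made-up snapshots taken 10 seconds apart.
snap1 = "   8       0 sda 1000 0 8000 500 2000 0 16000 900 0 1200 1400"
snap2 = "   8       0 sda 1500 0 9000 700 4000 0 32000 4100 3 6200 9800"
util = utilisation_percent(snap1, snap2, interval_seconds=10)
```

In practice the two snapshots would be the contents of /proc/diskstats read a known number of seconds apart. A device sitting near 100% during the flatline periods would make the disks the prime suspect.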

That’s also not ideal in production.

Is it the filestream data (from /var/log/…) or the syslog data that is flatlining? Or both?