Beats Output Metrics

Hi,

I've been trying to tune my filebeat cluster for optimum performance and I have a question about some of the metrics. The docs show this:

But I'm a little confused about .output.events.total: for me it is always increasing, which implies that my output cannot ingest the logs quickly enough. It is tracking exactly the same as .output.events.acked, however, which contradicts the description of this metric. Could someone please clarify? My hunch (hope) is that the description of .output.events.total is actually incorrect and it is indeed the total number of output events recorded.

Thanks,
Dave

Hi @dwjvaughan

First, important question... what version are you on?

And what do you mean by a filebeat cluster?

These fields are the number of events reported over the 30s period:

"output": {
	"events": {
		"acked": 110,
		"active": 0,
		"batches": 3,
		"total": 110
	},

That is not correct... .output.events.total is not the running sum of events since the process started... it is the total number of events over the 30s period.

If both total and acked are increasing together, that would imply that your throughput is increasing... how long does that go on? If you have many files open, filebeat can take a little time to get up to full speed.

Hey @stephenb,

Thanks for the response. Yeah, by "cluster" I mean a horizontally scaled cluster: multiple (20 at the moment) filebeat containers running behind a load balancer, all listening on UDP. The filebeat version is the latest (8.15.3). I use this as a log aggregator to capture logs from our services and then send them on to logstash. This used to be a logstash cluster, but filebeat is a little lighter, and I can utilise the compression_level when sending on.

For a bit more context - I checked this last week - over a 1.5 hour period from a fresh start, the .output.events.total (summed across all containers) was 28.5m. I checked what was ultimately ingested into Elasticsearch for that same period and it matched. I can't imagine the service was processing 28.5m logs in a 30s period; it must have been the total since startup. Possibly this is due to the way I'm capturing the metrics? Here is a section of the config:

http:
  enabled: true
  host: 0.0.0.0
  port: 9600

logging:
  level: warning
  to_stderr: true
  metrics:
    enabled: false

queue.mem:
  events: 32768

filebeat:
  inputs:
    - type: httpjson
      interval: 1m
      request.url: http://localhost:9600/stats
      fields_under_root: true
      processors:
        - decode_json_fields:
            fields: ["message"]
            target: "stats"
            process_array: true
        - add_fields:
            target: ''
            fields:
              logstash.tags:
                - some-tags
    - type: httpjson
      interval: 1m
      request.url: http://localhost:9600/inputs?type=udp
      fields_under_root: true
      processors:
        - decode_json_fields:
            fields: ["message"]
            target: "input"
            process_array: true
        - add_fields:
            target: ''
            fields:
              logstash.tags:
                - other-tags
    - type: udp
      max_message_size: 1MiB
      read_buffer: 100MiB
      host: "0.0.0.0:1414"
      fields_under_root: true
      fields:
        logstash:
          tags:
            - some-tags
    - type: udp
      max_message_size: 1MiB
      read_buffer: 1GiB
      host: "0.0.0.0:1415"
      fields_under_root: true
      fields:
        logstash:
          tags:
            - some-tags
      processors:
        - decode_json_fields:
            fields: ["message"]
            process_array: false
            max_depth: 1
            target: ''
            overwrite_keys: true
            add_error_key: true
    - type: tcp
      host: "0.0.0.0:1416"
      fields_under_root: true
      fields:
        logstash:
          tags:
            - some-tags
      processors:
        - decode_json_fields:
            fields: ["message"]
            process_array: false
            max_depth: 1
            target: ''
            overwrite_keys: true
            add_error_key: true
    - type: http_endpoint
      enabled: true
      listen_address: 0.0.0.0
      listen_port: 8080
      fields_under_root: true
      processors:
        - move_fields:
            from: "json"
            to: ""
            ignore_missing: true
        - drop_fields:
            fields: ["json"]
        - add_fields:
            target: ''
            fields:
              logstash.tags:             
                - some-tags

processors:
  ... a bunch of processors    

output:
  elasticsearch:
    enabled: false
  # console:
  #   pretty: true
  logstash:
    enabled: true
    workers: 8
    bulk_max_size: 8192
    compression_level: 9
    hosts:
      - logstash
    ssl:
      certificate_authorities: 
        - /usr/share/filebeat/ca.beats.pem
      certificate: /usr/share/filebeat/client.beats.pem
      key: /usr/share/filebeat/client.beats.key.decrypted
      verification_mode: none
      supported_protocols: 
        - TLSv1.2
      cipher_suites:
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    pipelining: 0
    loadbalance: false
    ttl: 60s
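
For reference, here is roughly what that first httpjson input is collecting; a quick Python sketch of the same request against the stats endpoint (I'm assuming the libbeat.output.events.* path here, so adjust if the payload looks different):

# Manual equivalent of the httpjson input above: fetch the stats endpoint
# and print the output event counters.
import json
import urllib.request

STATS_URL = "http://localhost:9600/stats"  # same endpoint as in the config

with urllib.request.urlopen(STATS_URL) as resp:
    stats = json.load(resp)

# Assumed payload layout: libbeat.output.events.{total,acked}
events = stats.get("libbeat", {}).get("output", {}).get("events", {})
print("output.events.total:", events.get("total"))
print("output.events.acked:", events.get("acked"))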

I also have another filebeat "cluster" that reads logs from S3, and I see the same behaviour with .output.events.total (capturing the metrics the same way).

Exactly where are you seeing these readings?

From the raw filebeat logs or somewhere else?

Are you looking at monitoring data or the filebeat logs?

Can you show the exact documents please?

I ran my test on 8.14; I will check 8.15, but I do not believe the behavior has changed.

BTW

2,850,000 events / 20 instances / 30s = 4,750 EPS per instance, which is entirely realistic.

Yeah, so these metrics are being ingested into ELK by calling http://localhost:9600/stats:


I've checked again today and the current running total is over 2 billion.

This is how I'm getting the metrics: Configure an HTTP endpoint for metrics | Filebeat Reference [8.15] | Elastic

Ahhh @dwjvaughan

OK, so we are not talking about the same thing...

What you are referring to is not what you linked to in the original post.

What you linked to in the Original Post is the metrics captured in the logs by filebeat, which are per 30s.

When you hit http://localhost:9600/stats, you are pulling stats from the stats API, and those are running totals...

It is not explained really well in the docs, but they are running totals since the process started.
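
If you want a rate out of those running totals, you have to diff two samples yourself; here is a rough sketch of the idea (same stats endpoint you are already polling, and the libbeat.output.events.total path is an assumption, so adjust it to whatever your payload actually contains):

# Rough sketch: turn the running total from /stats into an events-per-second
# rate by sampling twice and dividing the delta by the elapsed time.
import json
import time
import urllib.request

STATS_URL = "http://localhost:9600/stats"
INTERVAL = 30  # seconds between the two samples

def output_events_total():
    with urllib.request.urlopen(STATS_URL) as resp:
        stats = json.load(resp)
    # Assumed field path; adjust if your stats payload differs.
    return stats["libbeat"]["output"]["events"]["total"]

first = output_events_total()
time.sleep(INTERVAL)
second = output_events_total()

print(f"~{(second - first) / INTERVAL:.0f} events/s over the last {INTERVAL}s")

The same delta logic applies to the documents you are already ingesting (e.g. a derivative-style calculation over the running total), rather than reading the counter directly.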

So then, back to what you really want to do...

The more correct way to monitor filebeat is to directly ingest the metrics with internal collection (filebeat monitoring).

Then you will be able to see the rate etc...

Yup, you're completely right. I guess I assumed it was the same metrics data.

I don't think I can use internal collection, as I don't have direct access to Elasticsearch in this scenario, only to Logstash, and it doesn't appear to have a Logstash output available. That being said, I'm happy with the HTTP stats endpoint now that I know the full details. Thanks for your help :slight_smile:

Dave