Beats Output Metrics

Hi,

I've been trying to tune my filebeat cluster for optimum performance and I have a question about some of the metrics. The docs show this:

But I'm a little confused about .output.events.total: for me it is always increasing, which implies that my output cannot ingest the logs quickly enough. It is tracking exactly the same as .output.events.acked, however, which contradicts the description of this metric. Could someone please clarify? My hunch (hope) is that the description of .output.events.total is actually incorrect and it is indeed the total number of output events recorded.

Thanks,
Dave

Hi @dwjvaughan

First, important question... what version are you on?

And what do you mean by a filebeat cluster?

These fields are the number of events reported over the 30s period:

"output": {
	"events": {
		"acked": 110,
		"active": 0,
		"batches": 3,
		"total": 110
	},

That is not correct... .output.events.total is not the running sum of events since the process started... it is the total number of events over the 30s period.

If both total and acked are increasing together, that would imply that your throughput is increasing... how long does that go on? If you have many files open, filebeat can take a little time to get up to full speed.

Hey @stephenb,

Thanks for the response. Yeah, by "cluster" I mean a horizontally scaled cluster: multiple (20 at the moment) filebeat containers running behind a load balancer, all listening on UDP. The filebeat version is the latest (8.15.3). I use this as a log aggregator to capture logs from our services and then send them on to logstash. This used to be a logstash cluster, but filebeat is a little lighter, and I can utilise the compression_level when sending on.

For a bit more context - I checked this last week - over a 1.5 hour period from a fresh start, the .output.events.total (summed across all containers) was 28.5m. I checked what was ultimately ingested into Elasticsearch for that same period and it matched. I can't imagine the service was processing 28.5m logs in a 30s period; it must have been the total since startup. Possibly this is due to the way I'm capturing the metrics? Here is a section of the config:

http:
  enabled: true
  host: 0.0.0.0
  port: 9600

logging:
  level: warning
  to_stderr: true
  metrics:
    enabled: false

queue.mem:
  events: 32768

filebeat:
  inputs:
    - type: httpjson
      interval: 1m
      request.url: http://localhost:9600/stats
      fields_under_root: true
      processors:
        - decode_json_fields:
            fields: ["message"]
            target: "stats"
            process_array: true
        - add_fields:
            target: ''
            fields:
              logstash.tags:
                - some-tags
    - type: httpjson
      interval: 1m
      request.url: http://localhost:9600/inputs?type=udp
      fields_under_root: true
      processors:
        - decode_json_fields:
            fields: ["message"]
            target: "input"
            process_array: true
        - add_fields:
            target: ''
            fields:
              logstash.tags:
                - other-tags
    - type: udp
      max_message_size: 1MiB
      read_buffer: 100MiB
      host: "0.0.0.0:1414"
      fields_under_root: true
      fields:
        logstash:
          tags:
            - some-tags
    - type: udp
      max_message_size: 1MiB
      read_buffer: 1GiB
      host: "0.0.0.0:1415"
      fields_under_root: true
      fields:
        logstash:
          tags:
            - some-tags
      processors:
        - decode_json_fields:
            fields: ["message"]
            process_array: false
            max_depth: 1
            target: ''
            overwrite_keys: true
            add_error_key: true
    - type: tcp
      host: "0.0.0.0:1416"
      fields_under_root: true
      fields:
        logstash:
          tags:
            - some-tags
      processors:
        - decode_json_fields:
            fields: ["message"]
            process_array: false
            max_depth: 1
            target: ''
            overwrite_keys: true
            add_error_key: true
    - type: http_endpoint
      enabled: true
      listen_address: 0.0.0.0
      listen_port: 8080
      fields_under_root: true
      processors:
        - move_fields:
            from: "json"
            to: ""
            ignore_missing: true
        - drop_fields:
            fields: ["json"]
        - add_fields:
            target: ''
            fields:
              logstash.tags:             
                - some-tags

processors:
  ... a bunch of processors    

output:
  elasticsearch:
    enabled: false
  # console:
  #   pretty: true
  logstash:
    enabled: true
    workers: 8
    bulk_max_size: 8192
    compression_level: 9
    hosts:
      - logstash
    ssl:
      certificate_authorities: 
        - /usr/share/filebeat/ca.beats.pem
      certificate: /usr/share/filebeat/client.beats.pem
      key: /usr/share/filebeat/client.beats.key.decrypted
      verification_mode: none
      supported_protocols: 
        - TLSv1.2
      cipher_suites:
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    pipelining: 0
    loadbalance: false
    ttl: 60s
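
For reference, here is roughly what that first httpjson input is collecting; a quick Python sketch of the same request against the stats endpoint (I'm assuming the libbeat.output.events.* path here, so adjust if the payload looks different):

# Manual equivalent of the httpjson input above: fetch the stats endpoint
# and print the output event counters.
import json
import urllib.request

STATS_URL = "http://localhost:9600/stats"  # same endpoint as in the config

with urllib.request.urlopen(STATS_URL) as resp:
    stats = json.load(resp)

# Assumed payload layout: libbeat.output.events.{total,acked}
events = stats.get("libbeat", {}).get("output", {}).get("events", {})
print("output.events.total:", events.get("total"))
print("output.events.acked:", events.get("acked"))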

I also have another filebeat "cluster" that reads logs from S3, and I see the same behaviour with .output.events.total (capturing the metrics the same way).

Exactly where are you seeing these readings?

From the raw filebeat logs or somewhere else?

Are you looking at monitoring data or the filebeat logs?

Can you show the exact documents please?

I ran my test on 8.14; I will check 8.15, but I do not believe the behavior has changed.

BTW

2,850,000 events / 20 instances / 30s = 4,750 EPS per instance, which is entirely realistic.

Yeah, so these metrics are being ingested into ELK by calling http://localhost:9600/stats:


I've checked again today and the current running total is over 2 billion.

This is how I'm getting the metrics: Configure an HTTP endpoint for metrics | Filebeat Reference [8.15] | Elastic

Ahhh @dwjvaughan

OK, so we are not talking about the same thing...

What you are referring to is not what you linked to in the original post.

What you linked to in the Original Post is the metrics captured in the logs by filebeat, which are per 30s.

When you hit http://localhost:9600/stats, you are pulling stats from the stats API, and those are running totals...

It is not explained really well in the docs, but they are running totals since the process started.
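
If you want a rate out of those running totals, you have to diff two samples yourself; here is a rough sketch of the idea (same stats endpoint you are already polling, and the libbeat.output.events.total path is an assumption, so adjust it to whatever your payload actually contains):

# Rough sketch: turn the running total from /stats into an events-per-second
# rate by sampling twice and dividing the delta by the elapsed time.
import json
import time
import urllib.request

STATS_URL = "http://localhost:9600/stats"
INTERVAL = 30  # seconds between the two samples

def output_events_total():
    with urllib.request.urlopen(STATS_URL) as resp:
        stats = json.load(resp)
    # Assumed field path; adjust if your stats payload differs.
    return stats["libbeat"]["output"]["events"]["total"]

first = output_events_total()
time.sleep(INTERVAL)
second = output_events_total()

print(f"~{(second - first) / INTERVAL:.0f} events/s over the last {INTERVAL}s")

The same delta logic applies to the documents you are already ingesting (e.g. a derivative-style calculation over the running total), rather than reading the counter directly.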

So then, back to what you really want to do...

The more correct way to monitor filebeat is to directly ingest the metrics with internal collection (filebeat monitoring).

Then you will be able to see the rate etc...

Yup, you're completely right. I guess I assumed it was the same metrics data.

I don't think I can use internal collection, as I don't have direct access to Elasticsearch in this scenario, only to Logstash, and it doesn't appear to have a Logstash output available. That being said, I'm happy with the HTTP stats endpoint now that I know the full details. Thanks for your help :slight_smile:

Dave