Hi Leandro
Certainly. I can give more info.
We have an rsyslog receiver that writes incoming logs using this template format:
/log/business_unit_a/%$year%/%$month%/%$day%/%fromhost%/%syslogfacility-text%
.
.
/log/business_unit_z/%$year%/%$month%/%$day%/%fromhost%/%syslogfacility-text%
There are several BUs and a few thousand files under /log, depending on how many facilities each source uses. We have about 2000 log sources; some use only one facility and some use 2 or 3, so you can roughly double that number to get an idea of how many files we have.
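For context, the receiver rules are roughly of this shape, one template/action pair per BU (the template name and routing condition here are illustrative, not our exact config):
# illustrative rsyslog rule; one template/action pair per business unit
template(name="bu_a_path" type="string"
         string="/log/business_unit_a/%$year%/%$month%/%$day%/%fromhost%/%syslogfacility-text%")
if ($fromhost startswith "bu-a-") then {
    action(type="omfile" dynaFile="bu_a_path")
}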
- At midnight UTC all the old files stop being written to, since rsyslog starts writing to a new path based on the new date.
- Once a day I move the previous day's directories to an EFS export for longer-term storage, so we only keep about 24h worth of logs on the "/log" volume.
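The nightly move is essentially a loop like this (the EFS mount point is illustrative, not our real path):
#!/bin/bash
# nightly: move yesterday's directories off the fast /log volume to the EFS archive
Y=$(date -u -d yesterday +%Y); M=$(date -u -d yesterday +%m); D=$(date -u -d yesterday +%d)
for bu in /log/*/; do
    src="${bu}${Y}/${M}/${D}"
    [ -d "$src" ] || continue
    dst="/mnt/efs/logarchive/$(basename "$bu")/${Y}/${M}"   # illustrative mount point
    mkdir -p "$dst"
    mv "$src" "$dst/"
done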
I have a Splunk Universal Forwarder watching the entire "/log" dir and that works perfectly fine. All data gets into our Splunk instance in near real time without any tuning at all.
However, there is another team that needs this data and they wanted us to send it with filebeat. They have a Logstash/Kafka/NiFi custom layer cake over there that I don't have control over; they just gave me the LS endpoint. They do not currently offer ES as an option.
So I tried to do the exact same thing with filebeat that I did with the Splunk forwarder:
filebeat.inputs:
- type: filestream
  id: log
  enabled: true
  paths:
    - /log/**/*
It looked fine until people started noticing that some logs are sometimes way behind. We have already checked all timestamps; they are all UTC. The lag is not always 8h, and I'm told some logs were even further behind. When I check at random they say "oh, right now it's fine", so it's very hard to troubleshoot.
I was told that it's probably my harvesters not keeping up or the file handles not being closed early enough, but those are all wild guesses since I have little visibility into what filebeat is doing. That's why I started looking for those harvester logs, but as I mentioned there aren't any. All references I find online are for the "log" input, which has been replaced by the "filestream" input. The only stats I get are from the http option with curl. Here is a current prod output:
{
  "beat": {
    "cgroup": {
      "cpu": {
        "cfs": {
          "period": {
            "us": 100000
          },
          "quota": {
            "us": 0
          }
        },
        "id": "filebeat.service",
        "stats": {
          "periods": 0,
          "throttled": {
            "ns": 0,
            "periods": 0
          }
        }
      },
      "cpuacct": {
        "id": "filebeat.service",
        "total": {
          "ns": 33460390036502
        }
      },
      "memory": {
        "id": "filebeat.service",
        "mem": {
          "limit": {
            "bytes": 9223372036854771712
          },
          "usage": {
            "bytes": 11593932800
          }
        }
      }
    },
    "cpu": {
      "system": {
        "ticks": 1483730,
        "time": {
          "ms": 1483730
        }
      },
      "total": {
        "ticks": 33460260,
        "time": {
          "ms": 33460260
        },
        "value": 33460260
      },
      "user": {
        "ticks": 31976530,
        "time": {
          "ms": 31976530
        }
      }
    },
    "handles": {
      "limit": {
        "hard": 65536,
        "soft": 65536
      },
      "open": 3839
    },
    "info": {
      "ephemeral_id": "f9028d68-b966-450f-bb30-0d7c46053149",
      "name": "filebeat",
      "uptime": {
        "ms": 18755449
      },
      "version": "8.4.1"
    },
    "memstats": {
      "gc_next": 542585200,
      "memory_alloc": 387455688,
      "memory_sys": 728834216,
      "memory_total": 2691891003952,
      "rss": 767959040
    },
    "runtime": {
      "goroutines": 11523
    }
  },
  "filebeat": {
    "events": {
      "active": 4184,
      "added": 188287064,
      "done": 188282880
    },
    "harvester": {
      "closed": 0,
      "open_files": 0,
      "running": 0,
      "skipped": 0,
      "started": 0
    },
    "input": {
      "log": {
        "files": {
          "renamed": 0,
          "truncated": 0
        }
      },
      "netflow": {
        "flows": 0,
        "packets": {
          "dropped": 0,
          "received": 0
        }
      }
    }
  },
  "libbeat": {
    "config": {
      "module": {
        "running": 0,
        "starts": 0,
        "stops": 0
      },
      "reloads": 1,
      "scans": 1
    },
    "output": {
      "events": {
        "acked": 188280832,
        "active": 4096,
        "batches": 91936,
        "dropped": 0,
        "duplicates": 0,
        "failed": 0,
        "toomany": 0,
        "total": 188284928
      },
      "read": {
        "bytes": 3223114,
        "errors": 0
      },
      "type": "logstash",
      "write": {
        "bytes": 30518006886,
        "errors": 0
      }
    },
    "pipeline": {
      "clients": 3825,
      "events": {
        "active": 4256,
        "dropped": 0,
        "failed": 0,
        "filtered": 0,
        "published": 188284948,
        "retry": 2048,
        "total": 188285088
      },
      "queue": {
        "acked": 188280832,
        "max_events": 4096
      }
    }
  },
  "registrar": {
    "states": {
      "cleanup": 0,
      "current": 0,
      "update": 0
    },
    "writes": {
      "fail": 0,
      "success": 0,
      "total": 0
    }
  },
  "system": {
    "cpu": {
      "cores": 16
    },
    "load": {
      "1": 4.61,
      "15": 4.25,
      "5": 4.18,
      "norm": {
        "1": 0.2881,
        "15": 0.2656,
        "5": 0.2613
      }
    }
  }
}
As you can see, all the harvester stats are 0.
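For reference, those stats are pulled like this, assuming the http option is on its default port (adjust if yours differs):
# in filebeat.yml
http.enabled: true
http.host: localhost
http.port: 5066
# then
curl -s http://localhost:5066/stats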
-
Does this mean there are no harvesters? The docs would have you believe that there is still a harvester spun up for each file, but then where are the logs? There would have to be at least a few hundred harvesters running at any time, yet I see barely 50-60 filebeat threads at any given time. Max harvesters is set to 0, so unlimited.
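For what it's worth, the thread and handle counts come straight from /proc, nothing filebeat-specific:
PID=$(pidof filebeat)
ls /proc/$PID/task | wc -l   # OS threads
ls /proc/$PID/fd   | wc -l   # open file descriptors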
-
I did a throughput test, as described by someone else on this forum, by running this:
#!/bin/bash
# stop the running service so the test instance is the only filebeat reading /log
systemctl stop filebeat
ulimit -n 65536
# start from a clean throwaway registry so every file is read from the beginning
rm -rf /tmp/registry-test
# count the lines per second filebeat can read and push to the console output
/usr/share/filebeat/bin/filebeat -c ./filebeat-test.yml | pv -Warl >/dev/null
With this temporary config:
logging.level: info
logging.to_syslog: false
filebeat.registry.path: /tmp/registry-test
output.console:
  pretty: false
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
filebeat.inputs:
- type: filestream
  id: log
  enabled: true
  paths:
    - /log/**/*
processors:
  ...
Here is the output:
[20.9k/s] [19.8k/s]
[ 21k/s] [20.2k/s]
[23.2k/s] [20.6k/s]
[21.1k/s] [20.9k/s]
[22.4k/s] [21.4k/s]
[21.9k/s] [21.3k/s]
If I understand this right, this is supposed to measure the speed at which filebeat can read files off the raw fs. Let me know if I did this wrong.
Since I cannot have more than one output at a time, I obviously cannot add the LS target back in, so this seems to only be good for measuring how fast FB can read files from the fs. No idea if 20k/s is good, bad or average... Is there a similar test to see at what rate fb is able to pump those 20k/s to the LS endpoint?
Unless I can grab this stat here and do some math on it?
"type": "logstash",
"write": {
"bytes": 30518006886,
"errors": 0
So did I push 30 GB in 30s to the LS target? About 1 GB/s then?
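If those counters are cumulative since startup (which is what I assume, so the 30 GB would be since start rather than per 30s interval), the math I have in mind is a simple delta between two samples, e.g. (assuming the stats endpoint on its default port 5066 and jq available):
# sample the stats endpoint twice, 60s apart, and compute average output rates
S1=$(curl -s http://localhost:5066/stats)
sleep 60
S2=$(curl -s http://localhost:5066/stats)
A1=$(echo "$S1" | jq '.libbeat.output.events.acked')
A2=$(echo "$S2" | jq '.libbeat.output.events.acked')
B1=$(echo "$S1" | jq '.libbeat.output.write.bytes')
B2=$(echo "$S2" | jq '.libbeat.output.write.bytes')
echo "events/s to LS: $(( (A2 - A1) / 60 ))"
echo "bytes/s  to LS: $(( (B2 - B1) / 60 ))"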
I saw that you can send extra monitoring data, but you need ES on the other end. We only have LS right now.
# ============================= X-Pack Monitoring ==============================
# Filebeat can export internal metrics to a central Elasticsearch monitoring
# cluster. This requires xpack monitoring to be enabled in Elasticsearch. The
# reporting is disabled by default.
# Set to true to enable the monitoring reporter.
#monitoring.enabled: false
If I ask them to set this up, would that show all the stats you could possibly want, or would there still be a need to capture the local stats somewhere else?
If I could prove that I have more data waiting to be sent than they can receive, that would help. They don't think that's the case but, like I said, we don't have anything to prove or disprove it right now.
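The closest thing I can think of is to watch whether the internal queue stays pinned at its maximum, something like this (same assumptions about the stats endpoint and jq):
# if events.active stays pinned near queue.max_events, the output is likely the bottleneck
while sleep 10; do
  curl -s http://localhost:5066/stats | \
    jq '{active: .libbeat.pipeline.events.active, max: .libbeat.pipeline.queue.max_events}'
done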
Some more musings:
They think I need to tinker with "close_*" options for the harvester but
- It seems we are not even using harvesters anymore (at least the stats suggest that)
"harvester": {
"closed": 0,
"open_files": 0,
"running": 0,
"skipped": 0,
"started": 0
-
After the files have been moved to the other volume it should take no more than 5 minutes for any harvester to close the handle, since those files receive no more input. That is already the default anyway.
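For what it's worth, my understanding is that the filestream equivalents of those close_* options are the close.on_state_change.* settings; the values below are what I believe are the defaults, so I have not overridden anything:
- type: filestream
  id: log
  paths:
    - /log/**/*
  # close the handle after 5 minutes without new data (default, as I understand it)
  close.on_state_change.inactive: 5m
  # close the handle when the file is removed (default)
  close.on_state_change.removed: true
  # drop the registry state once the file is gone (default)
  clean_removed: true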
-
I have set nofiles to the max of 65536 and only see about 4-7k open at any time, so it should not be an OS limit.
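For reference, raising that limit for the service is just a systemd drop-in along these lines (our exact override may differ):
# /etc/systemd/system/filebeat.service.d/override.conf
[Service]
LimitNOFILE=65536
# then: systemctl daemon-reload && systemctl restart filebeat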
-
If 4-7k open files were too many for filebeat to handle I'd expect a ton of errors, but I see nothing but clean logs and the occasional monitoring stats - that's it. This is why I'm a little reluctant to believe that filebeat cannot handle all the files. Am I not right to assume that there would be errors if that were the case?
-
Next time they tell me that logfile xyz is lagging, I'd need explicit information about that very file, like per-harvester stats or whatever, but the "filestream" input doesn't seem to track individual files/harvesters anymore. All the other numbers are global, so it's hard to narrow it down to the individual logs that lag while others don't.
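The only workaround I can think of is spot-checking the registry for a given file, as far as I understand the registry format (the path below is the default data dir for a package install, and the example file name is made up):
# compare the last offset filebeat recorded for one file with its size on disk
F=/log/business_unit_a/2024/01/15/somehost/daemon   # made-up example path
grep -h "\"source\":\"$F\"" /var/lib/filebeat/registry/filebeat/log.json | tail -1
stat -c %s "$F"   # current size; a large gap vs. the recorded offset would mean lag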
Hope this helps and thank you.