Hi Leandro
Certainly. I can give more info.
We have an rsyslog receiver that writes incoming logs using this template format:
/log/business_unit_a/%$year%/%$month%/%$day%/%fromhost%/%syslogfacility-text%
.
.
/log/business_unit_z/%$year%/%$month%/%$day%/%fromhost%/%syslogfacility-text%
There are several BUs and a few thousand files under /log, depending on how many facilities each source uses. We have about 2000 log sources; some use only one facility and some use 2 or 3, so you can roughly double that number to get an idea of how many files we have.
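For context, the receiver rules are roughly of this shape, one template/action pair per BU (the template name and routing condition here are illustrative, not our exact config):
# illustrative rsyslog rule; one template/action pair per business unit
template(name="bu_a_path" type="string"
         string="/log/business_unit_a/%$year%/%$month%/%$day%/%fromhost%/%syslogfacility-text%")
if ($fromhost startswith "bu-a-") then {
    action(type="omfile" dynaFile="bu_a_path")
}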
- At midnight UTC all the old files stop being written to, since rsyslog starts writing to a new path based on the new date.
- Once a day I move the previous day's directories to an EFS export for longer-term storage, so we only keep about 24h worth of logs on the "/log" volume.
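The nightly move is essentially a loop like this (the EFS mount point is illustrative, not our real path):
#!/bin/bash
# nightly: move yesterday's directories off the fast /log volume to the EFS archive
Y=$(date -u -d yesterday +%Y); M=$(date -u -d yesterday +%m); D=$(date -u -d yesterday +%d)
for bu in /log/*/; do
    src="${bu}${Y}/${M}/${D}"
    [ -d "$src" ] || continue
    dst="/mnt/efs/logarchive/$(basename "$bu")/${Y}/${M}"   # illustrative mount point
    mkdir -p "$dst"
    mv "$src" "$dst/"
done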
I have a Splunk Universal Forwarder watching the entire "/log" dir and that works perfectly fine. All data gets into our Splunk instance in near real time without any tuning at all.
However, there is another team that needs this data and they wanted us to send it with filebeat. They have a Logstash/Kafka/NiFi custom layer cake over there that I don't have control over; they just gave me the LS endpoint. They do not currently offer ES as an option.
So I tried to do the exact same thing with filebeat that I did with the Splunk forwarder:
filebeat.inputs:
- type: filestream
  id: log
  enabled: true
  paths:
    - /log/**/*
It looked fine until people started noticing that some logs are sometimes way behind. We have already checked all timestamps; they are all UTC. The lag is not always 8h, and I'm told some logs were even further behind. When I check at random they say "oh, right now it's fine", so it's very hard to troubleshoot.
I was told that it's probably my harvesters not keeping up or the file handles not being closed early enough, but those are all wild guesses since I have little visibility into what filebeat is doing. That's why I started looking for those harvester logs, but as I mentioned there aren't any. All references I find online are for the "log" input, which has been replaced by the "filestream" input. The only stats I get are from the http option with curl. Here is a current prod output:
{
  "beat": {
    "cgroup": {
      "cpu": {
        "cfs": {
          "period": {
            "us": 100000
          },
          "quota": {
            "us": 0
          }
        },
        "id": "filebeat.service",
        "stats": {
          "periods": 0,
          "throttled": {
            "ns": 0,
            "periods": 0
          }
        }
      },
      "cpuacct": {
        "id": "filebeat.service",
        "total": {
          "ns": 33460390036502
        }
      },
      "memory": {
        "id": "filebeat.service",
        "mem": {
          "limit": {
            "bytes": 9223372036854771712
          },
          "usage": {
            "bytes": 11593932800
          }
        }
      }
    },
    "cpu": {
      "system": {
        "ticks": 1483730,
        "time": {
          "ms": 1483730
        }
      },
      "total": {
        "ticks": 33460260,
        "time": {
          "ms": 33460260
        },
        "value": 33460260
      },
      "user": {
        "ticks": 31976530,
        "time": {
          "ms": 31976530
        }
      }
    },
    "handles": {
      "limit": {
        "hard": 65536,
        "soft": 65536
      },
      "open": 3839
    },
    "info": {
      "ephemeral_id": "f9028d68-b966-450f-bb30-0d7c46053149",
      "name": "filebeat",
      "uptime": {
        "ms": 18755449
      },
      "version": "8.4.1"
    },
    "memstats": {
      "gc_next": 542585200,
      "memory_alloc": 387455688,
      "memory_sys": 728834216,
      "memory_total": 2691891003952,
      "rss": 767959040
    },
    "runtime": {
      "goroutines": 11523
    }
  },
  "filebeat": {
    "events": {
      "active": 4184,
      "added": 188287064,
      "done": 188282880
    },
    "harvester": {
      "closed": 0,
      "open_files": 0,
      "running": 0,
      "skipped": 0,
      "started": 0
    },
    "input": {
      "log": {
        "files": {
          "renamed": 0,
          "truncated": 0
        }
      },
      "netflow": {
        "flows": 0,
        "packets": {
          "dropped": 0,
          "received": 0
        }
      }
    }
  },
  "libbeat": {
    "config": {
      "module": {
        "running": 0,
        "starts": 0,
        "stops": 0
      },
      "reloads": 1,
      "scans": 1
    },
    "output": {
      "events": {
        "acked": 188280832,
        "active": 4096,
        "batches": 91936,
        "dropped": 0,
        "duplicates": 0,
        "failed": 0,
        "toomany": 0,
        "total": 188284928
      },
      "read": {
        "bytes": 3223114,
        "errors": 0
      },
      "type": "logstash",
      "write": {
        "bytes": 30518006886,
        "errors": 0
      }
    },
    "pipeline": {
      "clients": 3825,
      "events": {
        "active": 4256,
        "dropped": 0,
        "failed": 0,
        "filtered": 0,
        "published": 188284948,
        "retry": 2048,
        "total": 188285088
      },
      "queue": {
        "acked": 188280832,
        "max_events": 4096
      }
    }
  },
  "registrar": {
    "states": {
      "cleanup": 0,
      "current": 0,
      "update": 0
    },
    "writes": {
      "fail": 0,
      "success": 0,
      "total": 0
    }
  },
  "system": {
    "cpu": {
      "cores": 16
    },
    "load": {
      "1": 4.61,
      "15": 4.25,
      "5": 4.18,
      "norm": {
        "1": 0.2881,
        "15": 0.2656,
        "5": 0.2613
      }
    }
  }
}
As you can see, all the harvester stats are 0.
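For reference, those stats are pulled like this, assuming the http option is on its default port (adjust if yours differs):
# in filebeat.yml
http.enabled: true
http.host: localhost
http.port: 5066
# then
curl -s http://localhost:5066/stats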
-
Does this mean there are no harvesters? The docs would have you believe that there is still a harvester spun up for each file, but then where are the logs? There would have to be at least a few hundred harvesters running at any time, yet I see barely 50-60 filebeat threads at any given time. Max harvesters is set to 0, so unlimited.
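For what it's worth, the thread and handle counts come straight from /proc, nothing filebeat-specific:
PID=$(pidof filebeat)
ls /proc/$PID/task | wc -l   # OS threads
ls /proc/$PID/fd   | wc -l   # open file descriptors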
-
I did a throughput test, as described by someone else on this forum, by running this:
#!/bin/bash
# stop the running service so the test instance is the only filebeat reading /log
systemctl stop filebeat
ulimit -n 65536
# start from a clean throwaway registry so every file is read from the beginning
rm -rf /tmp/registry-test
# count the lines per second filebeat can read and push to the console output
/usr/share/filebeat/bin/filebeat -c ./filebeat-test.yml | pv -Warl >/dev/null
With this temporary config:
logging.level: info
logging.to_syslog: false
filebeat.registry.path: /tmp/registry-test
output.console:
  pretty: false
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
filebeat.inputs:
- type: filestream
  id: log
  enabled: true
  paths:
    - /log/**/*
processors:
  ...
Here is the output:
[20.9k/s] [19.8k/s]
[ 21k/s] [20.2k/s]
[23.2k/s] [20.6k/s]
[21.1k/s] [20.9k/s]
[22.4k/s] [21.4k/s]
[21.9k/s] [21.3k/s]
If I understand this right, this is supposed to measure the speed at which filebeat can read files off the raw fs. Let me know if I did this wrong.
Since I cannot have more than one output at a time, I obviously cannot add the LS target back in, so this seems to only be good for measuring how fast FB can read files from the fs. No idea if 20k/s is good, bad or average... Is there a similar test to see at what rate fb is able to pump those 20k/s to the LS endpoint?
Unless I can grab this stat here and do some math on it?
"type": "logstash",
"write": {
"bytes": 30518006886,
"errors": 0
So did I push 30 GB in 30s to the LS target? About 1 GB/s then?
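If those counters are cumulative since startup (which is what I assume, so the 30 GB would be since start rather than per 30s interval), the math I have in mind is a simple delta between two samples, e.g. (assuming the stats endpoint on its default port 5066 and jq available):
# sample the stats endpoint twice, 60s apart, and compute average output rates
S1=$(curl -s http://localhost:5066/stats)
sleep 60
S2=$(curl -s http://localhost:5066/stats)
A1=$(echo "$S1" | jq '.libbeat.output.events.acked')
A2=$(echo "$S2" | jq '.libbeat.output.events.acked')
B1=$(echo "$S1" | jq '.libbeat.output.write.bytes')
B2=$(echo "$S2" | jq '.libbeat.output.write.bytes')
echo "events/s to LS: $(( (A2 - A1) / 60 ))"
echo "bytes/s  to LS: $(( (B2 - B1) / 60 ))"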
I saw that you can send extra monitoring data, but you need ES on the other end. We only have LS right now.
# ============================= X-Pack Monitoring ==============================
# Filebeat can export internal metrics to a central Elasticsearch monitoring
# cluster. This requires xpack monitoring to be enabled in Elasticsearch. The
# reporting is disabled by default.
# Set to true to enable the monitoring reporter.
#monitoring.enabled: false
If I ask them to set this up, would that show all the stats you could possibly want, or would there still be a need to capture the local stats somewhere else?
If I could prove that I have more data waiting to be sent than they can receive, that would help. They don't think that's the case but, like I said, we don't have anything to prove or disprove it right now.
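The closest thing I can think of is to watch whether the internal queue stays pinned at its maximum, something like this (same assumptions about the stats endpoint and jq):
# if events.active stays pinned near queue.max_events, the output is likely the bottleneck
while sleep 10; do
  curl -s http://localhost:5066/stats | \
    jq '{active: .libbeat.pipeline.events.active, max: .libbeat.pipeline.queue.max_events}'
done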
Some more musings:
They think I need to tinker with "close_*" options for the harvester but
- It seems we are not even using harvesters anymore (at least the stats suggest that)
"harvester": {
"closed": 0,
"open_files": 0,
"running": 0,
"skipped": 0,
"started": 0
-
After the files have been moved to the other volume it should take no more than 5 minutes for any harvester to close the handle, since those files receive no more input. That is already the default anyway.
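For what it's worth, my understanding is that the filestream equivalents of those close_* options are the close.on_state_change.* settings; the values below are what I believe are the defaults, so I have not overridden anything:
- type: filestream
  id: log
  paths:
    - /log/**/*
  # close the handle after 5 minutes without new data (default, as I understand it)
  close.on_state_change.inactive: 5m
  # close the handle when the file is removed (default)
  close.on_state_change.removed: true
  # drop the registry state once the file is gone (default)
  clean_removed: true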
-
I have set nofiles to the max of 65536 and only see about 4-7k open at any time, so it should not be an OS limit.
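For reference, raising that limit for the service is just a systemd drop-in along these lines (our exact override may differ):
# /etc/systemd/system/filebeat.service.d/override.conf
[Service]
LimitNOFILE=65536
# then: systemctl daemon-reload && systemctl restart filebeat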
-
If 4-7k open files were too many for filebeat to handle I'd expect a ton of errors, but I see nothing but clean logs and the occasional monitoring stats - that's it. This is why I'm a little reluctant to believe that filebeat cannot handle all the files. Am I not right to assume that there would be errors if that were the case?
-
Next time they tell me that logfile xyz is lagging, I'd need explicit information about that very file, like per-harvester stats or whatever, but the "filestream" input doesn't seem to track individual files/harvesters anymore. All the other numbers are global, so it's hard to narrow it down to the individual logs that lag while others don't.
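The only workaround I can think of is spot-checking the registry for a given file, as far as I understand the registry format (the path below is the default data dir for a package install, and the example file name is made up):
# compare the last offset filebeat recorded for one file with its size on disk
F=/log/business_unit_a/2024/01/15/somehost/daemon   # made-up example path
grep -h "\"source\":\"$F\"" /var/lib/filebeat/registry/filebeat/log.json | tail -1
stat -c %s "$F"   # current size; a large gap vs. the recorded offset would mean lag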
Hope this helps and thank you.