MetricBeat Stats
The following table summarizes a set of experiments running the metricbeat agent on
various platforms and with various configurations to determine how much Elasticsearch
storage is required to hold the metrics over different periods of time. The
idea is to arrive at a per-monitored-host storage requirement so we can
appropriately size an Elastic cluster to serve as an infrastructure monitoring
platform.
Stats Collection Methodology
Since I already use Ansible to configuration-manage hosts and have playbooks for
managing Elasticsearch, I employed them to spin up a new VM running
Elasticsearch. The node was configured as a combined master/ingest/data node. The VM
also has Kibana installed so I can easily observe index creation and
document arrival into the index.
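Kibana works well for watching the index grow, but the same numbers are available from the _cat/indices API if you prefer to script the check. Below is a minimal sketch in Python, assuming the elastic.example.com host and elastic/changeme credentials used in the configurations later in this post:

import requests

# Cluster details as used in the example configurations below.
ES_URL = "http://elastic.example.com:9200"
AUTH = ("elastic", "changeme")

# _cat/indices reports the document count and on-disk size of each index.
resp = requests.get(
    f"{ES_URL}/_cat/indices/metricbeat-*",
    params={"format": "json", "bytes": "b"},
    auth=AUTH,
)
resp.raise_for_status()

for idx in resp.json():
    print(idx["index"], idx["docs.count"], "docs,", idx["store.size"], "bytes")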
Once provisioned, Elasticsearch is running with no indices, ready to
receive data. I then used Ansible to install the metricbeat agent on
a host and configure the metricbeat modules and metricsets. Next, I ran the
metricbeat agent for 5 minutes and observed the created index in Kibana,
taking note of the number of documents and the size of the index. I recorded this
information in a spreadsheet, where simple cell-based multiplication calculates
the extrapolated storage for 1-hour, 1-day, 1-month, and 1-year increments.
After each 5-minute run, I deleted the index, reconfigured metricbeat, and ran the
agent for another 5 minutes. I performed this with metricbeat running on
Linux (the same node running Elasticsearch/Kibana) and with metricbeat running
on Windows.
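The spreadsheet math is plain multiplication of the 5-minute sample: 12x for an hour, 288x for a day, a 30-day month, a 365-day year, and MB converted to GB by dividing by 1024. A small sketch of that arithmetic, using row 2 of the table below as the input:

# Extrapolate a 5-minute metricbeat sample to longer periods.
# Multipliers match the spreadsheet: 30-day month, 365-day year.
PERIODS = {"1hr": 12, "1d": 288, "1mo": 288 * 30, "1y": 288 * 365}

def extrapolate(docs_5m: int, size_mb_5m: float) -> dict:
    return {
        name: {"docs": docs_5m * mult, "size_mb": size_mb_5m * mult}
        for name, mult in PERIODS.items()
    }

# Row 2 of the table: default system module at a 10-second period,
# 3,673 documents and 3 MB collected in 5 minutes.
result = extrapolate(3673, 3.0)
print(result["1d"])   # {'docs': 1057824, 'size_mb': 864.0}
print(result["1y"])   # {'docs': 386105760, 'size_mb': 315360.0}  (~308 GB)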
Results
| Platform | Modules | Period (s) | # of Docs (5m) | Size (MB) (5m) | # of Docs (1hr) | Size (MB) (1hr) | # of Docs (1d) | Size (MB) (1d) | # of Docs (1mo) | Size (MB) (1mo) | Size (GB) (1mo) | # of Docs (1y) | Size (MB) (1y) | Size (GB) (1y) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linux | system | 5 | 7617 | 6.1 | 91404 | 73.2 | 2.1937e+06 | 1756.8 | 6.58109e+07 | 52704 | 51.4688 | 8.00699e+08 | 641232 | 626.203 |
| Linux | system | 10 | 3673 | 3 | 44076 | 36 | 1.05782e+06 | 864 | 3.17347e+07 | 25920 | 25.3125 | 3.86106e+08 | 315360 | 307.969 |
| Linux | system | 20 | 1838 | 1.5 | 22056 | 18 | 529344 | 432 | 1.58803e+07 | 12960 | 12.6562 | 1.93211e+08 | 157680 | 153.984 |
| Linux | system | 10 | 3828 | 3.2 | 45936 | 38.4 | 1.10246e+06 | 921.6 | 3.30739e+07 | 27648 | 27 | 4.02399e+08 | 336384 | 328.5 |
| Linux | system | 10 | 3937 | 3.2 | 47244 | 38.4 | 1.13386e+06 | 921.6 | 3.40157e+07 | 27648 | 27 | 4.13857e+08 | 336384 | 328.5 |
| Linux | system | 10 | 4206 | 3.3 | 50472 | 39.6 | 1.21133e+06 | 950.4 | 3.63398e+07 | 28512 | 27.8438 | 4.42135e+08 | 346896 | 338.766 |
| Linux | system | 10 | 4271 | 3.4 | 51252 | 40.8 | 1.23005e+06 | 979.2 | 3.69014e+07 | 29376 | 28.6875 | 4.48968e+08 | 357408 | 349.031 |
| Linux | system | 10 | 4173 | 3.3 | 50076 | 39.6 | 1.20182e+06 | 950.4 | 3.60547e+07 | 28512 | 27.8438 | 4.38666e+08 | 346896 | 338.766 |
| Windows | windows | 10 | 5790 | 1 | 69480 | 12 | 1.66752e+06 | 288 | 5.00256e+07 | 8640 | 8.4375 | 6.08645e+08 | 105120 | 102.656 |
| Windows | windows | 10 | 444 | 0.12 | 5328 | 1.44 | 127872 | 34.56 | 3.83616e+06 | 1036.8 | 1.0125 | 4.66733e+07 | 12614.4 | 12.3187 |
The first 3 rows correspond to a "default" metricbeat configuration, where the
only variation is the metricbeat reporting period. The default reporting period
is 10 seconds (row 2), but I also experimented with cutting that time in half
for more resolution (row 1) and doubling that time for storage considerations
(row 3). The configuration below is representative of these cases.
logging:
  files: {keepfiles: 2, name: metricbeat.log, path: /var/log/}
  level: warning
  to_files: true
  to_syslog: false
metricbeat.modules:
- enabled: true
  metricsets: [cpu, load, memory, network, process, process_summary]
  module: system
  period: 10s
  processes: [.*]
output:
  elasticsearch:
    enabled: true
    hosts: ['http://elastic.example.com:9200']
    index: metricbeat-%{[beat.version]}-default
    password: changeme
    username: elastic
setup:
  dashboards: {enabled: true}
  kibana: {host: 'http://kibana.example.com:5601'}
  template: {name: 'metricbeat-%{[beat.version]}', pattern: 'metricbeat-%{[beat.version]}-*'}
The next 5 rows correspond to the default 10-second reporting period, but for each
row I added an additional metricset. I incrementally added the uptime, core,
diskio, filesystem, and fsstat metricsets.
The last 2 rows correspond to metricbeat running on Windows. The first Windows
row is the metricbeat windows module collecting the service metricset. The
configuration looks like this:
logging:
  files:
    keepfiles: 2
    name: metricbeat.log
    path: C:\Metricbeat\logs
  level: debug
  to_files: true
  to_syslog: false
metricbeat.modules:
- enabled: true
  metricsets:
  - service
  module: windows
  period: 10s
output:
  elasticsearch:
    enabled: true
    hosts:
    - http://elastic.example.com:9200
    index: metricbeat-%{[beat.version]}
    password: changeme
    username: elastic
setup:
  template:
    name: metricbeat-%{[beat.version]}
    pattern: metricbeat-%{[beat.version]}-*
The last Windows row uses the (still beta) perfmon metricset and is configured
like this:
logging:
  files:
    keepfiles: 2
    name: metricbeat.log
    path: C:\Metricbeat\logs
  level: debug
  to_files: true
  to_syslog: false
metricbeat.modules:
- enabled: true
  metricsets:
  - perfmon
  module: windows
  perfmon.counters:
  - instance_label: processor_name
    instance_name: total
    measurement_label: processor.time.total.pct
    query: \Processor(_Total)\% Processor Time
  - instance_label: physical_disk.name
    instance_name: total
    measurement_label: physical_disk.time.total.pct
    query: \PhysicalDisk(_Total)\% Disk Time
  - instance_label: physical_disk.name
    instance_name: total
    measurement_label: physical_disk.time.read.pct
    query: \PhysicalDisk(_Total)\% Disk Read Time
  - instance_label: physical_disk.name
    instance_name: total
    measurement_label: physical_disk.time.write.pct
    query: \PhysicalDisk(_Total)\% Disk Write Time
  - instance_label: logical_disk.name
    instance_name: total
    measurement_label: logical_disk.space.free.pct
    query: \LogicalDisk(_Total)\% Free Space
  - instance_label: logical_disk.name
    instance_name: total
    measurement_label: logical_disk.space.free.mb
    query: \LogicalDisk(_Total)\Free Megabytes
  - instance_label: paging_file.name
    instance_name: total
    measurement_label: paging_file.usage.pct
    query: \Paging File(_Total)\% Usage
  - instance_label: memory.name
    instance_name: total
    measurement_label: memory.available.mbytes
    query: \Memory()\Available MBytes
  - instance_label: numa_node.name
    instance_name: total
    measurement_label: numa_node.memory.available.mbytes
    query: \NUMA Node Memory(_Total)\Available MBytes
  - instance_label: numa_node.name
    instance_name: total
    measurement_label: numa_node.memory.total.mbytes
    query: \NUMA Node Memory(_Total)\Total MBytes
  - instance_label: system_info.name
    instance_name: total
    measurement_label: system.processes.count
    query: \System()\Processes
  - instance_label: system_info.name
    instance_name: total
    measurement_label: system.threads.count
    query: \System()\Threads
  - instance_label: system_info.name
    instance_name: total
    measurement_label: system.uptime.seconds
    query: \System()\System Up Time
  - instance_label: network_interface.name
    instance_name: vmxnet3
    measurement_label: network.interface.in.bytes
    query: \Network Interface(vmxnet3 Ethernet Adapter)\Bytes Received/sec
  - instance_label: network_interface.name
    instance_name: vmxnet3
    measurement_label: network.interface.out.bytes
    query: \Network Interface(vmxnet3 Ethernet Adapter)\Bytes Sent/sec
  perfmon.group_measurements_by_instance: true
  perfmon.ignore_non_existent_counters: true
  period: 10s
output:
  elasticsearch:
    enabled: true
    hosts:
    - http://elastic.example.com:9200
    index: metricbeat-%{[beat.version]}
    password: changeme
    username: elastic
setup:
  template:
    name: metricbeat-%{[beat.version]}
    pattern: metricbeat-%{[beat.version]}-*
Conclusions
Metricbeat is just one piece of a holistic Application and Infrastructure
monitoring solution. We'll need this metric information along with logging and
application performance monitoring (APM) data to paint a good picture of the health and
status of applications and infrastructure, and to alarm/alert when the data
tells us something is abnormal.
Running metricbeat on Linux and collecting all but the raid and socket
metricsets at a 10-second collection period accounts for nearly 1 GB of data per
day per monitored host. For Windows, it is currently about 300 MB of data per
day per monitored host. If you have 1000+ hosts to monitor, this could easily
top 1 TB per day. And again, this does not account for other data sources such
as logs, APM telemetry, etc. Nor does it account for Elasticsearch high
availability, where multiple primary and replica shards keep
the data resilient and accessible.
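A quick back-of-the-envelope calculation makes the scale concrete. The sketch below simply reuses the per-host daily figures above together with an assumed host mix and replica count; it is illustrative only, not a capacity plan:

# Rough fleet-level storage estimate from the per-host daily figures above.
# Host counts and replica count are assumptions for illustration.
LINUX_GB_PER_DAY = 1.0      # ~1 GB/day/host (system module, 10s period)
WINDOWS_GB_PER_DAY = 0.3    # ~300 MB/day/host (windows service metricset)
REPLICAS = 1                # one replica per primary shard

def daily_storage_gb(linux_hosts: int, windows_hosts: int) -> float:
    raw = linux_hosts * LINUX_GB_PER_DAY + windows_hosts * WINDOWS_GB_PER_DAY
    return raw * (1 + REPLICAS)

# 800 Linux + 200 Windows hosts: ~860 GB/day of raw metrics,
# roughly 1.7 TB/day once each primary shard has one replica.
print(f"{daily_storage_gb(800, 200):.0f} GB/day")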
The moral of the story is that you will need to clearly define the hosts you
want to monitor and come up with a data retention strategy. You'll
likely need to make use of Curator and/or index lifecycle management (ILM) policies
to ensure you're not hanging on to perishable data for too long. Another strategy
may be to make use of rollup indices to summarize metric data over time.
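As an illustration of the retention side, an ILM policy can roll metricbeat indices over daily and delete them after a month. The sketch below pushes such a policy using the same hypothetical cluster details as the earlier examples; the policy name and thresholds are assumptions, not recommendations:

import requests

# Hypothetical cluster details, matching the example configurations above.
ES_URL = "http://elastic.example.com:9200"
AUTH = ("elastic", "changeme")

# Roll metricbeat indices over daily (or at 50 GB) and delete them after 30 days.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_age": "1d", "max_size": "50gb"}}
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(f"{ES_URL}/_ilm/policy/metricbeat-30d", json=policy, auth=AUTH)
resp.raise_for_status()
print(resp.json())  # {'acknowledged': True}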
I'm certainly open to hearing stories from the community on how they've used
Elasticsearch as the heart of an Application and Infrastructure monitoring platform
and the strategies they've employed to balance functionality and resource
utilization.