MetricBeat Stats
The following table summarizes some experiments running the metricbeat agent on
various platforms and with various configurations to determine how much
ElasticSearch storage the collected metrics consume over different periods of
time.  The idea is to arrive at a per-monitored-host storage requirement so we
can appropriately size an Elastic cluster to serve as an infrastructure
monitoring platform.
Stats Collection Methodology
Since I already use Ansible for configuration management of my hosts and have
playbooks for managing ElasticSearch, I employed them to spin up a new VM
running ElasticSearch.  The node was configured as a combined
master/ingest/data node.  The VM
also has Kibana installed to allow me to easily observe index creation and
document arrival into the index.
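For reference, the relevant node-role settings in elasticsearch.yml look roughly
like the sketch below.  This is illustrative rather than the exact file my
playbook renders; it simply shows a single node explicitly carrying all three
roles.

```yaml
# Sketch only: explicit role flags for a single combined node
# (these are also the defaults for a 6.x/7.x node).
node.master: true
node.data: true
node.ingest: true
```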
Once initially provisioned, we have ElasticSearch with no indices, ready to
receive some data.  I then used Ansible to provision the metricbeat agent to
a host and to configure the metricbeat modules and metricsets.  Next, I ran the
metricbeat agent for 5 minutes and observed the created index in Kibana, noting
the number of documents and the size of the index.  I recorded this information
in a spreadsheet, where simple cell-based multiplication extrapolates the
storage to 1-hour, 1-day, 1-month, and 1-year increments; for example, the
3,673 documents and 3 MB from the default 10-second run scale to 44,076
documents and 36 MB per hour, and to roughly 1.06 million documents and 864 MB
per day.  After each 5-minute run, I deleted the index, reconfigured
metricbeat, and ran the agent for another 5 minutes.  I performed this with
metricbeat running on Linux (the same node running ElasticSearch/Kibana) and
with metricbeat running on Windows.
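The per-run reconfiguration is all driven from Ansible.  The tasks below are a
minimal sketch of that loop rather than my actual role; the template name and
paths are placeholders.

```yaml
# Minimal sketch, not the actual playbook: render the metricbeat config
# for the current experiment and restart the agent for its 5-minute run.
- name: Deploy metricbeat configuration for this run
  template:
    src: metricbeat.yml.j2              # placeholder template name
    dest: /etc/metricbeat/metricbeat.yml

- name: Restart metricbeat to begin the collection window
  service:
    name: metricbeat
    state: restarted
    enabled: true
```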
Results
| Platform | Modules | Period (s) | # of Docs (5m) | Size (MB) (5m) | # of Docs (1hr) | Size (MB) (1hr) | # of Docs (1d) | Size (MB) (1d) | # of Docs (1mo) | Size (MB) (1mo) | Size (GB) (1mo) | # of Docs (1y) | Size (MB) (1y) | Size (GB) (1y) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linux | system | 5 | 7617 | 6.1 | 91404 | 73.2 | 2.1937e+06 | 1756.8 | 6.58109e+07 | 52704 | 51.4688 | 8.00699e+08 | 641232 | 626.203 |
| Linux | system | 10 | 3673 | 3 | 44076 | 36 | 1.05782e+06 | 864 | 3.17347e+07 | 25920 | 25.3125 | 3.86106e+08 | 315360 | 307.969 |
| Linux | system | 20 | 1838 | 1.5 | 22056 | 18 | 529344 | 432 | 1.58803e+07 | 12960 | 12.6562 | 1.93211e+08 | 157680 | 153.984 |
| Linux | system | 10 | 3828 | 3.2 | 45936 | 38.4 | 1.10246e+06 | 921.6 | 3.30739e+07 | 27648 | 27 | 4.02399e+08 | 336384 | 328.5 |
| Linux | system | 10 | 3937 | 3.2 | 47244 | 38.4 | 1.13386e+06 | 921.6 | 3.40157e+07 | 27648 | 27 | 4.13857e+08 | 336384 | 328.5 |
| Linux | system | 10 | 4206 | 3.3 | 50472 | 39.6 | 1.21133e+06 | 950.4 | 3.63398e+07 | 28512 | 27.8438 | 4.42135e+08 | 346896 | 338.766 |
| Linux | system | 10 | 4271 | 3.4 | 51252 | 40.8 | 1.23005e+06 | 979.2 | 3.69014e+07 | 29376 | 28.6875 | 4.48968e+08 | 357408 | 349.031 |
| Linux | system | 10 | 4173 | 3.3 | 50076 | 39.6 | 1.20182e+06 | 950.4 | 3.60547e+07 | 28512 | 27.8438 | 4.38666e+08 | 346896 | 338.766 |
| Windows | windows | 10 | 5790 | 1 | 69480 | 12 | 1.66752e+06 | 288 | 5.00256e+07 | 8640 | 8.4375 | 6.08645e+08 | 105120 | 102.656 |
| Windows | windows | 10 | 444 | 0.12 | 5328 | 1.44 | 127872 | 34.56 | 3.83616e+06 | 1036.8 | 1.0125 | 4.66733e+07 | 12614.4 | 12.3187 |
 The first 3 rows correspond to a "default" metricbeat configuration, where the
only variation is the metricbeat reporting period.  The default reporting period
is 10 seconds (row 2), but I also experimented with cutting that time in half
for more resolution (row 1) and doubling that time for storage considerations
(row 3).  The configuration below is representative of these cases.
logging:
  files: {keepfiles: 2, name: metricbeat.log, path: /var/log/}
  level: warning
  to_files: true
  to_syslog: false
metricbeat.modules:
- enabled: true
  metricsets: [cpu, load, memory, network, process, process_summary]
  module: system
  period: 10s
  processes: [.*]
output:
  elasticsearch:
    enabled: true
    hosts: ['http://elastic.example.com:9200']
    index: metricbeat-%{[beat.version]}-default
    password: changeme
    username: elastic
setup:
  dashboards: {enabled: true}
  kibana: {host: 'http://kibana.example.com:5601'}
  template: {name: 'metricbeat-%{[beat.version]}', pattern: 'metricbeat-%{[beat.version]}-*'}
The next 5 rows keep the default 10-second reporting period, but each row adds
one more metricset to the system module: I incrementally added the uptime,
core, diskio, filesystem, and fsstat metricsets.
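By the last of those runs, the system module block looks roughly like this (a
sketch assuming the same period and process pattern as the default
configuration above):

```yaml
# Sketch of the system module for the fullest Linux run (rows 4-8 add
# uptime, core, diskio, filesystem, and fsstat one at a time).
metricbeat.modules:
- enabled: true
  metricsets: [cpu, load, memory, network, process, process_summary,
               uptime, core, diskio, filesystem, fsstat]
  module: system
  period: 10s
  processes: [.*]
```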
The last 2 rows correspond to metricbeat running on Windows.  The first Windows
row uses the metricbeat windows module to collect the service metricset.  The
configuration looks like this:
logging:
    files:
        keepfiles: 2
        name: metricbeat.log
        path: C:\Metricbeat\logs
    level: debug
    to_files: true
    to_syslog: false
metricbeat.modules:
-   enabled: true
    metricsets:
    - service
    module: windows
    period: 10s
output:
    elasticsearch:
        enabled: true
        hosts:
        - http://elastic.example.com:9200
        index: metricbeat-%{[beat.version]}
        password: changeme
        username: elastic
setup:
    template:
        name: metricbeat-%{[beat.version]}
        pattern: metricbeat-%{[beat.version]}-*
The last Windows row uses the (still beta) perfmon metricset and is configured
like this:
logging:
    files:
        keepfiles: 2
        name: metricbeat.log
        path: C:\Metricbeat\logs
    level: debug
    to_files: true
    to_syslog: false
metricbeat.modules:
-   enabled: true
    metricsets:
    - perfmon
    module: windows
    perfmon.counters:
    -   instance_label: processor_name
        instance_name: total
        measurement_label: processor.time.total.pct
        query: \Processor(_Total)\% Processor Time
    -   instance_label: physical_disk.name
        instance_name: total
        measurement_label: physical_disk.time.total.pct
        query: \PhysicalDisk(_Total)\% Disk Time
    -   instance_label: physical_disk.name
        instance_name: total
        measurement_label: physical_disk.time.read.pct
        query: \PhysicalDisk(_Total)\% Disk Read Time
    -   instance_label: physical_disk.name
        instance_name: total
        measurement_label: physical_disk.time.write.pct
        query: \PhysicalDisk(_Total)\% Disk Write Time
    -   instance_label: logical_disk.name
        instance_name: total
        measurement_label: logical_disk.space.free.pct
        query: \LogicalDisk(_Total)\% Free Space
    -   instance_label: logical_disk.name
        instance_name: total
        measurement_label: logical_disk.space.free.mb
        query: \LogicalDisk(_Total)\Free Megabytes
    -   instance_label: paging_file.name
        instance_name: total
        measurement_label: paging_file.usage.pct
        query: \Paging File(_Total)\% Usage
    -   instance_label: memory.name
        instance_name: total
        measurement_label: memory.available.mbytes
        query: \Memory()\Available MBytes
    -   instance_label: numa_node.name
        instance_name: total
        measurement_label: numa_node.memory.available.mbytes
        query: \NUMA Node Memory(_Total)\Available MBytes
    -   instance_label: numa_node.name
        instance_name: total
        measurement_label: numa_node.memory.total.mbytes
        query: \NUMA Node Memory(_Total)\Total MBytes
    -   instance_label: system_info.name
        instance_name: total
        measurement_label: system.processes.count
        query: \System()\Processes
    -   instance_label: system_info.name
        instance_name: total
        measurement_label: system.threads.count
        query: \System()\Threads
    -   instance_label: system_info.name
        instance_name: total
        measurement_label: system.uptime.seconds
        query: \System()\System Up Time
    -   instance_label: network_interface.name
        instance_name: vmxnet3
        measurement_label: network.interface.in.bytes
        query: \Network Interface(vmxnet3 Ethernet Adapter)\Bytes Received/sec
    -   instance_label: network_interface.name
        instance_name: vmxnet3
        measurement_label: network.interface.out.bytes
        query: \Network Interface(vmxnet3 Ethernet Adapter)\Bytes Sent/sec
    perfmon.group_measurements_by_instance: true
    perfmon.ignore_non_existent_counters: true
    period: 10s
output:
    elasticsearch:
        enabled: true
        hosts:
        - http://elastic.example.com:9200
        index: metricbeat-%{[beat.version]}
        password: changeme
        username: elastic
setup:
    template:
        name: metricbeat-%{[beat.version]}
        pattern: metricbeat-%{[beat.version]}-*
Conclusions
Metricbeat is just one piece of a holistic Application and Infrastructure
monitoring solution.  We'll need this metric information along with logging and
application performance monitoring (APM) data to paint a good picture of the
health and status of applications and infrastructure, and to alarm/alert when
the data tells us something is abnormal.
Running metricbeat on Linux and collecting all but the raid and socket
metricsets at a 10-second collection period accounts for nearly 1GB of data per
day per monitored host.  For Windows, the service metricset currently comes to
about 300MB of data per day per monitored host.  If you have 1000+ hosts to
monitor, this could easily top 1TB per day.  And again, this does not account
for other data sources such as logs, APM telemetry, etc.  Nor does it account
for ElasticSearch high availability, where multiple primary and replica shards
keep the data resilient and accessible.
The moral of the story here is that you will need to clearly define the hosts
you want to monitor and come up with a data retention strategy.  You'll likely
need to make use of Curator and/or ILM policies to ensure you're not hanging on
to perishable data for too long.  Another strategy may be to make use of rollup
indexes to summarize metric data over time.
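As an illustration, a Curator action file for age-based cleanup might look
something like the sketch below.  It assumes daily, date-suffixed metricbeat
indices (e.g. metricbeat-*-YYYY.MM.dd) and a 30-day retention window, neither
of which matches the experiment indexes above, so treat it as a starting point
rather than a drop-in policy.

```yaml
# Sketch: delete metricbeat indices more than 30 days old, matching on a
# date embedded in the index name.
actions:
  1:
    action: delete_indices
    description: Drop metricbeat indices past the retention window
    options:
      ignore_empty_list: true
    filters:
    - filtertype: pattern
      kind: prefix
      value: metricbeat-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 30
```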
I'm certainly open to hearing stories from the community on how they've used
ElasticSearch as the heart of an Application and Infrastructure monitoring platform
and the strategies they've employed to balance functionality and resource
utilization.