Metricbeat Storage in ElasticSearch index

Greetings all,

I'm designing a monitoring and alerting platform and intend to base it largely on ElasticSearch, using Metricbeat for per-host metrics collection. I've looked up some general (and often conflicting) ElasticSearch "best practices" for cluster, index, and shard configuration, and I know the general answer to my question is "it depends": on how much data you want to keep, how hot/warm/cold you want it, whether you'll use roll-up indices, which metricbeat modules and metricsets you will use, etc.

Given that, I still need to start planning and get an initial cluster sized "the best I can". Assuming I know the number of hosts (machines/VMs) I want to monitor via metricbeat, and assuming I use the default out-of-the-box configuration for the system module, approximately how much data can I expect to ingest into ElasticSearch per monitored host per hour (or day, or week)?

For example, assume my default metricbeat config looks like this:

                - module: system
                  metricsets:
                    - cpu
                    - load
                    - memory
                    - network
                    - process
                    - process_summary
                    # - uptime
                    # - core
                    # - diskio
                    # - filesystem
                    # - fsstat
                    # - raid
                    # - socket
                  enabled: true
                  period: 10s
                  processes: ['.*']

How much ES index storage does that translate to on a per-host basis? I know there is a lot of fine-tuning that can be done (e.g. filtering top_N processes), but what does the above "default" configuration typically yield (assuming ES/Beats 7.x)?

Are there any "rules of thumb" guides out there that generalize the storage volume generated by the various metricbeat modules/configurations?



MetricBeat Stats

The following table summarizes some experiments running the metricbeat agent on
various platforms and with various configurations to determine how much
ElasticSearch storage the metrics consume over periods of time. The idea here
is to get a per-monitored-host storage requirement so we can appropriately size
an Elastic cluster to serve as an infrastructure monitoring platform.

Stats Collection Methodology

Since I already use Ansible for configuration management of hosts and have
playbooks for managing ElasticSearch, I used them to spin up a new VM host with
ElasticSearch. The node was configured as a master/ingest/data node. The VM
also has Kibana installed so I can easily observe index creation and
document arrival into the index.

Once initially provisioned, ElasticSearch has no indices and is ready to
receive data. I then used Ansible to provision the metricbeat agent to
a host and configure the metricbeat modules and metricsets. Next, I ran the
metricbeat agent for 5 minutes and observed the created index size in Kibana,
noting the number of documents and the size of the index. I recorded this
information in a spreadsheet, where simple cell-based multiplication calculates
the extrapolated storage for 1 hour, 1 day, 1 month, and 1 year increments.

After each 5-minute run, I deleted the index, reconfigured metricbeat, and ran
the agent for another 5 minutes. I performed this with metricbeat running on
Linux (the same node running ElasticSearch/Kibana) and with metricbeat running
on Windows.
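
The spreadsheet extrapolation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the actual spreadsheet; the function and constant names are made up here, and the 30-day month and 365-day year are assumptions inferred from the numbers in the table.

```python
# Extrapolate a 5-minute index sample to longer windows.
# Assumption: 1 month = 30 days and 1 year = 365 days, which is what
# the figures in the table are consistent with.
SAMPLE_MINUTES = 5

WINDOWS_IN_MINUTES = {
    "1hr": 60,
    "1d": 60 * 24,
    "1mo": 60 * 24 * 30,
    "1y": 60 * 24 * 365,
}


def extrapolate(sample_value, sample_minutes=SAMPLE_MINUTES):
    """Linearly scale a value measured over sample_minutes to each window."""
    return {
        name: sample_value * minutes / sample_minutes
        for name, minutes in WINDOWS_IN_MINUTES.items()
    }


if __name__ == "__main__":
    # First table row: Linux, system module, 5-second period.
    docs = extrapolate(7617)
    size_mb = extrapolate(6.1)
    for name in WINDOWS_IN_MINUTES:
        print(f"{name}: {docs[name]:.0f} docs, "
              f"{size_mb[name]:.1f} MB, {size_mb[name] / 1024:.3f} GB")
```

Keep in mind that a linear extrapolation from a 5-minute sample ignores effects such as segment merging and compression over a longer run, so the results are rough estimates.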


| Platform | Modules | Period (s) | # of Docs (5m) | Size (MB) (5m) | # of Docs (1hr) | Size (MB) (1hr) | # of Docs (1d) | Size (MB) (1d) | # of Docs (1mo) | Size (MB) (1mo) | Size (GB) (1mo) | # of Docs (1y) | Size (MB) (1y) | Size (GB) (1y) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linux | system | 5 | 7617 | 6.1 | 91404 | 73.2 | 2.1937e+06 | 1756.8 | 6.58109e+07 | 52704 | 51.4688 | 8.00699e+08 | 641232 | 626.203 |
| Linux | system | 10 | 3673 | 3 | 44076 | 36 | 1.05782e+06 | 864 | 3.17347e+07 | 25920 | 25.3125 | 3.86106e+08 | 315360 | 307.969 |
| Linux | system | 20 | 1838 | 1.5 | 22056 | 18 | 529344 | 432 | 1.58803e+07 | 12960 | 12.6562 | 1.93211e+08 | 157680 | 153.984 |
| Linux | system | 10 | 3828 | 3.2 | 45936 | 38.4 | 1.10246e+06 | 921.6 | 3.30739e+07 | 27648 | 27 | 4.02399e+08 | 336384 | 328.5 |
| Linux | system | 10 | 3937 | 3.2 | 47244 | 38.4 | 1.13386e+06 | 921.6 | 3.40157e+07 | 27648 | 27 | 4.13857e+08 | 336384 | 328.5 |
| Linux | system | 10 | 4206 | 3.3 | 50472 | 39.6 | 1.21133e+06 | 950.4 | 3.63398e+07 | 28512 | 27.8438 | 4.42135e+08 | 346896 | 338.766 |
| Linux | system | 10 | 4271 | 3.4 | 51252 | 40.8 | 1.23005e+06 | 979.2 | 3.69014e+07 | 29376 | 28.6875 | 4.48968e+08 | 357408 | 349.031 |
| Linux | system | 10 | 4173 | 3.3 | 50076 | 39.6 | 1.20182e+06 | 950.4 | 3.60547e+07 | 28512 | 27.8438 | 4.38666e+08 | 346896 | 338.766 |
| Windows | windows | 10 | 5790 | 1 | 69480 | 12 | 1.66752e+06 | 288 | 5.00256e+07 | 8640 | 8.4375 | 6.08645e+08 | 105120 | 102.656 |
| Windows | windows | 10 | 444 | 0.12 | 5328 | 1.44 | 127872 | 34.56 | 3.83616e+06 | 1036.8 | 1.0125 | 4.66733e+07 | 12614.4 | 12.3187 |

The first 3 rows correspond to a "default" metricbeat configuration, where the
only variation is the metricbeat reporting period. The default reporting period
is 10 seconds (row 2), but I also experimented with cutting that time in half
for more resolution (row 1) and doubling that time for storage considerations
(row 3). The configuration below is representative of these cases.

logging:
  files: {keepfiles: 2, name: metricbeat.log, path: /var/log/}
  level: warning
  to_files: true
  to_syslog: false
metricbeat.modules:
- enabled: true
  metricsets: [cpu, load, memory, network, process, process_summary]
  module: system
  period: 10s
  processes: [.*]
output:
  elasticsearch:
    enabled: true
    hosts: ['']
    index: metricbeat-%{[beat.version]}-default
    password: changeme
    username: elastic
setup:
  dashboards: {enabled: true}
  kibana: {host: ''}
  template: {name: 'metricbeat-%{[beat.version]}', pattern: 'metricbeat-%{[beat.version]}-*'}
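
The three period variations in the table suggest that storage scales roughly inversely with the reporting period. A small sketch of that relationship, assuming a simple inverse-proportional model (my simplification, not something Metricbeat guarantees), with a hypothetical helper name:

```python
def scale_by_period(size_mb, measured_period_s, target_period_s):
    """Estimate storage at target_period_s from a measurement taken at
    measured_period_s, assuming volume is inversely proportional to period."""
    if measured_period_s <= 0 or target_period_s <= 0:
        raise ValueError("periods must be positive")
    return size_mb * measured_period_s / target_period_s


# Measured 5-minute index sizes (MB) from the first three table rows,
# keyed by reporting period in seconds.
MEASURED_5M_MB = {5: 6.1, 10: 3.0, 20: 1.5}

if __name__ == "__main__":
    baseline = MEASURED_5M_MB[10]
    for period, measured in sorted(MEASURED_5M_MB.items()):
        predicted = scale_by_period(baseline, 10, period)
        print(f"period {period:>2}s: predicted {predicted:.1f} MB, "
              f"measured {measured:.1f} MB")
```

The predictions line up closely with the measured 5-second and 20-second rows, so the 10-second measurement is a reasonable starting point for estimating other periods.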

The next 5 rows correspond to a default 10-second reporting period, but for each
row I added an additional metricset. I incrementally added the uptime, core,
diskio, filesystem, and fsstat metricsets.

The last 2 rows correspond to metricbeat running on Windows. The first Windows
row is the metricbeat windows module collecting the service metricset. The
configuration looks like this:

logging:
    files:
        keepfiles: 2
        name: metricbeat.log
        path: C:\Metricbeat\logs
    level: debug
    to_files: true
    to_syslog: false
metricbeat.modules:
-   enabled: true
    metricsets:
    - service
    module: windows
    period: 10s
output:
    elasticsearch:
        enabled: true
        index: metricbeat-%{[beat.version]}
        password: changeme
        username: elastic
setup:
    template:
        name: metricbeat-%{[beat.version]}
        pattern: metricbeat-%{[beat.version]}-*

The last Windows row uses the (still beta) perfmon metricset and is configured
like this:

logging:
    files:
        keepfiles: 2
        name: metricbeat.log
        path: C:\Metricbeat\logs
    level: debug
    to_files: true
    to_syslog: false
metricbeat.modules:
-   enabled: true
    metricsets:
    - perfmon
    module: windows
    perfmon.counters:
    -   instance_label: processor_name
        instance_name: total
        query: \Processor(_Total)\% Processor Time
    -   instance_label:
        instance_name: total
        query: \PhysicalDisk(_Total)\% Disk Time
    -   instance_label:
        instance_name: total
        query: \PhysicalDisk(_Total)\% Disk Read Time
    -   instance_label:
        instance_name: total
        measurement_label: physical_disk.time.write.pct
        query: \PhysicalDisk(_Total)\% Disk Write Time
    -   instance_label:
        instance_name: total
        query: \LogicalDisk(_Total)\% Free Space
    -   instance_label:
        instance_name: total
        query: \LogicalDisk(_Total)\Free Megabytes
    -   instance_label:
        instance_name: total
        measurement_label: paging_file.usage.pct
        query: \Paging File(_Total)\% Usage
    -   instance_label:
        instance_name: total
        measurement_label: memory.available.mbytes
        query: \Memory()\Available MBytes
    -   instance_label:
        instance_name: total
        measurement_label: numa_node.memory.available.mbytes
        query: \NUMA Node Memory(_Total)\Available MBytes
    -   instance_label:
        instance_name: total
        query: \NUMA Node Memory(_Total)\Total MBytes
    -   instance_label:
        instance_name: total
        measurement_label: system.processes.count
        query: \System()\Processes
    -   instance_label:
        instance_name: total
        measurement_label: system.threads.count
        query: \System()\Threads
    -   instance_label:
        instance_name: total
        measurement_label: system.uptime.seconds
        query: \System()\System Up Time
    -   instance_label:
        instance_name: vmxnet3
        query: \Network Interface(vmxnet3 Ethernet Adapter)\Bytes Received/sec
    -   instance_label:
        instance_name: vmxnet3
        measurement_label: network.interface.out.bytes
        query: \Network Interface(vmxnet3 Ethernet Adapter)\Bytes Sent/sec
    perfmon.group_measurements_by_instance: true
    perfmon.ignore_non_existent_counters: true
    period: 10s
output:
    elasticsearch:
        enabled: true
        index: metricbeat-%{[beat.version]}
        password: changeme
        username: elastic
setup:
    template:
        name: metricbeat-%{[beat.version]}
        pattern: metricbeat-%{[beat.version]}-*


Metricbeat is just one piece of the holistic Application and Infrastructure
monitoring solution. We'll need this metric information along with logging and
application performance (APM) data to paint a good picture of the health and
status of applications and infrastructure, and to alarm/alert when the data
tells us something is abnormal.

Running metricbeat on Linux and collecting all but the raid and socket
metricsets at a 10-second collection period accounts for nearly 1GB of data per
day per monitored host. For Windows, it is currently about 300MB of data per
day per monitored host. If you have 1000+ hosts to monitor, this could easily
top 1TB per day. And again, this is not accounting for other data sources such
as logs, APM telemetry, etc. Furthermore, this isn't taking into account
ElasticSearch data high availability with multiple primaries and replicas keeping
the data resilient/accessible.
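
To make the fleet-level numbers concrete, here is a rough sizing sketch. The per-host figure comes from the table (979.2 MB/day for Linux with the extra metricsets at a 10-second period); the function name, the replica count, and the retention period are hypothetical inputs, and I'm assuming the measured index sizes reflect primary shards only, since they came from a single-node cluster.

```python
def fleet_storage_gb(hosts, mb_per_host_per_day, retention_days=1, replicas=0):
    """Estimate total storage in GB for a fleet of monitored hosts.

    replicas is the number of replica copies per primary shard; each copy
    is assumed to take the same space as the primary.
    """
    if hosts < 0 or retention_days < 0 or replicas < 0:
        raise ValueError("inputs must be non-negative")
    primary_mb = hosts * mb_per_host_per_day * retention_days
    total_mb = primary_mb * (1 + replicas)
    return total_mb / 1024


if __name__ == "__main__":
    linux_mb_per_day = 979.2  # from the table, Linux at a 10 s period

    per_day = fleet_storage_gb(1000, linux_mb_per_day)
    print(f"1000 hosts, primaries only: {per_day:.2f} GB/day")

    # Hypothetical policy: keep 30 days with one replica.
    retained = fleet_storage_gb(1000, linux_mb_per_day,
                                retention_days=30, replicas=1)
    print(f"1000 hosts, 30 days, 1 replica: {retained / 1024:.2f} TB")
```

Even with a single replica the footprint doubles, so the replica count and retention period matter as much as the per-host rate.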

The moral of the story here is you will need to clearly define the hosts you
want to monitor and come up with some sort of data retention strategy. You'll
likely need to make use of Curator and/or ILM policies to ensure you're not
hanging on to perishable data for too long. Another strategy may be to make
use of rollup indexes to summarize metric data over time.
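
As an illustration of the ILM approach, the sketch below builds a policy body that rolls indices over in the hot phase and deletes them after a retention period. The helper name, the 30-day retention, and the rollover thresholds are placeholder assumptions, not recommendations; the resulting JSON is the kind of body you would send to the `_ilm/policy/<policy_name>` endpoint.

```python
import json


def build_ilm_policy(retention_days, rollover_max_age="1d",
                     rollover_max_size="50gb"):
    """Return an ILM policy body: roll over while hot, then delete.

    When rollover is used, the delete phase's min_age is counted from
    the time the index was rolled over.
    """
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": rollover_max_age,
                            "max_size": rollover_max_size,
                        }
                    }
                },
                "delete": {
                    "min_age": f"{retention_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical 30-day retention for metric indices.
    print(json.dumps(build_ilm_policy(30), indent=2))
```

Shorter retention for raw metrics combined with rollups for long-term trends is one way to trade resolution for storage.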

I'm certainly open to hearing stories from the community on how they've used
ElasticSearch as the heart of an Application and Infrastructure monitoring platform
and the strategies they've employed to balance functionality and resource
requirements.
