Metricbeat Storage in Elasticsearch Index

Greetings all,

I'm designing a monitoring and alerting platform that I intend to base largely on Elasticsearch, using Metricbeat for per-host metrics collection. I've looked up some general (and often conflicting) Elasticsearch "best practices" for cluster, index, and shard configuration, and I know the stock answer to my question is "it depends": on how much data you want to keep, how hot/warm/cold you want it, whether you'll use rollup indices, which Metricbeat modules and metricsets you enable, and so on.

Given that, I still need to start planning and size an initial cluster as best I can. Assuming I know the number of hosts (machines/VMs) I want to monitor with Metricbeat, and assuming the default out-of-the-box configuration for the system module, approximately how much data can I expect to ingest into Elasticsearch per monitored host per hour (or day, or week)?

For example, assume my default Metricbeat config looks like this:

metricbeat.modules:
  - module: system
    metricsets:
      - cpu
      - load
      - memory
      - network
      - process
      - process_summary
      # - uptime
      # - core
      # - diskio
      # - filesystem
      # - fsstat
      # - raid
      # - socket
    enabled: true
    period: 10s
    processes: ['.*']

How much Elasticsearch index storage does that translate to on a per-host basis? I know there is a lot of fine-tuning that can be done (e.g., filtering to the top N processes), but what does the above "default" configuration typically yield (assuming Elasticsearch/Beats 7.x)?

Are there any rule-of-thumb guides out there that generalize the storage volume generated by the various Metricbeat modules and configurations?

Thanks!

Ben

Metricbeat Stats

The following table summarizes experiments running the Metricbeat agent on
various platforms with various configurations, measuring how much Elasticsearch
storage the metrics consume over time. The goal is a per-monitored-host storage
requirement so we can appropriately size an Elastic cluster to serve as an
infrastructure monitoring platform.

Stats Collection Methodology

Since I already use Ansible for configuration management and have playbooks for
managing Elasticsearch, I used them to spin up a new VM running Elasticsearch.
The node was configured as a combined master/ingest/data node. The VM also has
Kibana installed so I can easily observe index creation and document arrival.

Once provisioned, Elasticsearch was running with no indices, ready to receive
data. I then used Ansible to deploy the Metricbeat agent to a host and
configure its modules and metricsets. Next, I ran the agent for 5 minutes and
observed the created index in Kibana, noting the number of documents and the
size of the index. I recorded this information in a spreadsheet, where simple
cell-based multiplication extrapolates the storage to 1-hour, 1-day, 1-month,
and 1-year increments.
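The spreadsheet arithmetic is simple linear extrapolation. As a sketch (the 3673-document / 3 MB sample figures come from the default 10-second Linux run in the results table):

```python
# Linear extrapolation of a 5-minute Metricbeat sample to longer windows.
# Sample figures: default 10 s Linux run, 3673 docs and 3 MB in 5 minutes.
SAMPLE_MINUTES = 5
docs_5m, mb_5m = 3673, 3.0

periods = {"1 hour": 60, "1 day": 60 * 24, "1 month": 60 * 24 * 30, "1 year": 60 * 24 * 365}

for label, minutes in periods.items():
    scale = minutes / SAMPLE_MINUTES
    print(f"{label}: {docs_5m * scale:.0f} docs, {mb_5m * scale:.1f} MB")
```

Note this assumes ingest is perfectly steady; a 5-minute sample smooths over nothing, so treat the month and year figures as rough upper-level estimates.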

After each 5-minute run, I deleted the index, reconfigured Metricbeat, and
repeated the process. I did this with Metricbeat running on Linux (the same
node running Elasticsearch/Kibana) and with Metricbeat running on Windows.

Results

| Platform | Modules | Period (s) | Docs (5m) | MB (5m) | Docs (1h) | MB (1h) | Docs (1d) | MB (1d) | Docs (1mo) | MB (1mo) | GB (1mo) | Docs (1y) | MB (1y) | GB (1y) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linux | system | 5 | 7617 | 6.1 | 91404 | 73.2 | 2.1937e+06 | 1756.8 | 6.58109e+07 | 52704 | 51.4688 | 8.00699e+08 | 641232 | 626.203 |
| Linux | system | 10 | 3673 | 3 | 44076 | 36 | 1.05782e+06 | 864 | 3.17347e+07 | 25920 | 25.3125 | 3.86106e+08 | 315360 | 307.969 |
| Linux | system | 20 | 1838 | 1.5 | 22056 | 18 | 529344 | 432 | 1.58803e+07 | 12960 | 12.6562 | 1.93211e+08 | 157680 | 153.984 |
| Linux | system | 10 | 3828 | 3.2 | 45936 | 38.4 | 1.10246e+06 | 921.6 | 3.30739e+07 | 27648 | 27 | 4.02399e+08 | 336384 | 328.5 |
| Linux | system | 10 | 3937 | 3.2 | 47244 | 38.4 | 1.13386e+06 | 921.6 | 3.40157e+07 | 27648 | 27 | 4.13857e+08 | 336384 | 328.5 |
| Linux | system | 10 | 4206 | 3.3 | 50472 | 39.6 | 1.21133e+06 | 950.4 | 3.63398e+07 | 28512 | 27.8438 | 4.42135e+08 | 346896 | 338.766 |
| Linux | system | 10 | 4271 | 3.4 | 51252 | 40.8 | 1.23005e+06 | 979.2 | 3.69014e+07 | 29376 | 28.6875 | 4.48968e+08 | 357408 | 349.031 |
| Linux | system | 10 | 4173 | 3.3 | 50076 | 39.6 | 1.20182e+06 | 950.4 | 3.60547e+07 | 28512 | 27.8438 | 4.38666e+08 | 346896 | 338.766 |
| Windows | windows | 10 | 5790 | 1 | 69480 | 12 | 1.66752e+06 | 288 | 5.00256e+07 | 8640 | 8.4375 | 6.08645e+08 | 105120 | 102.656 |
| Windows | windows | 10 | 444 | 0.12 | 5328 | 1.44 | 127872 | 34.56 | 3.83616e+06 | 1036.8 | 1.0125 | 4.66733e+07 | 12614.4 | 12.3187 |

The first 3 rows correspond to a "default" Metricbeat configuration, where the
only variation is the reporting period. The default period is 10 seconds
(row 2); I also tried halving it for more resolution (row 1) and doubling it to
save storage (row 3). The configuration below is representative of these cases.
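Two things fall out of those three rows: document volume scales roughly linearly with sampling frequency, and the on-disk cost per document is nearly constant. A quick check using the 5-minute figures from the table:

```python
# Per-document index cost for the three "default config" rows.
# Keys are the reporting period in seconds; values are (docs, MB) over 5 minutes,
# taken from the results table above.
runs = {5: (7617, 6.1), 10: (3673, 3.0), 20: (1838, 1.5)}

for period, (docs, mb) in runs.items():
    kb_per_doc = mb * 1024 / docs
    print(f"period={period}s: {docs} docs, {kb_per_doc:.2f} kB/doc")
```

All three land around 0.82-0.84 kB per document, so for this module set you can estimate storage as (docs per period) x (periods per day) x ~0.84 kB.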

logging:
  files: {keepfiles: 2, name: metricbeat.log, path: /var/log/}
  level: warning
  to_files: true
  to_syslog: false
metricbeat.modules:
- enabled: true
  metricsets: [cpu, load, memory, network, process, process_summary]
  module: system
  period: 10s
  processes: [.*]
output:
  elasticsearch:
    enabled: true
    hosts: ['http://elastic.example.com:9200']
    index: metricbeat-%{[beat.version]}-default
    password: changeme
    username: elastic
setup:
  dashboards: {enabled: true}
  kibana: {host: 'http://kibana.example.com:5601'}
  template: {name: 'metricbeat-%{[beat.version]}', pattern: 'metricbeat-%{[beat.version]}-*'}

The next 5 rows correspond to a default 10-second reporting period, but for each
row I added an additional metricset. I incrementally added the uptime, core,
diskio, filesystem, and fsstat metricsets.
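Comparing each incremental row against the default-configuration baseline gives a rough marginal cost per added metricset. This sketch assumes rows 4-8 of the table map, in order, to the metricsets listed above, and uses the 1-day MB figures:

```python
# Rough marginal storage cost of each metricset added on top of the default
# config, using the 1-day MB figures from the table (10 s period rows).
baseline_mb_day = 864.0  # cpu, load, memory, network, process, process_summary
added = [("uptime", 921.6), ("core", 921.6), ("diskio", 950.4),
         ("filesystem", 979.2), ("fsstat", 950.4)]

prev = baseline_mb_day
for name, mb_day in added:
    print(f"+{name}: {mb_day - prev:+.1f} MB/day (running total {mb_day} MB/day)")
    prev = mb_day
```

Note the fsstat row comes out lower than the filesystem row; with only a 5-minute sample per run, deltas this small are mostly measurement noise, so read these as ballpark figures.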

The last 2 rows correspond to metricbeat running on Windows. The first Windows
row is the metricbeat windows module collecting the service metricset. The
configuration looks like this:

logging:
    files:
        keepfiles: 2
        name: metricbeat.log
        path: C:\Metricbeat\logs
    level: debug
    to_files: true
    to_syslog: false
metricbeat.modules:
-   enabled: true
    metricsets:
    - service
    module: windows
    period: 10s
output:
    elasticsearch:
        enabled: true
        hosts:
        - http://elastic.example.com:9200
        index: metricbeat-%{[beat.version]}
        password: changeme
        username: elastic
setup:
    template:
        name: metricbeat-%{[beat.version]}
        pattern: metricbeat-%{[beat.version]}-*

The last Windows row uses the (still beta) perfmon metricset and is configured
like this:

logging:
    files:
        keepfiles: 2
        name: metricbeat.log
        path: C:\Metricbeat\logs
    level: debug
    to_files: true
    to_syslog: false
metricbeat.modules:
-   enabled: true
    metricsets:
    - perfmon
    module: windows
    perfmon.counters:
    -   instance_label: processor_name
        instance_name: total
        measurement_label: processor.time.total.pct
        query: \Processor(_Total)\% Processor Time
    -   instance_label: physical_disk.name
        instance_name: total
        measurement_label: physical_disk.time.total.pct
        query: \PhysicalDisk(_Total)\% Disk Time
    -   instance_label: physical_disk.name
        instance_name: total
        measurement_label: physical_disk.time.read.pct
        query: \PhysicalDisk(_Total)\% Disk Read Time
    -   instance_label: physical_disk.name
        instance_name: total
        measurement_label: physical_disk.time.write.pct
        query: \PhysicalDisk(_Total)\% Disk Write Time
    -   instance_label: logical_disk.name
        instance_name: total
        measurement_label: logical_disk.space.free.pct
        query: \LogicalDisk(_Total)\% Free Space
    -   instance_label: logical_disk.name
        instance_name: total
        measurement_label: logical_disk.space.free.mb
        query: \LogicalDisk(_Total)\Free Megabytes
    -   instance_label: paging_file.name
        instance_name: total
        measurement_label: paging_file.usage.pct
        query: \Paging File(_Total)\% Usage
    -   instance_label: memory.name
        instance_name: total
        measurement_label: memory.available.mbytes
        query: \Memory()\Available MBytes
    -   instance_label: numa_node.name
        instance_name: total
        measurement_label: numa_node.memory.available.mbytes
        query: \NUMA Node Memory(_Total)\Available MBytes
    -   instance_label: numa_node.name
        instance_name: total
        measurement_label: numa_node.memory.total.mbytes
        query: \NUMA Node Memory(_Total)\Total MBytes
    -   instance_label: system_info.name
        instance_name: total
        measurement_label: system.processes.count
        query: \System()\Processes
    -   instance_label: system_info.name
        instance_name: total
        measurement_label: system.threads.count
        query: \System()\Threads
    -   instance_label: system_info.name
        instance_name: total
        measurement_label: system.uptime.seconds
        query: \System()\System Up Time
    -   instance_label: network_interface.name
        instance_name: vmxnet3
        measurement_label: network.interface.in.bytes
        query: \Network Interface(vmxnet3 Ethernet Adapter)\Bytes Received/sec
    -   instance_label: network_interface.name
        instance_name: vmxnet3
        measurement_label: network.interface.out.bytes
        query: \Network Interface(vmxnet3 Ethernet Adapter)\Bytes Sent/sec
    perfmon.group_measurements_by_instance: true
    perfmon.ignore_non_existent_counters: true
    period: 10s
output:
    elasticsearch:
        enabled: true
        hosts:
        - http://elastic.example.com:9200
        index: metricbeat-%{[beat.version]}
        password: changeme
        username: elastic
setup:
    template:
        name: metricbeat-%{[beat.version]}
        pattern: metricbeat-%{[beat.version]}-*

Conclusions

Metricbeat is just one piece of a holistic application and infrastructure
monitoring solution. We'll need this metric information along with logging and
application performance monitoring (APM) data to paint a good picture of the
health and status of applications and infrastructure, and to alert when the
data tells us something is abnormal.

Running Metricbeat on Linux and collecting all but the raid and socket
metricsets at a 10-second period accounts for nearly 1 GB of data per day per
monitored host. For Windows, it is currently about 300 MB of data per day per
monitored host. If you have 1000+ hosts to monitor, this can easily top 1 TB
per day. Again, this does not account for other data sources such as logs and
APM telemetry, nor for Elasticsearch high availability, where replica shards
multiply the stored data to keep it resilient and accessible.
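To turn the table's numbers into cluster sizing, a back-of-envelope sketch helps; the host count, per-host rate, and replica count below are all assumptions to adjust for your environment:

```python
# Back-of-envelope daily storage for a monitored fleet (all inputs are
# assumptions to vary for your own environment).
hosts = 1000
mb_per_host_day = 950.0   # ~1 GB/day Linux figure from the table above
replicas = 1              # one replica copy per primary shard

raw_gb_day = hosts * mb_per_host_day / 1024
total_gb_day = raw_gb_day * (1 + replicas)
print(f"raw: {raw_gb_day:.0f} GB/day; with {replicas} replica(s): {total_gb_day:.0f} GB/day")
```

With a single replica, 1000 Linux hosts land just under 2 TB of new index data per day before logs or APM are even considered.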

The moral of the story is that you will need to clearly define the hosts you
want to monitor and settle on a data retention strategy. You'll likely need
Curator and/or ILM policies to ensure you're not hanging on to perishable data
for too long. Another strategy is to use rollup indices to summarize metric
data over time.
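As a sketch of the retention idea, an ILM policy that rolls indices over daily (or at a size threshold) and deletes them after 30 days could look like the JSON below; the policy name, thresholds, and retention window are assumptions, and the body would be PUT to `_ilm/policy/<policy-name>`:

```python
import json

# Hypothetical ILM policy: roll over at 50 GB or 1 day, delete after 30 days.
# All names and thresholds here are illustrative assumptions.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_size": "50gb", "max_age": "1d"}
                }
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}}
            }
        }
    }
}
print(json.dumps(policy, indent=2))
```

Pairing a policy like this with the index template keeps the per-host daily cost bounded to (daily rate) x (retention days) instead of growing without limit.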

I'm certainly open to hearing stories from the community on how they've used
Elasticsearch as the heart of an application and infrastructure monitoring
platform, and the strategies they've employed to balance functionality and
resource utilization.

