We are evaluating Metricbeat as a possible replacement for our internal performance and availability monitoring tool, which collects basic metrics around CPU, memory, disk, and network once a minute. During a PoC we collected the equivalent Beats data from a handful of servers and found that sampling data over the same time frame resulted in fairly large storage usage (several hundred MB for a single server). This was not optimized at all, so I'm sure it can be reduced by being more selective about what we collect, but it does concern us because we want to be able to monitor several thousand servers.
I'm wondering if anyone has experience collecting Metricbeat data from a large number of servers and can speak to how they manage the storage and retention requirements. Is there any approach for aggregating results into larger time frames as the data ages, so that granularity is traded for lower storage?
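To make the aggregation idea concrete, here is a minimal sketch of the kind of downsampling I have in mind. It is plain Python and not a Metricbeat or Elasticsearch feature; the `downsample` helper and the sample readings are hypothetical, and averaging is only one possible way to summarize a bucket.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds):
    """Average (epoch_seconds, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, average) tuples sorted by bucket start.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical per-minute readings covering two hours (120 samples).
per_minute = [(minute * 60, float(minute % 10)) for minute in range(120)]

# Older data kept at hourly granularity: 120 samples become 2.
hourly = downsample(per_minute, 3600)
print(len(per_minute), "->", len(hourly))
```

The trade-off is that short spikes disappear in the averages, so keeping a max or percentile per bucket alongside the mean may be worth the extra fields.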
The main consumer of storage space is usually the "per process" stats. If you disable the "process" metricset in the system module, you will likely see a drastic reduction in storage size. If you want the "per process" information, you can whitelist a set of processes to monitor.
For some of the data, you can poll less often. For example, the file system stats (which also generate a lot of data) are fairly static, so you can increase their polling period to 30s.
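As a rough illustration of why the period matters, the number of events a metricset produces scales inversely with its polling period. The sketch below is just arithmetic; the 10s starting point is an assumption for comparison, and the helper names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(period_seconds):
    """Number of polls one metricset makes per day at a fixed period."""
    return SECONDS_PER_DAY // period_seconds


def relative_volume(old_period, new_period):
    """Fraction of the original event count left after changing the period."""
    return samples_per_day(new_period) / samples_per_day(old_period)


for period in (10, 30, 60):
    print(f"{period}s period: {samples_per_day(period)} polls per day")

# Going from an assumed 10s period to 30s keeps about a third of the events.
print(round(relative_volume(10, 30), 2))
```

Note that a metricset such as filesystem emits one event per mounted file system on every poll, so the saving is multiplied by the number of mounts.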
By default ES uses 5 shards and 1 replica per index. Depending on your setup, you can go down to one shard and zero replicas, which gives a significant reduction, though whether that is appropriate depends on other factors, such as how much redundancy you need.
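The replica part is easy to reason about: each replica is a full copy of the primary data, so disk usage is roughly the primary size times (1 + replicas). Here is a small sketch of that arithmetic. The 300 MB per server and 2000 servers are hypothetical placeholders, not measurements, and the function names are invented for this example.

```python
def stored_copies(replicas):
    """Each replica is a full copy, so data is stored 1 + replicas times."""
    return 1 + replicas


def cluster_storage_mb(primary_mb_per_server, servers, replicas):
    """Rough total disk usage across the cluster, ignoring per-shard overhead."""
    return primary_mb_per_server * servers * stored_copies(replicas)


# Hypothetical figures: 300 MB of primary data per server, 2000 servers.
with_replica = cluster_storage_mb(300, 2000, replicas=1)
without_replica = cluster_storage_mb(300, 2000, replicas=0)

print(f"1 replica:  {with_replica / 1024:.0f} GB")
print(f"0 replicas: {without_replica / 1024:.0f} GB")
```

Dropping the replica halves the disk usage in this model, but a lost node then means lost data until it is re-collected, so this is only reasonable if you can tolerate gaps.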
You can use processors to filter out the fields that you don't use. You should review the fields that are added to all objects (beat.*, metricset.*) and see if you need all of them.
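To get a feel for how much the common fields cost per event, you can trim a sample event and compare the serialized sizes. The sketch below is plain Python that mimics what dropping fields does. It is not the Metricbeat processor itself, and the event is a hand-written illustration rather than real output.

```python
import copy
import json


def drop_fields(event, dotted_paths):
    """Return a copy of event with the given dotted field paths removed."""
    trimmed = copy.deepcopy(event)
    for path in dotted_paths:
        *parents, leaf = path.split(".")
        node = trimmed
        for key in parents:
            node = node.get(key)
            if not isinstance(node, dict):
                break  # Path does not exist in this event; nothing to drop.
        else:
            node.pop(leaf, None)
    return trimmed


# Hand-written example event (illustrative shape and values only).
event = {
    "@timestamp": "2017-01-01T00:00:00.000Z",
    "beat": {"hostname": "web-01", "name": "web-01", "version": "5.1.1"},
    "metricset": {"module": "system", "name": "cpu", "rtt": 115},
    "system": {"cpu": {"user": {"pct": 0.05}, "system": {"pct": 0.01}}},
}

trimmed = drop_fields(event, ["beat.name", "beat.version", "metricset.rtt"])
before = len(json.dumps(event))
after = len(json.dumps(trimmed))
print(f"{before} bytes -> {after} bytes")
```

Raw JSON size is only a proxy for on-disk size, since ES compresses and indexes the data, but it helps rank which fields are worth dropping.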
You can disable the _source and _all fields, although in my testing this doesn't buy much of an optimization compared with the other suggestions, and it does reduce functionality.
There are other optimizations possible, but the ones above gave the highest returns in my tests. We will be evaluating the defaults for 6.0, so that Metricbeat is more efficient by default.