I have a largish(?) metricbeat installation which collects from ~2000 machines every minute, generating a ~100GB index with ~135M docs per day. Currently I have the indexes set to 5 shards with 2 replicas, which performs well enough when querying a few days of data. In this context, I have two questions:
Would you recommend splitting the daily indexes by metricset to reduce the sizes so that aggregations over longer time periods (>7 days) run more efficiently? If I split by metricset, should I adjust the number of shards?
What combination of forceMerge/shrink operations would be appropriate to consolidate older metricbeat data? I found this article, but it seems targeted at ES 2.0; is this advice still relevant?
Most of our current queries against metricbeat come from Kibana/Grafana dashboards, so they are aggregations over host filters and time buckets. I expect most queries would only request data from a single metricset.
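To make that concrete, here is a rough sketch of the kind of aggregation our dashboards run; the endpoint, index pattern, and field names (`beat.hostname`, `system.cpu.user.pct`) are placeholders for illustration, not our exact queries.

```python
# Hypothetical example of a dashboard-style query: filter on one host and a
# time range, then bucket by hour and average a CPU metric.
import requests

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"beat.hostname": "web-01"}},
                {"range": {"@timestamp": {"gte": "now-7d", "lte": "now"}}},
            ]
        }
    },
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "interval": "1h"},
            "aggs": {"cpu": {"avg": {"field": "system.cpu.user.pct"}}},
        }
    },
}

resp = requests.post("http://localhost:9200/metricbeat-*/_search", json=query)
print(resp.json()["aggregations"]["per_hour"]["buckets"][:3])
```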
Each daily metricbeat index is about 100GB with 135M documents.
We use the diskio, filesystem, process, network, and cpu system metricsets. Your question led me to look at a histogram of doc counts for each of these over one day, and the result was a little surprising, though it makes sense in hindsight:
As for shrinking shards, I did look into this, but it seems that since we currently have 5 shards per index, the only option would be to shrink to 1 shard. Is that correct?
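For reference, this is roughly the sequence I understand a shrink to require, based on my reading of the docs (index and node names are placeholders). As far as I can tell, the target shard count has to be a factor of the source's, which with 5 primaries would indeed leave 1 as the only smaller choice.

```python
# Rough sketch of a shrink; index and node names below are placeholders.
import requests

ES = "http://localhost:9200"
SOURCE = "metricbeat-2018.01.01"          # hypothetical daily index, 5 primaries
TARGET = "metricbeat-2018.01.01-shrunk"

# 1. Block writes and move a copy of every shard onto a single node.
requests.put(f"{ES}/{SOURCE}/_settings", json={
    "settings": {
        "index.routing.allocation.require._name": "node-1",
        "index.blocks.write": True,
    }
})

# 2. Once relocation has finished, shrink into a 1-shard index.
requests.post(f"{ES}/{SOURCE}/_shrink/{TARGET}", json={
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 2,
    }
})
```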
As you have only 5 metricsets, having one index per metricset would probably give you an improvement, as related data would also be stored closer together. I have wanted to run some Rally benchmarks on this for quite some time, to see whether sorting on ingest time could also make a difference, but I haven't gotten to it yet. So this is only my theory and not proven yet.
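Untested, but to sketch what I mean: something like a per-metricset template that also enables index sorting on the timestamp. This assumes a version with the `index.sort.*` settings (6.0 or later); the template name and index pattern are made up.

```python
# Unverified sketch: per-metricset index template with index-time sorting on
# @timestamp. Assumes Elasticsearch 6.0+; names and values are placeholders.
import requests

template = {
    "index_patterns": ["metricbeat-cpu-*"],   # hypothetical per-metricset naming
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 2,
        "index.sort.field": "@timestamp",
        "index.sort.order": "desc",
    },
    "mappings": {
        "doc": {                               # single mapping type, 6.x style
            "properties": {"@timestamp": {"type": "date"}}
        }
    },
}

requests.put("http://localhost:9200/_template/metricbeat-cpu", json=template)
```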
I think your advice regarding rollover indices is good; that will allow us to have more evenly-sized indices once we start indexing by metricset.
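For my own notes, this is roughly how I expect to wire up rollover once we split by metricset. Alias and index names are placeholders, and the conditions are arbitrary examples (I believe the max_size condition needs 6.1+, so I've used max_age/max_docs here).

```python
# Rough sketch of rollover for one metricset; names and conditions are placeholders.
import requests

ES = "http://localhost:9200"

# Bootstrap: first index in the series, with the write alias attached.
requests.put(f"{ES}/metricbeat-cpu-000001", json={
    "aliases": {"metricbeat-cpu-write": {}}
})

# Called periodically (e.g. from cron): roll over when a condition is met.
requests.post(f"{ES}/metricbeat-cpu-write/_rollover", json={
    "conditions": {
        "max_age": "1d",
        "max_docs": 50000000,
    }
})
```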
I wrote a simple Python script to benchmark the query response time of our ES cluster for some representative aggregation queries. I have left that running periodically so I can measure the performance impact of any changes going forward.
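Not the actual script, but the gist of it looks something like this (endpoint, index pattern, and the sample query are placeholders): run a few representative aggregations and record wall-clock time plus ES's own "took" value.

```python
# Simplified version of the benchmark loop; queries and endpoint are placeholders.
import time
import requests

ES = "http://localhost:9200"

QUERIES = {
    "cpu_7d_hourly": {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-7d"}}},
        "aggs": {
            "per_hour": {
                "date_histogram": {"field": "@timestamp", "interval": "1h"},
                "aggs": {"cpu": {"avg": {"field": "system.cpu.user.pct"}}},
            }
        },
    },
}

for name, body in QUERIES.items():
    start = time.perf_counter()
    resp = requests.post(f"{ES}/metricbeat-*/_search", json=body)
    elapsed = time.perf_counter() - start
    took_ms = resp.json().get("took")
    print(f"{name}: wall={elapsed:.2f}s took={took_ms}ms")
```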
I am going to see what shrinking older indices to 1 shard does to performance. Should I use forceMerge at all during the process or does shrinking already combine the Lucene segments?
I'm really curious about the results here. Would you mind sharing them later?
Based on the docs, I assume a force merge does not happen as part of the shrink, but I'm not sure about the inner workings. So using force merge after the shrink could be beneficial, especially as no writes happen to this index anymore (I assume).
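If you do try it, the force merge itself is just one call against the shrunken index (the index name below is a placeholder), e.g.:

```python
# Force-merge a read-only, already-shrunk index down to a single segment.
# Only worth doing once the index no longer receives writes.
import requests

requests.post(
    "http://localhost:9200/metricbeat-2018.01.01-shrunk/_forcemerge",
    params={"max_num_segments": 1},
)
```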