Advice regarding Metricbeat sharding and retention

(Matt Hagenbuch) #1


I have a largish(?) metricbeat installation which collects from ~2000 machines minutely, generating a ~100GB index with 135M docs per day. Currently I have the indexes set to 5 shards with 2 replicas, which performs well enough when querying a few days of data. In this context, I have two questions:

  1. Would you recommend splitting the daily indexes by metricset to reduce the sizes so that aggregations over longer time periods (>7 days) run more efficiently? If I split by metricset, should I adjust the number of shards?
  2. What combination of forceMerge/shrink operations would be appropriate to consolidate older metricbeat data? I found this article, but it seems targeted at ES 2.0; is this advice still relevant?

Thanks in advance for any advice you can give!

(ruflin) #2

Great to hear that you deployed Metricbeat at this scale.

For question 1:

  • What do your queries look like?
  • What is the size of each of your indices?
  • How many metricsets do you use and which ones? (shard explosion can be an issue here)

For question 2:

(Matt Hagenbuch) #3

Hi ruflin, thanks for responding so quickly!

Most of our current queries against metricbeat come from kibana/grafana dashboards, so they are aggregations over host filters and time buckets. I expect most queries would only request data from a single metricset.

Each daily metricbeat index is about 100GB with 135M documents.

We use the diskio, filesystem, process, network, and cpu system metricsets. Your question led me to look at a histogram of doc counts for each of these over one day, and the result was a little surprising, though it makes sense in hindsight:

diskio: 63M
filesystem: 22M
process: 22M
network: 14M
cpu: 2.5M
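(For anyone wanting to reproduce these counts, a terms aggregation on `metricset.name` like the one below should produce the same breakdown; the index name here is just an example.)

```
GET /metricbeat-2018.01.01/_search
{
  "size": 0,
  "aggs": {
    "by_metricset": {
      "terms": { "field": "metricset.name" }
    }
  }
}
```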

As for shrinking shards, I did look into this but it seems that since we currently have 5 shards per index the only option would be to shrink to 1 shard, is that correct?

Thanks again!

(ruflin) #4

As you have quite a bit of data I'm wondering if you should perhaps start using rollover instead of daily indices. This would give you a more predictable size for the indices.
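A minimal rollover setup might look like the following (index and alias names are just examples; you would write through the alias and call _rollover periodically, e.g. from cron or Curator):

```
PUT /metricbeat-system-000001
{
  "aliases": { "metricbeat-system-write": {} }
}

POST /metricbeat-system-write/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_docs": 50000000
  }
}
```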

As you have only 5 metricsets, it would probably be an improvement to have one index per metricset, as the data would also be stored closer together. I have wanted to run some benchmarks for this in Rally for quite some time, to see whether sorting on ingest time could also make a difference, but haven't gotten to it yet. So this is only my theory, not proven.

From 5 you can only go to 1, correct, since the target shard count must be a factor of the source's. But once indexing to the index is finished, I wonder if this would be an issue? There is also a shard-splitting feature coming; please read its limitations.
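For reference, a shrink of a daily index to 1 shard would look roughly like this: first make the index read-only and relocate all shards to one node (the node name is an example), then run the shrink:

```
PUT /metricbeat-2018.01.01/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "shrink-node-1",
    "index.blocks.write": true
  }
}

POST /metricbeat-2018.01.01/_shrink/metricbeat-2018.01.01-shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}
```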

(Matt Hagenbuch) #5

I think your advice regarding rollover indices is good; that will allow us to have more evenly-sized indices once we start indexing by metricset.

I wrote a simple python script to benchmark the query response time of our ES cluster for some representative query aggregations. I have left it running periodically so I can measure the performance impact of any changes going forward.
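The core of such a script can be quite small; here is a sketch of the timing harness (in real use, `search_fn` would POST a representative aggregation to the cluster's `_search` endpoint; the stand-in lambda is only there to show the mechanics):

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(search_fn: Callable[[], None], runs: int = 5) -> Dict[str, float]:
    """Run search_fn `runs` times and summarize the observed latencies."""
    latencies: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        search_fn()
        latencies.append(time.perf_counter() - start)
    return {
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }


if __name__ == "__main__":
    # Stand-in for an actual query against the cluster.
    stats = benchmark(lambda: time.sleep(0.01), runs=3)
    print(stats)
```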

I am going to see what shrinking older indices to 1 shard does to performance. Should I use forceMerge at all during the process, or does shrinking already combine the Lucene segments?

(ruflin) #6

I'm really curious about the results here. Would you mind sharing them later?

Based on the docs I assume a force merge does not happen as part of the shrink, but I'm not sure about the inner workings. So running a force merge after the shrink could be beneficial, especially as no writes happen to the index anymore (I assume).
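That would be a single call on the shrunken index (index name is an example); since the index no longer receives writes, merging down to one segment per shard is the usual choice:

```
POST /metricbeat-2018.01.01-shrunk/_forcemerge?max_num_segments=1
```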

(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.