Metricbeat - Sparsity - Best Practices

Hi All,

I have a question about storing the metrics of various applications monitored by Metricbeat in the same Metricbeat index. For example, suppose I am monitoring system, Apache, Nginx, and MongoDB using Metricbeat. From a query performance and best practices perspective, would it be fine/advised to push all the metrics into the same metricbeat-weekly index? My question is primarily about sparsity. My mappings would have fields for all applications, but each of my documents contains only a subset of the metrics. Wouldn't that lead to a sparsity issue when storing doc_values?

This is no longer an issue as of 6.0. Along with ES 6.0 came Lucene 7, which added support for sparse doc values. This eliminates the storage overhead for "placeholders" where no data exists. So storage volume is reduced, which also frees up page cache space for more real data. IIRC, testing with Metricbeat resulted in an approx. 30% reduction in storage, with a related increase in performance.

Thanks @rcowart for the reply.
I have two questions:

  1. Is it a best practice to store metrics of different applications in the same index, for scaling and performance?
  2. I am currently using ES 5.6.5. On 5.6.5, how much of an issue would sparsity be for query performance, and what should the index creation strategy be: group everything into a single index, or use one index per application type with each index having a single primary shard?

I really think it depends on the size of the indices. When an index becomes really large, indexing new data will become slower. Having lots and lots of small indices can also add unnecessary overhead, and problems with things like excessive open file handles.

For time series data like Beats data and logs, target index sizes of around 10-30GB. If you need indices to be larger, use weekly or monthly indices instead of daily. If you need them to be smaller, you could split data out into multiple indices.
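As a rough illustration of the sizing rule of thumb above, here is a minimal Python sketch that picks the index period whose estimated size lands closest to the 10-30GB target. The function names and the 30-day month are assumptions made for illustration; none of this is part of Metricbeat or Elasticsearch, and the daily volume is something you would have to measure on your own cluster.

```python
# Hypothetical helper for illustration only; not a Metricbeat or Elasticsearch API.
# Approximate number of days of data held by each index period (assumed values).
PERIOD_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}


def distance_to_range(size_gb, lo_gb, hi_gb):
    """Return how far size_gb falls outside [lo_gb, hi_gb], or 0.0 if inside."""
    if size_gb < lo_gb:
        return lo_gb - size_gb
    if size_gb > hi_gb:
        return size_gb - hi_gb
    return 0.0


def choose_period(daily_gb, lo_gb=10.0, hi_gb=30.0):
    """Pick the period whose estimated index size is closest to the target range.

    daily_gb is the measured primary data volume per day. Ties go to the
    shorter period because the dict is iterated in insertion order.
    """
    return min(
        PERIOD_DAYS,
        key=lambda name: distance_to_range(daily_gb * PERIOD_DAYS[name], lo_gb, hi_gb),
    )


if __name__ == "__main__":
    for volume in (0.5, 2, 20):
        print(volume, "GB/day ->", choose_period(volume))
```

If even the daily estimate is well above the range, the sketch still returns "daily"; that is the case where splitting the data into multiple indices comes in.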

If splitting different apps into different indices gives you optimal index sizes AND reduces sparsity, that is the best of both worlds.

Thanks @rcowart for your quick reply.

I have a question about the optimal index size for time series data.
In this blog I read the following:

TIP: Small shards result in small segments, which increases overhead. Aim to keep the average shard size between a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.

So my question is: if we go with 10-30GB per index, won't we be left with many shards?

The point is... if splitting up the data per application means that you have a bunch of indices smaller than a few GB, then you are better off either increasing the time period of data in the indices (e.g. monthly instead of daily) or keeping all of the apps together.

If you have so much data that you would have lots of indices larger than 50GB, you will need to create a larger cluster (add more nodes).
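To make the trade-off concrete, here is a small Python sketch of that decision. It is only an illustration under stated assumptions: the function name is hypothetical, "a few GB" is taken to mean 3GB, and it only models the "keep all of the apps together" option (lengthening the time period is the other one). The per-app daily volumes are made-up inputs.

```python
# Hypothetical planning helper for illustration; thresholds are assumptions.
def plan_indices(app_daily_gb, period_days, min_gb=3.0, max_gb=50.0):
    """Decide between per-app indices and one combined index.

    app_daily_gb maps an app name to its measured daily volume in GB.
    Returns (layout, oversized), where oversized lists the indices whose
    estimated size exceeds max_gb and may call for a larger cluster.
    """
    # Estimated size of each per-app index over one index period.
    sizes = {app: gb * period_days for app, gb in app_daily_gb.items()}

    if any(size < min_gb for size in sizes.values()):
        # At least one per-app index would be tiny, so keep the apps together.
        layout = "combined"
        sizes = {"all": sum(sizes.values())}
    else:
        layout = "per-app"

    oversized = sorted(name for name, size in sizes.items() if size > max_gb)
    return layout, oversized


if __name__ == "__main__":
    small = {"system": 1.0, "apache": 0.2, "nginx": 0.1, "mongodb": 0.3}
    print(plan_indices(small, period_days=7))
    large = {"system": 2.0, "mongodb": 10.0}
    print(plan_indices(large, period_days=7))
```

With the small weekly volumes the per-app indices would be under a few GB each, so the sketch keeps everything in one index; with the larger volumes it splits per app and flags the index that would exceed 50GB.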
