We're using metricbeat to stream various system usage metrics to the Elasticsearch server. The problem is that the index gets quite large (7+ GB per week), so we are considering removing some of the metricbeat event fields. However, it's hard to predict the exact impact of those potential changes.
Can anyone please advise on how to estimate how much disk space a specific field (or set of fields) uses in a particular index?
It's really, really hard to estimate, unfortunately.
Lucene uses a number of tricks to compress fields, and these compression tricks depend in large part on what kind of data is being indexed. For example, high-cardinality fields take up more space than low-cardinality fields, because low-cardinality fields compress better. Numerics are smaller than strings, `scaled_float` values are smaller than floats, which are smaller than doubles, etc.
And then it gets more complicated, because different compression strategies are used depending on the data in each segment, and that can change as segments are merged. For instance, two medium-ish cardinality segments may merge into one segment and form a high-cardinality set, changing the compression scheme. Or two segments may merge and vastly reduce their on-disk footprint due to mutual compression.
If you wanted to experiment, you could use the Reindex API to copy a single field from your existing data over to an isolated test index. Because that index holds only the one field, its size gives you a very good estimate of the field's footprint. Rinse and repeat for the various fields. We have an internal tool to estimate field sizes... and it basically does exactly that.
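A sketch of that experiment, assuming a hypothetical metricbeat index pattern and field name (substitute your own):

```
POST _reindex
{
  "source": {
    "index": "metricbeat-*",
    "_source": ["system.cpu.total.pct"]
  },
  "dest": {
    "index": "field-size-test"
  }
}
```

The `_source` filter in the `source` block keeps only the listed field(s) in the copied documents. Once the reindex finishes, force-merge the test index (`POST field-size-test/_forcemerge`) so the segment count stabilizes, then check its size with `GET _cat/indices/field-size-test?v&h=index,store.size`.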