Elasticsearch Disk space issues

Hello,

I am having issues managing the amount of space elasticsearch is taking up on each of the nodes in the cluster. Right now we have 5 ES nodes, one of which is a dedicated master node; the rest are master-eligible data nodes. 3 of the nodes have 250 GB of space and the other 2 have 500 GB. With this it seems we can only keep 4 days of logstash data and 2 days of metricbeat data. We have roughly 5 metricbeat indexes (74 hosts) and 4 different logstash indexes (10 hosts). This seems like a very short retention period. We are using best_compression and the number of replicas is set to 1. Is there anything recommended to increase the retention of the data?

+1 I'd like to know about best practices for optimal utilization of disk space as well

In my experience compression improves with shard size, so having reasonably large shards may improve how much data the cluster can hold. You may also want to optimise your mappings, as described in this blog post, even if it is getting a bit old.
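
If it helps to see it concretely, here is a minimal sketch of how larger shards and the compression codec could be configured through the Beats-managed index template. This assumes Metricbeat 6.x, where setup.template.settings is available in metricbeat.yml; on older versions the template is managed differently, so treat this as illustrative only:

# metricbeat.yml (sketch, assuming Metricbeat 6.x)
setup.template.settings:
  # Fewer primary shards per daily index means larger shards, which tend to compress better
  index.number_of_shards: 1
  # Keep the best_compression codec already in use
  index.codec: best_compression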

I suspect that this is not a storage efficiency issue so much as an issue of simply collecting too much data. From my experience with the default metricbeat config, the data volume you are experiencing is about right. In fact it is a little lower than I would expect, which is probably related to the increased compression you are using. A few things to consider...

1. By default metricbeat collects metrics every 10s. Do you really need that granularity?
period: 10s

There is certainly a trade-off to consider. Granularity can be great for troubleshooting purposes... IF you have the skills to interpret what the data is telling you... otherwise it is a waste of resources. The simple fact is that many organizations lack the operational sophistication to take advantage of much of the data they have available.

I have been in the infrastructure management space for a couple of decades. Back in the 80s and 90s, 15-minute collection cycles were very common. Even a couple of years ago 5 minutes was totally acceptable. Today 3 minutes seems to be the minimum expectation, with 1 minute for specific use-cases. Everyone wants the ability to collect metrics at greater granularity for short time spans, but not permanently.

You have to be honest with yourself... do you really need 10s granularity? If you bump this up to 60s, you still have a fairly granular view that will be more than sufficient for most troubleshooting and capacity planning use-cases. You will also have increased your retention from 2 days to 12 days... voila!
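
As a rough sketch, the collection period is set per module; depending on your metricbeat version this lives in modules.d/system.yml or under metricbeat.modules in metricbeat.yml (the metricsets below are illustrative, not your actual config):

# modules.d/system.yml (illustrative)
- module: system
  metricsets: ["cpu", "load", "memory", "network", "process"]
  # collect every 60s instead of the default 10s - roughly 6x less data
  period: 60s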

2. Do you need data for all processes?
processes: ['.*']

By default metricbeat will produce metrics for all processes. In Linux every thread is handled as a process, which, combined with all of the behind-the-scenes activity in a modern OS, can result in a very large number of processes. This means a very large amount of data. In fact, I would bet that the metricbeat data for processes is by far the largest share of the data you are collecting. Consider this list of processes metricbeat picked up from a simple CentOS VM running MySQL...

agetty
ata_sff
auditd
bioset
crond
crypto
dbus-daemon
deferwq
fsnotify_mark
ipv6_addrconf
irqbalance
kauditd
kblockd
kdevtmpfs
khugepaged
khungtaskd
kintegrityd
kmpath_rdacd
kpsmoused
ksmd
ksoftirqd/0
ksoftirqd/1
kswapd0
kthreadd
kthrotld
kworker/0:0
kworker/0:0H
kworker/0:1
kworker/0:1H
kworker/0:2
kworker/1:0
kworker/1:0H
kworker/1:1H
kworker/1:2
kworker/u4:0
kworker/u4:2
master
md
metricbeat
migration/0
migration/1
mpt/0
mpt_poll_0
mysqld
mysqld_safe
netns
NetworkManager
nfit
ntpd
packetbeat
pickup
polkitd
qmgr
rcu_bh
rcu_sched
rsyslogd
scsi_eh_0
scsi_eh_1
scsi_eh_2
scsi_tmf_0
scsi_tmf_1
scsi_tmf_2
snmpd
sshd
systemd
systemd-journal
systemd-logind
systemd-udevd
ttm_swap
tuned
vmtoolsd
watchdog/0
watchdog/1
writeback
xfs-buf/sda1
xfs-buf/sda3
xfs-cil/sda1
xfs-cil/sda3
xfs-conv/sda1
xfs-conv/sda3
xfs-data/sda1
xfs-data/sda3
xfs-eofblocks/s
xfs-log/sda1
xfs-log/sda3
xfs-reclaim/sda
xfs_mru_cache
xfsaild/sda1
xfsaild/sda3
xfsalloc

Which of these processes do we really care about? Or perhaps the better question... which of these processes do we have the operational sophistication to actually do anything about? Personally, I am looking only at...

crond
metricbeat
mysqld
mysqld_safe
packetbeat
rsyslogd
sshd

By specifying exactly the processes I am interested in, I can cut down from 90 processes to 7. In my environment, with a default metricbeat configuration collecting data on all processes, the process-related data is 70% of all of the data collected. By focusing on only the processes I care about (and can do something about), I can cut this down to about 15% of the total data. If you can achieve a similar reduction, you can stretch those 12 days to over 30 days.
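
Here is a hedged sketch of what that filter could look like; the processes option takes a list of regular expressions, and the names below are just my example, so substitute whatever matters in your environment:

# modules.d/system.yml (sketch)
- module: system
  metricsets: ["process"]
  period: 60s
  # only report on the processes we can actually act on
  processes: ['mysqld', 'mysqld_safe', 'sshd', 'crond', 'rsyslogd', 'metricbeat', 'packetbeat']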

Conclusion
While the initial setup of metricbeat is easy, there is a difference between quickly showcasing its capabilities (which is what the default config really does) and deploying a comprehensive solution to monitor and manage your infrastructure. Elastic Stack provides some great building blocks for such a solution, but it needs to be deployed in line with the requirements of your organization.

I hope you find this useful.

Rob


Thank you Rob, I will make the changes you suggested and see if that helps.
