We run a self-hosted Elasticsearch cluster that accepts log feeds from multiple sources. In a recent incident, a problem with a single source brought down the entire cluster and caused data loss for all the other sources.
Due to a Logstash configuration error, one source suddenly flooded the cluster with a huge amount of data in a short period. We have a large disk buffer, but it was still exhausted. Every node ran out of disk space and the cluster ground to a halt. It took some time to fix the problem, and the incoming data from that window was lost.
It feels like a problem with a single source should not take down the whole cluster, so we want to look into capping the total disk usage of each source's indices.
On our cluster, each source creates one index per day, named source_name-YYYY-MM-DD, so we can get the total disk usage of a source with “du -chs /indices/source_name-* | grep total”.
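To make that concrete, here is a small sketch that reports usage for every source at once. The /indices path and the source_name-YYYY-MM-DD naming are taken from our setup; INDICES_DIR is just a hypothetical override so the script can be tried elsewhere.

```shell
#!/usr/bin/env bash
# Sketch: report total on-disk size per source, assuming all indices live
# under one directory and are named source_name-YYYY-MM-DD.
export LC_ALL=C                       # keep du's "total" line in English
INDICES_DIR="${INDICES_DIR:-/indices}"

usage_per_source() {
  # Derive the set of source names by stripping the -YYYY-MM-DD suffix
  ls "$INDICES_DIR" 2>/dev/null | sed 's/-[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}$//' | sort -u |
  while read -r src; do
    # Human-readable grand total across all daily indices of this source
    total=$(du -chs "$INDICES_DIR/$src"-* 2>/dev/null | grep 'total$' | cut -f1)
    echo "$src $total"
  done
}

usage_per_source
```

Run periodically (e.g. from cron), this would give a per-source baseline before any quota enforcement is added.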
Tentatively, we are thinking of a watchdog script that checks this and closes or deletes the indices of any source that exceeds its quota. Is there an existing tool for this?
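A minimal sketch of what we have in mind, to make the question concrete. The quota value, the /indices path, and the ES endpoint are all placeholders for illustration; the script drops the oldest daily indices first, since the YYYY-MM-DD suffix sorts chronologically.

```shell
#!/usr/bin/env bash
# Watchdog sketch: once a source's indices exceed a per-source quota,
# close its oldest daily indices until it is back under the limit.
# INDICES_DIR, QUOTA_KB and ES_URL are assumptions -- adjust to taste.
INDICES_DIR="${INDICES_DIR:-/indices}"
QUOTA_KB="${QUOTA_KB:-10485760}"      # per-source quota in KiB (10 GiB here)
ES_URL="${ES_URL:-http://localhost:9200}"
DRY_RUN="${DRY_RUN:-0}"               # 1 = only print what would be closed

check_source() {
  local source="$1" used dir idx size
  # Sum the on-disk size (KiB) of all daily indices for this source
  used=$(du -ks "$INDICES_DIR/$source"-* 2>/dev/null | awk '{s+=$1} END {print s+0}')
  if [ "$used" -le "$QUOTA_KB" ]; then return 0; fi
  # Over quota: plain sort walks the indices oldest-first
  for dir in $(ls -d "$INDICES_DIR/$source"-* 2>/dev/null | sort); do
    idx=$(basename "$dir")
    size=$(du -ks "$dir" | cut -f1)
    if [ "$DRY_RUN" = "1" ]; then
      echo "would close $idx ($size KiB)"
    else
      # Closing stops reads/writes but keeps data on disk;
      # switch to "curl -XDELETE" if the goal is to reclaim space.
      curl -s -XPOST "$ES_URL/$idx/_close" >/dev/null
    fi
    used=$((used - size))
    if [ "$used" -le "$QUOTA_KB" ]; then break; fi
  done
}

if [ $# -ge 1 ]; then check_source "$1"; fi
```

One caveat we are aware of: closing an index does not free disk space, only deleting does, so the close/delete choice above is really a retention-policy decision.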
I am also curious how Elasticsearch Cloud handles disk usage, given that multiple clients presumably share its hosts. Does it deploy one virtual machine per client, or something similar?
Feedback and pointers would be highly appreciated.