We have a self-hosted Elasticsearch cluster that accepts log feeds from multiple sources. In a recent incident, a problem with a single source brought down the entire cluster and caused data loss for all the other sources.
Due to a Logstash configuration error, one feeding source suddenly started sending a huge amount of data within a short period. We have a large disk buffer, but it was still exhausted. Every node's disk filled up and the cluster ground to a halt. It took some time to fix the problem, and the incoming data during that gap was lost.
It feels like a problem with a single source should not be able to take down the entire cluster, so we want to look into restricting the total disk usage of each source's indices.
On our cluster, each source creates one index per day, following the naming pattern source_name-YYYY-MM-DD, so we can get the total disk usage per source with "du -chs /indices/source_name-* | grep total".
Tentatively, we are thinking of a watchdog script that checks this usage and closes or deletes the indices of any source that exceeds its quota; a rough sketch is below. Is there an existing tool for this?
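To make the idea concrete, here is an untested sketch of the kind of watchdog I have in mind; the quota, the source names, and the Elasticsearch address are just placeholders:

```bash
#!/usr/bin/env bash
# Untested sketch of a per-source disk quota watchdog.
# Assumes index data lives under /indices, indices follow the
# source_name-YYYY-MM-DD pattern, and Elasticsearch listens on localhost:9200.

QUOTA_KB=$((50 * 1024 * 1024))   # example quota: 50 GB per source, in KB
SOURCES="sourceA sourceB"        # placeholder source names

for source in $SOURCES; do
  # Total on-disk size of all daily indices for this source, in KB.
  used_kb=$(du -cks /indices/"${source}"-* | awk '/total$/ {print $1}')

  if [ "${used_kb:-0}" -gt "$QUOTA_KB" ]; then
    # The daily date suffix sorts lexicographically, so the first entry
    # is the oldest index for this source.
    oldest=$(ls -d /indices/"${source}"-* | sort | head -n 1 | xargs basename)

    # Close the oldest index over quota...
    curl -s -XPOST "http://localhost:9200/${oldest}/_close"
    # ...or delete it outright, which is what actually frees disk space:
    # curl -s -XDELETE "http://localhost:9200/${oldest}"
  fi
done
```

One open question with this approach: closing an index keeps its segment files on disk, so in the disk-full scenario we would probably have to delete the oldest indices rather than just close them.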
I am also wondering how Elasticsearch Cloud handles the disk usage issue, since there are presumably multiple clients on their shared hosts. Does it deploy one virtual machine per client, or something similar?
Feedback and pointers will be highly appreciated.
With multiple feeding sources, is it more common to use separate nodes for each source? We have security requirements as well. If we move in that direction, is there a way to build an internal cloud service, for example with Docker?
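To clarify what I mean by that, I am imagining something like one single-node Elasticsearch container per source, each with its data directory on its own size-limited mount; the paths, port, and image tag below are only for illustration:

```bash
# Illustration only: one Elasticsearch container per source, with its data
# directory on a dedicated, size-limited mount (e.g. a separate partition or
# LVM volume sized to that source's quota), so one source filling its disk
# cannot affect the others. Assumes the official image's default data path.
docker run -d --name es-sourceA \
  -p 9201:9200 \
  -v /mnt/es-quota/sourceA:/usr/share/elasticsearch/data \
  elasticsearch:2.4   # example image tag
```

Each source's quota would then be enforced at the filesystem level rather than inside Elasticsearch itself.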