Hello everyone,
I am currently working on my final-year project, which focuses on designing and automating the deployment of a highly available and fault-tolerant ELK stack (Elasticsearch, Logstash, Kibana).
The goal of the project is to build a resilient architecture that can withstand failures while maintaining service continuity. It also involves full automation using Infrastructure as Code tools (Terraform and Ansible), implementing failover scenarios (node failures, recovery from snapshots, etc.), and setting up monitoring (Metricbeat / Prometheus / Grafana).
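For context, the snapshot-based recovery I have in mind follows the standard Elasticsearch snapshot API. A minimal sketch of what I plan to automate (the repository name `my_backup`, the path `/mnt/es_backups`, and the `localhost:9200` address are placeholders; `path.repo` must already point at the shared location in `elasticsearch.yml` on every node):

```shell
# Register a shared-filesystem snapshot repository (placeholder name/path).
curl -X PUT "localhost:9200/_snapshot/my_backup" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/es_backups"}}'

# Take a snapshot of the cluster and wait for it to finish.
curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"

# After a simulated node/data loss, restore from the snapshot.
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore"
```

The idea is to wrap these calls in Ansible tasks so failover drills are repeatable.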
While studying the official documentation, I noticed recommendations about running Elasticsearch as a dedicated service and avoiding resource contention with other heavy applications.
This brings me to my main question:
From a production and best-practices perspective, what is the recommended approach for deploying an ELK cluster:
- Virtual Machines (VMs)
- Containers (Docker / Kubernetes)
More specifically:
- Is one approach more suitable for ensuring high availability and fault tolerance?
- How do resource isolation, performance, and storage management compare between the two?
- Are there known limitations or risks when running Elasticsearch in containers for production workloads?
Additionally, I would greatly appreciate any advice regarding:
- Recommended architecture for HA Elasticsearch clusters (node roles, sizing, distribution)
- Best practices for Logstash and Kibana high-availability setups
- Common pitfalls to avoid when designing such a system
Thank you in advance for your insights and guidance.