Virtual Machines or Containers?

Hello everyone,

I am currently working on my final-year project, which focuses on designing and automating the deployment of a highly available and fault-tolerant ELK stack (Elasticsearch, Logstash, Kibana).

The goal of this project is to build a resilient architecture capable of handling failures while ensuring service continuity. It also involves full automation using Infrastructure as Code tools (Terraform and Ansible), as well as implementing failover scenarios (node failures, recovery using snapshots, etc.) and monitoring (Metricbeat / Prometheus / Grafana).

While studying the official documentation, I noticed recommendations about running Elasticsearch as a dedicated service and avoiding resource contention with other heavy applications.

This brings me to my main question:

From a production and best-practices perspective, what is the recommended approach for deploying an ELK cluster:

  • Virtual Machines (VMs)

  • Containers (Docker and Kubernetes)

More specifically:

  • Is one approach more suitable for ensuring high availability and fault tolerance?

  • How do resource isolation, performance, and storage management compare between the two?

  • Are there known limitations or risks when running Elasticsearch in containers for production workloads?

Additionally, I would greatly appreciate any advice regarding:

  • Recommended architecture for HA Elasticsearch clusters (node roles, sizing, distribution)

  • Best practices for Logstash and Kibana high availability setups

  • Common pitfalls to avoid when designing such a system

Thank you in advance for your insights and guidance.

I personally lean toward VMs. Elasticsearch is quite sensitive to resource stability (disk I/O, memory, CPU), and VMs tend to give you more predictable performance and simpler resource isolation. You also avoid the extra abstraction layer that containers add, which can make troubleshooting and tuning harder. Containers are great for portability, quick spin-up, and automation (I've found it much faster to get a cluster running with Docker), but for longer-term testing and for experimenting with failure scenarios, I've had better results with VMs.
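To illustrate the quick-spin-up point: a throwaway single-node Elasticsearch for local testing is only a few lines of Docker Compose. This is a test-only sketch, not a production config; the image tag is a placeholder, and security is deliberately disabled here, which you would never do in production:

```yaml
# docker-compose.yml -- single-node Elasticsearch for LOCAL TESTING ONLY.
# Image tag is a placeholder; pin it to the version your project targets.
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.14.0
    environment:
      - discovery.type=single-node      # skip cluster formation (test only)
      - xpack.security.enabled=false    # plain HTTP; NEVER in production
      - ES_JAVA_OPTS=-Xms1g -Xmx1g      # fixed heap, <= 50% of container RAM
    ulimits:
      memlock:
        soft: -1                        # allow memory locking
        hard: -1
    ports:
      - "9200:9200"
```

After `docker compose up -d`, a `curl http://localhost:9200` should return the cluster info JSON. For anything resembling your HA scenario, though, you would need multiple nodes, persistent volumes, and proper discovery settings, which is exactly where the container-vs-VM trade-offs start to matter.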

What part of this project is automated? Are you using Ansible to spin up ES nodes?

Bare metal would be an even better option in that regard, though @saba_kallel seems to have ruled it out.