Hi, I am considering the ELK stack for a large telco opportunity in the MEA region, as the proposed log management solutions (Humio, Splunk Enterprise, etc.) are turning out to be expensive in the overall solution. Please find my requirements below. I would appreciate it if someone from Elastic could guide me on the design considerations so that I neither oversize nor undersize the environment.
Data ingestion is estimated to be 1TB/day for production and 500GB/day for PoC environment.
The data retention will be 30 days for production and PoC sites.
I would like to go with ingest nodes (rather than Logstash) in the solution approach.
Storage for all the nodes (master, data, ingest, coordinating) will come from external SAN storage.
Logs will be gathered with Filebeat from the target environment.
The ELK stack needs to be highly available and should support N+1 redundancy in the production site. PoC site does not require any HA/redundancy and can tolerate failures.
Please let me know in case you need more details.
I would like to know how many VMs I need to carve out, and the specifications (CPU/RAM/disk) of those VMs, for both the production and PoC sites.
We're happy to help here; however, if this is an important business decision/investment, you should really reach out to us via https://www.elastic.co/contact, as we can put you in contact with the best resources.
One point of a PoC is sizing, so build it on the small side and see how it works. I'm designing as I write, so no guarantees.
PoC:
If your SAN can handle the IO, I'd build fewer, bigger nodes. Elasticsearch makes it easy to add nodes, if you can afford them. For 500GB a day, replicated, you need 30TB for a month's data. Elasticsearch can handle 5-7TB (or maybe up to 10TB) of data per node, but you can't run the disks over 85% full. Start with 5 data nodes with 8TB disk each, 64GB RAM (31GB Elasticsearch heap), and, as a guess, 8 CPUs if more can be added easily. I would use the data nodes for ingest, and use ILM to target a 50GB shard size and delete at 30 days. I would try letting all the data nodes ingest, with clients load balancing across all 5. I would keep replication on in the PoC, at least at first, to get sizing predictions for production. You will need to study the heap used by ingest vs. everything else; your ingest design drives its CPU and heap.
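To make the arithmetic and the ILM piece concrete, here is a rough Python sketch against the plain REST API using requests. The host, the commented-out credentials, and the policy name poc-logs-policy are placeholders I've assumed, and the math treats on-disk size as roughly equal to raw ingest, which is something the PoC should verify:

```python
import requests

# Back-of-the-envelope storage check for the PoC numbers above.
daily_gb = 500          # PoC ingest per day
retention_days = 30
replicas = 1            # one replica => two copies of every shard
watermark = 0.85        # don't plan to run disks past ~85%

need_tb = daily_gb * retention_days * (1 + replicas) / 1000   # ~30 TB
usable_per_node_tb = 8 * watermark                            # ~6.8 TB of an 8 TB disk
print(f"need ~{need_tb:.0f} TB total, ~{need_tb / usable_per_node_tb:.1f} data nodes at 8 TB each")

# ILM policy: roll over around a 50 GB primary shard, delete at 30 days.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                }
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(
    "http://localhost:9200/_ilm/policy/poc-logs-policy",
    json=policy,
    # auth=("elastic", "..."),  # add credentials/TLS to match your cluster
)
resp.raise_for_status()
print(resp.json())
```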
Allocate 3 dedicated masters; they can be much smaller and only need an OS disk. I'd set up Kibana on servers running Elasticsearch coordinating-only (client) nodes to handle queries from Kibana and any other query load, one or more as needed.
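Once the PoC cluster is up, a quick way to sanity-check that the roles came out as intended (3 dedicated masters, data/ingest nodes, coordinating-only nodes next to Kibana) is the _cat/nodes API; a minimal sketch, assuming the same placeholder host as above:

```python
import requests

# node.role shows a letter per role (m = master-eligible, d = data, i = ingest, ...);
# a coordinating-only node shows just "-", so the layout is easy to eyeball.
resp = requests.get(
    "http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,ram.percent"
)
print(resp.text)
```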
If the PoC turns out overbuilt, drop a data node, but watch the disk space. Run for 30 days; if it looks good, double the data nodes for production.
The number of tenants makes a difference: one index ingesting vs. many different indices. If there is one tenant/index, shard it more to spread the load. If there are many, shard them less, as the multiple indices will spread the load anyway.
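To illustrate the single-tenant case, here is a hedged sketch of a composable index template that spreads one busy index across the five data nodes and attaches the ILM policy from earlier. The pattern logs-poc-*, the rollover alias, and the template name are assumptions; with many tenant indices you would drop number_of_shards to 1 or 2 per index instead:

```python
import requests

template = {
    "index_patterns": ["logs-poc-*"],
    "template": {
        "settings": {
            "number_of_shards": 5,       # roughly one primary per data node for a single hot index
            "number_of_replicas": 1,
            "index.lifecycle.name": "poc-logs-policy",
            "index.lifecycle.rollover_alias": "logs-poc",
        }
    },
}

resp = requests.put(
    "http://localhost:9200/_index_template/poc-logs-template", json=template
)
resp.raise_for_status()
print(resp.json())
```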