Hello ELK community,
I'm currently implementing my first ELK infrastructure in production.
The plan is to ingest S3 logs sent from 54 source servers on which we have installed Filebeat.
We have 3 physical servers available, each with 1 TB of RAM and 2 x Intel(R) Xeon(R) Gold 6426Y CPUs with hyperthreading (16 physical cores, 32 threads, 2.5 GHz frequency).
Each server is also equipped with NVMe disks with a capacity of around 23 TB.
The plan is to build a cluster from these 3 servers, making all of them data + master nodes (2 data nodes per physical server). Logstash would also run on each server as the target for Filebeat; it would only do simple filtering (dropping fields) and format the events as JSON.
The cluster will be queried every 5 minutes by a NiFi PaginatedJsonQueryElasticsearch processor, fetching 10,000 lines each time over HTTPS with a keep-alive of 10 minutes. This was already in place, and I can only change the frequency and the keep-alive time; the queries have to go through NiFi for reasons specific to our organization.
The ingestion into the ELK cluster is continuous (around 1TB per day) with a retention of 7 days.
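To put rough numbers on the storage side, here is a back-of-the-envelope sketch in Python. The ingest, retention, server count and disk size come from the figures above; the replica count and the on-disk overhead factor are my assumptions (I have not measured how our logs compress once indexed), and "TB" is treated as a plain unit.

```python
# Rough storage estimate for the cluster described in this post.
DAILY_INGEST_TB = 1.0       # from the post: around 1 TB per day
RETENTION_DAYS = 7          # from the post
SERVERS = 3                 # from the post
DISK_PER_SERVER_TB = 23.0   # from the post: around 23 TB of NVMe per server

# Assumptions, NOT from the post:
REPLICAS = 1                # one replica copy of every primary shard
INDEX_OVERHEAD = 1.0        # on-disk size / raw size (unknown until measured)


def required_storage_tb(daily_tb, days, replicas, overhead):
    """Disk needed to hold `days` of data, including replica copies."""
    return daily_tb * days * (1 + replicas) * overhead


needed_tb = required_storage_tb(
    DAILY_INGEST_TB, RETENTION_DAYS, REPLICAS, INDEX_OVERHEAD
)
total_tb = SERVERS * DISK_PER_SERVER_TB
# Capacity left if one physical server (and both its data nodes) is lost.
degraded_tb = (SERVERS - 1) * DISK_PER_SERVER_TB

print(f"needed: {needed_tb:.1f} TB")
print(f"available with all servers: {total_tb:.1f} TB")
print(f"available with one server down: {degraded_tb:.1f} TB")
```

Under these assumptions disk capacity is not the tight constraint, even with one server down, so my concern is mostly about CPU, heap and I/O contention between the roles.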
For now I'm planning to run all roles of the cluster (including Logstash) on these physical machines. It seems to go against ELK best practices to have the master and data roles on the same host, but I'm wondering whether, with machines as powerful as these (they were previously used as hypervisors), this setup would be feasible and, above all, safe. Otherwise, we always have the option of using VMs to host some of the roles that are not disk-critical.
Indexing can be delayed by a couple of minutes if needed. What I'm mostly worried about is resiliency, and backpressure on the Filebeat agents reaching the point where it starts affecting the servers we are monitoring.
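To give an idea of how much data would pile up during such a delay, here is a small Python sketch. It assumes the roughly 1 TB per day arrives at a steady rate and is spread evenly across the 54 sources, which is a simplification (real traffic is likely bursty and uneven), and it uses decimal units (1 TB = 10^12 bytes).

```python
# Rough backlog estimate while indexing is stalled.
# Assumes steady, evenly spread ingestion, which is a simplification.
BYTES_PER_TB = 10**12
BYTES_PER_MB = 10**6
SECONDS_PER_DAY = 86_400

DAILY_INGEST_TB = 1.0   # from the post: around 1 TB per day
SOURCE_SERVERS = 54     # from the post


def ingest_rate_mb_per_s(daily_tb):
    """Average cluster-wide ingest rate in MB/s."""
    return daily_tb * BYTES_PER_TB / SECONDS_PER_DAY / BYTES_PER_MB


def backlog_mb(rate_mb_per_s, stall_seconds):
    """Data that accumulates upstream while nothing is indexed."""
    return rate_mb_per_s * stall_seconds


rate = ingest_rate_mb_per_s(DAILY_INGEST_TB)
stall_seconds = 2 * 60  # "a couple of minutes", taken here as 2 minutes

total_backlog = backlog_mb(rate, stall_seconds)
per_source_backlog = total_backlog / SOURCE_SERVERS

print(f"average ingest rate: {rate:.2f} MB/s")
print(f"backlog after {stall_seconds} s: {total_backlog:.0f} MB in total")
print(f"backlog per source server: {per_source_backlog:.1f} MB")
```

On average that is a modest amount per source server, so what I really need to understand is how Filebeat and Logstash behave during longer outages or bursts.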
TL;DR: How safe is it to run all ELK roles + Logstash on the same physical servers while they are continuously ingesting and being queried by NiFi?
Thank you & regards.