This is to help you, but don't just do as I say; I'm not in your context. It's only meant to show you that ES is flexible: only a few designs are truly bad, and plenty of them are good IF they fit your use case. I'm also not a guru spreading truths.
I have 3 masters (c5.xlarge) across 3 AZs. (behind ALB to talk to them)
I have 2 coord (r5.xlarge) across 2 AZs. (behind ALB to talk to them)
I have 4 ingest (c5.2xlarge) across 2 AZs. (behind ALB to talk to them)
I have 20 data (i3.2xlarge) across 2 AZs. (local ephemeral SSD only for storage.)
I have 3 Kibana across 3 AZs (behind ALB to talk to them)
I have 3 Grafana across 3 AZs (behind ALB to talk to them)
I have 1 Cerebro...need to make it 3 AZs I guess
I have ~1200 hosts running Beats from AWS/Openstack/VMware/Physical.
I have custom lambdas scraping buckets, shipping stuff from AWS.
Small (for now...) army of Heartbeat (t3.nano) in all network scopes poking stuff, and each other.
I'm currently in the range of ~2 TB per day for ingestion.
The cluster currently has 29 nodes, ~425 indices, ~4000 shards, ~20 billion docs, ~25 TB.
Fairly small at this point still. I'm not sure I want to run huge clusters either.
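For a rough sense of scale, the totals above can be turned into per-node and per-shard averages. This is only a back-of-the-envelope sketch: it assumes the ~25 TB and ~4000 shards are cluster-wide totals (replicas included), that all shards live on the 20 data nodes, and that TB means decimal terabytes. The function name is made up for illustration.

```python
def cluster_averages(data_nodes, shards, docs, total_tb):
    """Return simple averages; real shards are never this evenly sized."""
    return {
        "shards_per_data_node": shards / data_nodes,
        "gb_per_shard": total_tb * 1000 / shards,  # decimal TB -> GB (assumption)
        "docs_per_shard": docs / shards,
        "tb_per_data_node": total_tb / data_nodes,
    }


# Figures quoted in the post (all approximate).
averages = cluster_averages(
    data_nodes=20, shards=4000, docs=20_000_000_000, total_tb=25
)

for name, value in averages.items():
    print(f"{name}: {value:,.2f}")
```

Under those assumptions that is about 200 shards and 1.25 TB per data node, with an average shard of roughly 6 GB and 5 million docs.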
I want to partition away to reduce blast radius, like logs/metrics or dev/staging/prod.
I'm not sure what the official recommendation from Elastic is, but I have enough IT expertise to know that even if I can run huge clusters, I would choose not to, by design: change rollouts, blast radius, cell isolation, region isolation, use case isolation, team isolation, etc. Which means partitioning into more than one cluster.
But licensing currently charges for the master nodes in addition to the data nodes. That means a direct license cost impact when using cluster partitioning, where 30 data + 3 masters would become (10+3)+(10+3)+(10+3). I don't work there, but I think Elastic should raise the node price and remove masters from the equation, in such a fashion that it means zero impact for them and I can do 10 clusters of 3+3 if that is what I want, without paying licenses for 30 master nodes. I would still get ingest/search over the same nodes, my 30 data nodes, but now I need to justify 60 nodes to my cost manager. I'm actually baffled by this blocker on cluster partitioning that we have with the normal per-node licensing type. I wake up at night because of my blast radius, lol. Riot, anyone?
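The licensing arithmetic above, as a small sketch. It assumes, as described in this post, that every master and every data node counts as one licensed node and that each cluster carries 3 dedicated masters; coord/ingest nodes are left out. The function name is hypothetical and this is not Elastic's actual pricing model.

```python
def licensed_nodes(total_data_nodes, clusters, masters_per_cluster=3):
    """Count licensed nodes when the same data nodes are split across clusters."""
    return total_data_nodes + clusters * masters_per_cluster


# Same 30 data nodes, partitioned into 1, 3 and 10 clusters.
for clusters in (1, 3, 10):
    total = licensed_nodes(total_data_nodes=30, clusters=clusters)
    print(f"{clusters:>2} cluster(s): {total} licensed nodes")
```

The data capacity never changes; only the master overhead grows: 33 nodes for one cluster, 39 for three, 60 for ten.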
It runs over AWS ECS (1 cluster) with 1 ASG per node type (4) and 1 container per instance (excluding Kibana and Grafana, which run like normal containerized microservices).
I don't use Logstash at all, but it's possible my future holds Logstash clusters doing duplication/buffering to disk (PQ)/archiving to S3/etc. It's also possible my future is Elastic Cloud or ECE or ECK (over AWS EKS). You should look into that. I don't think ECK is mature enough yet to use over EKS: it just launched, and I'm not an EKS/Kube expert, as we currently run heavily on AWS ECS+EC2. We'll see.
Version 5.6.16 currently; planning/doing the bump to the latest 6.x for all of the above.
Looking/planning for potential use of hot-warm, ILM (rollover + replacing Curator + hot-warm), metrics rollups, Beats monitoring, Auditbeat.
Thinking of passing everything through ingest nodes for centralization of transformation configs, control (why are you shipping me events stamped from a year ago, that'll kill me!) and monitoring of ingestion delay through injection of ingest timestamp, etc, etc.
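A minimal sketch of that idea, under assumptions. The pipeline body uses the ingest node `set` processor with the `{{_ingest.timestamp}}` metadata value; the field name `ingest_timestamp` and the 7-day cut-off are made up for illustration. The stale-event check is plain Python showing the logic, not an Elasticsearch API.

```python
from datetime import datetime, timedelta, timezone

# Pipeline body that could be PUT to _ingest/pipeline/<name>.
# "ingest_timestamp" is a hypothetical field name.
STAMP_PIPELINE = {
    "description": "Stamp each event with the time the ingest node received it",
    "processors": [
        {"set": {"field": "ingest_timestamp", "value": "{{_ingest.timestamp}}"}}
    ],
}

MAX_EVENT_AGE = timedelta(days=7)  # hypothetical cut-off, tune to your retention


def ingestion_delay(event_time, ingest_time):
    """How long the event took to reach the ingest node."""
    return ingest_time - event_time


def is_too_old(event_time, ingest_time, max_age=MAX_EVENT_AGE):
    """True when the event is stamped further in the past than max_age."""
    return ingestion_delay(event_time, ingest_time) > max_age


now = datetime(2019, 6, 1, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(seconds=30)
year_old = now - timedelta(days=365)

print(ingestion_delay(fresh, now), is_too_old(fresh, now))
print(ingestion_delay(year_old, now), is_too_old(year_old, now))
```

With both timestamps stored on the document, ingestion delay becomes something you can graph and alarm on, and year-old events can be rejected or rerouted before they touch old indices.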
- I'm thinking I need more cores on the data nodes (bump instance type...) as my contention today centers around CPU/load avg on the data nodes and no contention elsewhere: IO, HEAP, RAM, etc.
I would get more disk space too. I think I can take a smaller RAM/disk ratio because I don't have heap contention. (Except when someone bombs a coord node...)
- Retention: I need ILM+rollups+hot/warm badly so I can keep stuff longer without putting everything on i3 and without keeping metrics at 10-second granularity. Also to lower the shard count of read-only indices. "Gimme 15 months retention."
- Either ES snapshots or Logstash to send a raw copy of everything to S3 (data lake style) in parallel to the indexing pipeline. (I don't like ES snapshots too much for long-term archival/reuse: they are transformed, specialized, version specific, not ready for other uses, not a data lake, users can't restore/query them on demand, etc.)
- Alarms: pushback against the complexity, lack of user friendliness, knowledge required, lack of framework/direction, GUIs, etc. "Metrics+alarms would be easier for the devs in a specialized SaaS like Datadog with all the bells and whistles!", "ES is not for metrics/alarms!", "We need Prometheus!?", "Make all the alarms for them by code or give them a GUI they can use!", etc.
- Frequent upgrades/reconfigs of thousands of diverse Beats across versions and configs, needed to unlock main stack upgrades and features. (Might not be an issue in a more aligned, less diverse environment with less legacy and less spread across technologies and deployment methods.)
- How do I control rate limiting/quality/delay/overload effects during events/outages (indexing/query explosions)? ES doesn't autoscale and is not bottomless, and it's not always easy to know what's wrong or who's doing it. I get bombed by bad queries, bad dashboards, log explosions, etc. I can't be down when people NEED ES because something else is going wrong. "Kibana is down? Well... everything is at 100% with crazy load avg, let me check what's happening, and in the meantime it will probably come back on its own like it usually does. I do need to find who did what when to cause this... and then conclude there is not much I can easily do to prevent it."
- Others I'm probably forgetting right now; that's the job.
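For the retention item above, here is a rough sketch of what an ILM policy body for hot/warm plus a 15-month delete could look like, together with the rollup arithmetic. Everything specific is an assumption for illustration: the rollover thresholds, the 7-day move to warm, the `box_type` node attribute, and 15 months taken as 450 days. ILM only exists in later 6.x releases, so check the docs for the version you land on.

```python
import json

RETENTION_DAYS = 15 * 30  # "15 months" approximated as 450 days (assumption)

# Body that could be PUT to _ilm/policy/<name>; all values are placeholders.
ILM_POLICY = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    # Assumes warm nodes are tagged with node.attr.box_type: warm
                    "allocate": {"require": {"box_type": "warm"}},
                    "shrink": {"number_of_shards": 1},
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "delete": {
                "min_age": f"{RETENTION_DAYS}d",
                "actions": {"delete": {}},
            },
        }
    }
}


def points_per_day(interval_seconds):
    """Data points per metric series per day at a given granularity."""
    return 86_400 // interval_seconds


print(json.dumps(ILM_POLICY, indent=2))
print("10s granularity:", points_per_day(10), "points/day per series")
print("1h rollup:      ", points_per_day(3600), "points/day per series")
```

Going from 10-second points to an hourly rollup is 8640 versus 24 points per series per day, which is where most of the long-retention savings would come from.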
TLDR? You asked for it, twice...