Since I don't think it's been mentioned in this post yet, some drawbacks that should be made clear:
- Logging often makes use of the source IP (this is particularly true of LAN-based logging protocols such as syslog; less so of logging-as-a-service). Reverse-proxying this Beats traffic will cause Logstash (beats input) to see all traffic as coming from Nginx
- You won't be able to use SSL/TLS client certificates for authentication if Nginx terminates the TLS connection, since the backend never sees the client's certificate...
- This is very unlikely to be the happy path with regard to tools like Fleet and the Elastic Agent, if that is of interest.
- An alternative design, often desired with TCP reverse-proxying in general, is to use the PROXY protocol, which preserves the original client address. However, both Nginx (or HAProxy, where the protocol originated) and the backend input (the Beats input) need to support it... and that functionality is not yet implemented in the stock Logstash inputs.
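To make the first drawback concrete, here is a minimal sketch of what such a TCP reverse-proxy might look like in Nginx's `stream` context (the hostnames and ports are assumptions, not from the original post). Note the commented-out `proxy_protocol` line: enabling it would preserve client IPs, but only helps if the backend can parse it, which, as noted above, the stock Logstash Beats input currently cannot:

```nginx
# nginx.conf — stream context (layer 4), not the http context
stream {
    upstream logstash_beats {
        # hypothetical backend Logstash nodes running the beats input
        server logstash1.example.com:5044;
        server logstash2.example.com:5044;
    }

    server {
        listen 5044;
        proxy_pass logstash_beats;
        # proxy_protocol on;   # would prepend the real client address,
                               # but the Logstash beats input can't parse it
    }
}
```

With this in place, every connection reaching Logstash has Nginx's address as its source IP, which is exactly the logging drawback described in the first bullet.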
Alternative designs include:
- using floating IPs (eg. Pacemaker) to float various dedicated service IPs around a cluster [while not really necessary for Beats, this does prove very useful for other types of logging endpoints, such as for Syslog]
- using DNS Round Robin (really not recommended, as a lot of clients don't tend to look up the DNS name very often, particularly network equipment).
- GSLB (see DNS Round Robin, just run faster; unless your head is already firmly stuck in the clouds, in which case it could well be your best friend.)
- layer-4 load-balancing, which is effectively playing tricks with NAT.
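As a sketch of the floating-IP approach, assuming a Pacemaker/Corosync cluster managed with `pcs` (the resource name and address below are made up for illustration):

```
# Create a floating service IP that Pacemaker will move between
# cluster nodes on failure; Beats clients point at this address.
pcs resource create beats-vip ocf:heartbeat:IPaddr2 \
    ip=192.0.2.10 cidr_netmask=24 \
    op monitor interval=30s
```

Because the address moves with the service, the backend still sees real client source IPs, sidestepping the reverse-proxy drawback entirely.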
Designs to avoid:
- anything where you point clients at Elasticsearch directly [this is a concern regarding trust boundaries]
- anything where you point clients at Kafka directly [concern regarding trust, versioning and robustness]
I would strongly urge you to ensure that you have some sort of persistent queue, either the persistent queues Logstash provides, or a tool like Kafka (which brings other architectural opportunities).
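For the Logstash-native option, enabling the persistent queue is a settings change; a minimal sketch (the size shown is an arbitrary example, not a recommendation):

```
# logstash.yml (or per-pipeline in pipelines.yml):
# switch the in-memory queue to a disk-backed one
queue.type: persisted
queue.max_bytes: 4gb   # cap on disk usage; size to your burst tolerance
```

This buffers events on disk so that a downstream outage (e.g. Elasticsearch being unavailable) doesn't immediately push back onto, or drop data from, the clients.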
My own experience tells me that Pacemaker is really quite useful for floating some IPs around a small cluster; I use that to surface Logstash (beats input), as well as various rsyslogd instances. All of these put their data into Kafka, which then gets consumed by another Logstash that does the enrichment and passes it into Elasticsearch. This will very likely be too big and complex for small deployments, but it gives a lot of flexibility and control, particularly if you want a place to filter data before sending it into the likes of Elastic Service.

Elasticsearch indexing should be the slowest part of your pipeline, so I wouldn't be concerned about scaling out Logstash for performance reasons (just know how not to make regular expressions that suck).
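The collector/enricher split described above can be sketched as two Logstash pipelines; everything here (broker names, topic, hosts) is hypothetical, and the real configs would carry TLS and authentication settings as well:

```
# Pipeline 1 (collector, hypothetical): Beats in, raw events to Kafka
input {
  beats { port => 5044 }
}
output {
  kafka {
    bootstrap_servers => "kafka1:9092"
    topic_id          => "logs-raw"
  }
}

# Pipeline 2 (enricher, hypothetical): Kafka in, enrich, Elasticsearch out
input {
  kafka {
    bootstrap_servers => "kafka1:9092"
    topics            => ["logs-raw"]
  }
}
filter {
  # grok / mutate / geoip etc. enrichment goes here
}
output {
  elasticsearch { hosts => ["https://es1:9200"] }
}
```

The nice property of this design is that Kafka decouples ingestion from enrichment: the enrichment tier can be restarted, rescaled, or fall behind without pushing back on the clients.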
PS. Also worth noting is that 'Beats' is being renamed to 'Elastic Agent'... no idea what implications that would have protocol-wise.
If you haven't already, check your thinking against Deploying and Scaling Logstash | Logstash Reference [7.14] | Elastic