ELK architecture advice with S3

I am new to ELK and am planning to stream logs from multiple application servers into S3 and then have Logstash read them from there, so I need some advice on what that architecture could look like. I have gone through some documentation and resources on the internet, but with S3 I am still unsure which architecture to go for.

If I were to use Filebeat on the app servers, my architecture could look like this:
App server (Filebeat) --> Queue (Redis) --> Logstash --> Elasticsearch --> Kibana

But since my logs are already being streamed to S3, do I really need a queuing system like Redis? Since it is Logstash that pulls the data from S3, it controls when to pull and when not to, so no log event should be missed. Let me know if this understanding is wrong.

So, I am planning to start a production setup with an architecture like this:

S3 (multiple directories) --> Logstash (1 server) --> Elasticsearch (3 nodes) --> Kibana (1 server with an Elasticsearch client)

I have the following questions regarding this architecture:

  1. To get better latency (the time between an event being logged on the app server and being indexed in ES), which is the safer option: S3 or Filebeat?
  2. If I go for S3 (even if it is slower), do I need a queuing system? If so, can I put it on the same server as Logstash?
  3. If I use only 1 Logstash instance, is there a risk that the setup stops being a real-time log analysis system? Should I use 2 Logstash instances so that, even if one goes down, the other can still read logs from S3?
  4. If I use 2 Logstash instances, should I use 2 queues, or would 1 queue be enough to give me high availability?
  5. For this load (25 GB/day, or 50 GB/day with replication; retention = 60 days), would it be OK to start the cluster with 3 nodes, i.e. 2 serving as master+data nodes and 1 as a dedicated master? Or do I need 5 nodes (3 dedicated masters and 2 data nodes)?
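
As a rough sanity check on the figures in question 5, the raw storage can be estimated as below. This is only a sketch: the function name is made up for illustration, and it assumes the data is spread evenly over 2 data nodes and ignores indexing overhead and free-disk headroom.

```python
# Back-of-the-envelope storage estimate from the numbers in question 5.
# Assumptions (not measured): even distribution across data nodes,
# no indexing overhead, no free-disk headroom.

def retained_storage_gb(daily_gb: float, copies: int, retention_days: int) -> float:
    """Total GB held on disk once the retention window is full."""
    return daily_gb * copies * retention_days

DAILY_GB = 25        # raw log volume per day
COPIES = 2           # primary + 1 replica (25 GB/day -> 50 GB/day)
RETENTION_DAYS = 60
DATA_NODES = 2       # assumed number of data nodes

total = retained_storage_gb(DAILY_GB, COPIES, RETENTION_DAYS)
per_node = total / DATA_NODES
print(f"{total} GB total, {per_node:.0f} GB per data node")
# prints "3000 GB total, 1500 GB per data node"
```

So with 2 data nodes, each one would need to hold roughly 1.5 TB at steady state, before any overhead.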

Any suggestions?

On the masters front: the answer is always "3 dedicated master nodes", and anything else is an architectural compromise. The rationale is:

Dedicated, because asking a master node to also perform indexing or searching duties can put it under stress. A master node performs an important (but lightweight) managerial function, so it should not be loaded the way data nodes are; otherwise the cluster can become unresponsive.
3, because 1 is a single point of failure and 2 has the potential for split-brain.
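
The "3" can be illustrated with simple majority arithmetic. The sketch below is not Elasticsearch code; the function names are hypothetical, and it only assumes that a master is elected by a majority of the master-eligible nodes.

```python
# Majority (quorum) arithmetic for master-eligible nodes.
# Illustrative only: these helpers are not part of any Elasticsearch API.

def quorum(master_eligible: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible // 2 + 1

def tolerated_failures(master_eligible: int) -> int:
    """Master-eligible nodes that can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)

for n in (1, 2, 3):
    print(f"{n} master-eligible: quorum={quorum(n)}, "
          f"can lose {tolerated_failures(n)}")
```

With 1 or 2 master-eligible nodes, no failure can be tolerated while still requiring a majority. With 2 nodes, relaxing the quorum to 1 keeps the cluster available but lets each side of a network partition elect its own master, which is the split-brain case. 3 is the smallest count that survives the loss of one node with a true majority.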

Of course, not everyone builds a cluster with 3 dedicated master nodes, but it is important to know the trade-offs you are making when you stray from this advice.
