I have a requirement where multiple hosts (approximately 120) send data via Filebeat. The volume is higher during business hours and lower at night. Total average ingest is 35 GB per day (1.5 million records per day).
Peak load is 200 MB of data arriving from a few servers within 5 minutes.
The setup I am considering (in AWS) is:
Filebeat > Logstash > Elasticsearch
Logstash will have two instances (r5a.xlarge, i.e. 4 vCPU, 32 GB RAM).
Elasticsearch will have 4 nodes (m5.xlarge.elasticsearch, i.e. 4 vCPU, 16 GB RAM) with a 750 GB EBS volume attached to each instance.
My requirement is to have data available as near to real time as possible. Is this configuration good enough, or do I need to bring in a solution like Redis for buffering, or add more Logstash/Elasticsearch servers? A sketch of the Filebeat side I have in mind is below.
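Something like this on each host, assuming the two Logstash instances sit behind placeholder hostnames (`logstash-1.internal` / `logstash-2.internal`):

```yaml
# filebeat.yml (sketch; hostnames and port are placeholders)
output.logstash:
  # point every Filebeat at both Logstash instances and
  # load-balance events across them
  hosts: ["logstash-1.internal:5044", "logstash-2.internal:5044"]
  loadbalance: true
  # extra workers per host can help absorb the business-hours peaks
  worker: 2
```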
There is no way for anyone to answer that. For one thing, you haven't said what your logstash pipelines are doing, or even why you are putting logstash between filebeat and elasticsearch. The cost of pipelines varies enormously.
The only way to know the answer is to try it. Build a solution, measure its throughput, then scale the part of the solution that is a bottleneck.
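For example, Logstash's monitoring API (port 9600 by default) reports per-pipeline event counts and per-plugin timings; sampling it before and after a load test gives you a concrete throughput number to compare as you scale:

```
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'
```

The `events.in`/`events.out` counters and the `duration_in_millis` reported for each filter plugin will show you which stage is the bottleneck.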
My JBoss/WildFly logs are multiline and are not fully qualified JSON, since each entry also carries fields like the timestamp, Java thread name, class name, etc. Logstash will break those multiline logs into proper JSON, and those fields will be used to build visualizations in Kibana.
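As a rough sketch of what I mean: the multiline joining would happen in Filebeat, and the parsing in a Logstash filter along these lines (the log path, timestamp layout, and grok pattern are assumptions about the format, not my exact configuration):

```yaml
# filebeat.yml input (sketch): join stack-trace lines onto the
# preceding line that starts with a timestamp
filebeat.inputs:
  - type: log
    paths:
      - /opt/wildfly/standalone/log/server.log   # assumed path
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after
```

```
# Logstash filter (sketch): split a WildFly-style line such as
# "2023-01-05 10:15:30,123 INFO [com.example.Foo] (thread-1) {...json...}"
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{JAVACLASS:class}\] \(%{DATA:thread}\) %{GREEDYDATA:json_payload}"
    }
  }
  json {
    source => "json_payload"   # parse the trailing JSON into fields
    target => "payload"
  }
  date {
    match => ["timestamp", "yyyy-MM-dd HH:mm:ss,SSS"]
  }
}
```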
I understand that it will be a matter of trial and error to find the most suitable configuration, but is there any baseline available for Logstash or Elasticsearch?