I have set up an Elasticsearch cluster to handle logs from a Kubernetes (K8s) cluster hosting over 500 applications. Retention policies need to be configured per application, with a default retention of 20 days and extended periods for specific applications.
The Elasticsearch cluster comprises 9 nodes (16 vCPUs and 64 GiB of RAM each): 5 combined master+hot nodes and 4 data nodes. Each application has a dedicated index, which results in over 500 new indices and roughly 1.5 TB of logs per day.
Logs are initially written to the hot nodes and, using Index Lifecycle Management (ILM), are moved to the data nodes after one day.
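For context, that lifecycle corresponds to an ILM policy roughly like the sketch below. This is illustrative, not our exact config: the policy name and the `data: warm` node attribute are assumptions, and the per-application extended retentions would be variants of this policy with a longer `delete.min_age`.

```
# sketch only: policy name and the "data: warm" node attribute are assumptions
PUT _ilm/policy/logs-default-20d
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "1d",
        "actions": {
          "allocate": { "require": { "data": "warm" } }
        }
      },
      "delete": {
        "min_age": "20d",
        "actions": { "delete": {} }
      }
    }
  }
}
```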
With this setup, I end up with around 11K indices and more than 25K shards in total, and I frequently run into Elasticsearch rate-limit issues. Is there a better way to architect this scenario to improve performance and scalability?
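For anyone wanting to reproduce the numbers above, the index and shard counts come from the standard stats and cat APIs, e.g.:

```
# total index and shard counts for the cluster
GET _cluster/stats?filter_path=indices.count,indices.shards.total

# per-shard view, largest stores first
GET _cat/shards?v&s=store:desc
```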
Hello @carly.richmond, thank you for responding. We have only hot and warm tiers. All the data is written to the hot tier, moved to the warm tier after a day, and deleted after 20 days. Additionally, we take a snapshot to S3 every day.
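The daily snapshot is scheduled through SLM against an S3 repository registered beforehand. A minimal sketch follows; the policy name, repository name, schedule, and retention values here are placeholders rather than our production settings:

```
# assumes an S3 snapshot repository (e.g. "s3_logs_repo") was already
# registered via PUT _snapshot/s3_logs_repo with the S3 client configured
PUT _slm/policy/daily-s3-snapshot
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "s3_logs_repo",
  "config": {
    "indices": ["*"],
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d"
  }
}
```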