We're evaluating an S3-based buffer architecture to address circuit breaker issues and reduce costs in our ECK deployment. Would really appreciate insights from anyone who's implemented similar patterns.
The Challenge
Running ECK 8.14 on AWS EKS with significant stability issues:
Circuit breakers triggering 2-3 times daily during traffic spikes (63GB variations)
75-78% heap pressure causing cluster instability
Rising PVC (EBS) storage costs
Data loss during Elasticsearch downtime
Proposed Architecture
Moving from direct ingestion to S3-buffered approach:
The architecture uses S3 as an intermediate buffer layer between data sources and Elasticsearch
FluentD Forwarders (DaemonSet) collect logs from pods
FluentD Aggregator (StatefulSet) writes to S3
OpenTelemetry Collector exports APM data to S3
Buffer Layer:
S3 bucket with lifecycle policies
SNS notifications on new objects
SQS queue for rate-controlled processing (see the wiring sketch below)
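Roughly, the buffer-layer wiring we have in mind looks like the sketch below (boto3; the bucket name, queue ARN and prefixes are placeholders, and the SNS fan-out step is folded into a direct S3-to-SQS notification for brevity):

```python
# Sketch of the buffer layer: lifecycle expiry plus object-created notifications.
# Bucket name, queue ARN and prefixes are placeholders.
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

BUCKET = "observability-buffer"
QUEUE_ARN = "arn:aws:sqs:eu-west-1:123456789012:log-ingest-queue"

# Buffered objects only need to live until they are indexed, so expire them quickly.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-buffered-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 3},
            }
        ]
    },
)

# Notify the queue on every new object. The SQS queue policy must allow
# s3.amazonaws.com to send messages; an SNS topic in between works the same way.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": QUEUE_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "logs/"}]}},
            }
        ]
    },
)
```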
Processing:
FluentD/Logstash polls SQS
Batch processes S3 objects
Controlled indexing to Elasticsearch (see the consumer sketch below)
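A minimal sketch of the processing side follows (boto3 plus the Python Elasticsearch client; the queue URL, index name and newline-delimited JSON format are assumptions, and in practice FluentD/Logstash or Elastic Agent would fill this role):

```python
# Poll SQS, fetch the referenced S3 objects, and bulk-index them.
# Assumes notifications go straight from S3 to SQS; an SNS envelope
# would add one more json.loads on the "Message" field.
import json
from urllib.parse import unquote_plus

import boto3
from elasticsearch import Elasticsearch, helpers

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/log-ingest-queue"
INDEX = "logs-buffered"  # placeholder index/data stream name

sqs = boto3.client("sqs", region_name="eu-west-1")
s3 = boto3.client("s3", region_name="eu-west-1")
es = Elasticsearch("https://elasticsearch:9200", api_key="...")

def actions_from_object(bucket: str, key: str):
    """Yield one bulk action per NDJSON line in the S3 object."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    for line in body.decode("utf-8").splitlines():
        if line.strip():
            yield {"_index": INDEX, "_source": json.loads(line)}

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
            # helpers.bulk is where the "controlled indexing" knob lives
            # (chunk_size, and how many of these consumers run in parallel).
            helpers.bulk(es, actions_from_object(bucket, key), chunk_size=1000)
        # Only acknowledge the message once indexing has succeeded.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```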
Design Considerations
1. Latency Trade-off
Moving from ~2 seconds (direct) to 30-60 seconds (buffered). Is this acceptable for most use cases? Our monitoring team is okay with it for logs, but uncertain about APM data impact.
What is the daily volume you have for APM data? Also, is this 63 GB or TB? 63 GB per day is pretty small for Elasticsearch; using S3 as a buffer in this case seems unnecessary, and you may be able to fix your performance issues in other ways.
What issue exactly are you having and what are the resources of your nodes?
Also, how did you determine that your issue is related to indexing and not to search?
I agree with Leandro, this sounds like pretty light load, you shouldn't need anything so complex to deal with it. You already have some client-side buffering (log files are naturally buffered anyway and APM traces should be buffered in the collector) to smooth out any peaks, but maybe you need to adjust the config in this area to make better use of it. That's definitely what I'd investigate before introducing so much other operational complexity into the system.
This version is over a year old and there have been improvements in this area since then - at least #113044 applies more effective backpressure when overwhelmed by spikes in indexing. You're due an upgrade.
I think I'm missing something here - you're proposing S3 as a temporary buffer for the data on its way into Elasticsearch, but the PVCs in the ES cluster relate to the permanent storage of the data which would be the same either way.
Searchable snapshots will be worthwhile at this kind of scale.
Are you sure you need replication, or do you just want to do cross-region searches? Cross-cluster search is much simpler, and how we handle this kind of global data from our own internal systems within Elastic. FWIW our internal systems don't have this kind of separate buffering layer, it's all done with client-side buffering.
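For reference, cross-cluster search only needs the remote cluster registered under an alias (via the cluster.remote.<alias> settings) and the remote indices prefixed with that alias; a minimal sketch with the Python client, where the alias, endpoint and index pattern are made up:

```python
# Cross-cluster search sketch: "eu-cluster" is a placeholder remote cluster alias,
# registered in cluster settings (cluster.remote.eu-cluster.*), not in the query.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://local-cluster:9200", api_key="...")

# Query local and remote indices in one request by prefixing the
# remote index pattern with the cluster alias.
resp = es.search(
    index="logs-*,eu-cluster:logs-*",
    query={"range": {"@timestamp": {"gte": "now-15m"}}},
    size=50,
)
print(resp["hits"]["total"])
```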
Hello @leandrojmp, and thank you.
Those 63 GB are in a test environment with 5 separate deployments of our 20-microservice solution.
We expect a larger amount of data, and we will move to multi-cluster EKS.
My concern is about PVC cost (we are deploying our eck-stack on EKS); those PVCs are backed by EBS PVs, so we expect costs to increase. That's why I am thinking about a workaround: using AWS S3 not only as a buffer layer but as our primary storage, since we can control ingestion with SNS/SQS and so reduce the retention period on the PVCs.
It could also be a workaround for multi-region observability.
What do you think? @DavidTurner, thank you as well for your quick response.
In fact it's 1 master node with 3 data nodes, and it's a test environment.
When we move to production and QA environments we will use 3 master nodes as you said, but for now I am testing and researching with the minimum of resources, so that when we scale up we can maintain our stack with the maximum of optimization.
Please, as you are an expert, what are your recommendations for performance tuning and memory management of Elasticsearch nodes (master and data)?
Also, what are the best practices when configuring indices?
I don't see how S3 would have any impact on this. Your primary storage would be the storage for your Elasticsearch data nodes, and that cannot be on S3. Also, for better performance you need fast storage, so you need something backed by fast disks like NVMe, at least for the hot tier.
It is not clear what exactly your issue is, as you didn't provide the specs of your cluster or the errors you are getting; there are many things you can do to tune the cluster.
Having a buffer between the source of the data and Elasticsearch is pretty common. I use Kafka as a buffer layer in some data ingestion flows, but not for everything; I also have thousands of agents sending data directly to Elasticsearch.
If you send your data to S3 buckets, you would also need an SQS queue configured to receive notifications of new files, and use the Elastic Agent to get the data and send it to Elasticsearch. This is a pretty common scenario; multiple services ship data directly to S3 buckets (Cloudflare, GitHub, AWS, etc.), and you can use Elastic Agent to get it from there.
Without SQS you cannot consume your data in parallel, and depending on the volume of the data you may need multiple agents to consume it, which would require multiple VMs/Pods.