Thanks @wa7son. I did look over the Common Problems document. The document implies that if we receive only 503 return codes, an Elasticsearch disk may be full. We are receiving only 503 responses when the error occurs, but I've verified we have plenty of space on our Elastic Stack cluster.
It sounds a little like the APM Server can't keep up with the amount of data that's being sent to it. Do you know approximately how many HTTP requests and events it's receiving?
If that's the issue, the recommended solution is to spin up multiple APM Servers behind a load balancer.
Hi @wa7son, I'm leaning towards the same conclusion. Perhaps our single APM Server cannot keep up with the requests. In our production cluster, we've received right at 56 million events over the past 24 hours. Over the past week, we've received almost 240 million events.
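For a rough sense of scale, those totals can be turned into average events per second. This is only a back-of-the-envelope sketch: the function name is made up for illustration, the weekly figure is rounded to 240 million, and averages hide traffic peaks, so the real burst rate is likely higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_second(total_events, days):
    """Average event rate over the given number of days."""
    return total_events / (days * SECONDS_PER_DAY)


# Totals quoted above; the weekly number is "almost 240 million", rounded here.
daily_rate = events_per_second(56_000_000, 1)
weekly_rate = events_per_second(240_000_000, 7)

print(f"last 24h: ~{daily_rate:.0f} events/s")
print(f"last 7d:  ~{weekly_rate:.0f} events/s")
print(f"last 24h vs weekly average: {daily_rate / weekly_rate:.2f}x")
```

So the last day averaged roughly 650 events per second, noticeably above the weekly average of roughly 400.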
The defaults are rather conservative. It's almost certainly worth tuning your APM Server settings depending on what kind of hardware you are running APM Server on.
You can directly increase the size of the queue that's being filled up using queue.mem.events. You'll also want to adjust your workers accordingly. That document also describes many other tunables you can adjust for your workload.
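As a rough sketch of how the queue and worker settings relate: one common rule of thumb, which I'm assuming here rather than quoting, is to size queue.mem.events at about output.elasticsearch.worker multiplied by output.elasticsearch.bulk_max_size. The helper name and the numbers below are hypothetical, not recommendations for your hardware; the printed lines use the dotted-key form that apm-server.yml accepts.

```python
def suggested_settings(workers, bulk_max_size):
    """Derive a queue size from worker count and bulk size.

    Assumed rule of thumb: queue.mem.events ~= workers * bulk_max_size.
    """
    return {
        "output.elasticsearch.worker": workers,
        "output.elasticsearch.bulk_max_size": bulk_max_size,
        "queue.mem.events": workers * bulk_max_size,
    }


# Hypothetical starting values; adjust based on what monitoring shows.
settings = suggested_settings(workers=4, bulk_max_size=2048)

for key, value in settings.items():
    print(f"{key}: {value}")
```

Treat the output as a starting point and change one setting at a time so you can see its effect.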
I'd also suggest enabling monitoring if possible to help guide some of these changes.