50 percent degradation in API performance while using APM agent


(Akash Patel) #1

Hi,

We are currently in the process of adding Elastic APM. But after performance testing the API with the APM agent enabled, the throughput of the API is reduced by 50%.

Throughput before adding APM agent - 15200/sec
Throughput after adding APM agent - 8100/sec

APM agent configuration:

  logLevel: 'info',
  serverTimeout: '10s',
  captureExceptions: true,
  sourceLinesErrorAppFrames: 5,
  sourceLinesErrorLibraryFrames: 0,
  captureErrorLogStackTraces: 'messages',
  captureSpanStackTraces: false,
  stackTraceLimit: 15,
  transactionSampleRate: 1,
  captureBody: false,
  instrument: true,
  disableInstrumentations: ['redis', 'mysql'],
  transactionMaxSpans: 50,
  apiRequestTime: '10s',
  apiRequestSize: '750kb'

We tried several different configurations, but there wasn't much improvement in throughput.

Lowering transactionSampleRate, apiRequestTime, and apiRequestSize in particular gave only an insignificant improvement.

Do we need to tune the apm-server configuration?
Is there anything else that needs to be tuned in the APM agent?

Please let me know if you need more info on this.

Kibana version: 6.5.4

Elasticsearch version: 6.5.4

APM Server version: 6.5.4

APM Agent language and version: Node.js, 2.1.0

Thanks


(Ronald Tumulak) #2

Have you tried scaling out Elastic APM? Intake API bandwidth can be a problem at the volumes above. Even assuming a 10% sampling rate, that is still about 800 transactions sent per second, easily exceeding 800 kB/s.
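Back-of-the-envelope, the intake volume can be sketched like this (the ~1 kB average payload per serialized transaction is an assumption; actual size depends on spans and metadata):

```javascript
// Rough intake estimate for the numbers in this thread.
const throughputPerSec = 8100;    // observed API throughput with the agent on
const sampleRate = 0.1;           // hypothetical 10% sampling
const avgTransactionBytes = 1024; // assumed ~1 kB per serialized transaction

const sampledPerSec = Math.round(throughputPerSec * sampleRate);
const intakeBytesPerSec = sampledPerSec * avgTransactionBytes;

console.log(`${sampledPerSec} tx/s, ~${intakeBytesPerSec / 1024} kB/s to the intake API`);
```

So even at 10% sampling the intake API has to absorb on the order of 800 kB/s, before protocol overhead.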

Also, do you see CPU or memory-bound bottlenecks on the instrumented services? You might want to play around with those settings.


(Akash Patel) #3

Hi,

Thanks for your time

For now we are running APM Server as a single instance on one machine with 40 cores and 512 GB of RAM.
I think you are talking about this parameter, right?

You said scaling out, so should I try multiple instances of APM Server on the same machine, or each instance running on a different machine?

I thought of running multiple instances of APM Server on the same machine and load balancing them with NGINX. Will this help?
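Something like the following is what I had in mind (a sketch; the instance count and ports 8201-8203 are assumptions):

```nginx
# Three APM Server instances on one machine (ports 8201-8203),
# fronted by NGINX on the default APM port 8200.
upstream apm_servers {
    least_conn;
    server 127.0.0.1:8201;
    server 127.0.0.1:8202;
    server 127.0.0.1:8203;
}

server {
    listen 8200;
    location / {
        proxy_pass http://apm_servers;
    }
}
```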

Thanks


(Ronald Tumulak) #4

You can try multiple instances of the APM Server on the same machine and load balance them from the front. I've done it with HAProxy in front of three dockerised APM Servers on a single machine, and it improved our APM intake throughput. It also helped that we set up a proper Elasticsearch cluster: 3 master nodes, 4 data nodes. Keep monitoring the resources to see whether you are CPU, memory, or network bound. In our case we did a bit of custom ingest plus GeoIP and user-agent processing, so CPU use was a bit high.


(Akash Patel) #5

Thanks for your guidance. Will try the same, and also monitor CPU+Mem usage.

Meanwhile,

Can you give me some suggestions to begin the scaling with?
Assume the machine has 40 cores and 512 GB of RAM.

  • How many instances of APM Server should I run to begin with?

  • What values should I set for the following properties in the YAML file for each instance to begin with?
    queue.mem.events, output.elasticsearch.bulk_max_size, output.elasticsearch.worker

  • Do I need to change any other properties in the YAML file in order to optimize APM Server?
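For reference, this is roughly the per-instance configuration I'd be tuning (a sketch; all values and hostnames are placeholders to adjust against the load test):

```yaml
apm-server:
  host: "0.0.0.0:8201"      # unique port per instance behind the balancer

queue.mem.events: 8192      # in-memory event buffer per instance

output.elasticsearch:
  hosts: ["es-node1:9200"]  # hypothetical Elasticsearch host
  worker: 4                 # concurrent bulk workers
  bulk_max_size: 5120       # events per bulk request
```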

Thanks


(Ronald Tumulak) #6

To be honest, the right number really depends on the amount and type of instrumentation you are doing. Trial and error is the way to go: start small, run the load test, then observe CPU and RAM along with I/O.

On my laptop (a 2018 MacBook with 32 GB RAM), I have 3 APM Servers running with 2 GB heap each and a 7-node Elasticsearch cluster. It can easily ingest a combined 4000 calls per minute of a fairly complex transaction (distributed across six Java services) at a 100% sampling rate.


(Akash Patel) #7

Sure. Will try it out.

Thanks


(Stephen Belanger) #8

I would turn the sample rate way down. A good balance is aiming for 50-100 traces per second. Setting sourceLinesErrorAppFrames to 0 and captureErrorLogStackTraces to 'never' might also help.
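As a sketch, those suggestions map onto the agent config like this (the 0.01 rate is only an example: at roughly 8100 requests/sec, a ~1% sample lands near the suggested 50-100 traces per second):

```javascript
// elastic-apm-node start options (sketch; merge with your existing config)
const apm = require('elastic-apm-node').start({
  transactionSampleRate: 0.01,        // ~81 traces/sec at 8100 req/sec
  sourceLinesErrorAppFrames: 0,       // skip source-line collection for errors
  captureErrorLogStackTraces: 'never'
});
```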


(Akash Patel) #9

Sure will try this out and let you know. Thanks


(Felix Barnsteiner) #10

Did you try out the suggestions? Did they help to recover the throughput?