Intake very slow (more than 10 seconds)

Elastic Stack version 7.12.0 with Platinum license

APM Agent language and version: Node.js, 3.x (latest)

Running on Kubernetes (k8s).

Our Node.js application reports timeouts when sending APM events to the intake API. I set up instrumentation on apm-server and can see that POST /intake/v2/events sometimes takes more than 10s!

I don't see any problems on the ES cluster; the thread pools are fine.

Our apm-server.yml config file:

In the APM Server logs, I sometimes see "queue is full".

We use Stack Monitoring on a separate ES cluster:

I am not an expert on APM Server (yet!), but I thought the intake API should be very fast, with a queue between input and output?

I suspect long garbage-collection (GC) pauses:

@ebuildy Hi! I can't tell for sure from your data, but it is possible this is due to an issue in the Node.js APM agent: "Blocking Behavior under Benchmarking Load" (elastic/apm-nodejs-http-client issue #136 on GitHub).
That issue is "fixed", but the fix is not yet in a released APM agent. The work to get it into an APM agent release is here: "fix: blocking behaviour under load" (elastic/apm-agent-nodejs pull request #2024 by trentm on GitHub).

This Node.js APM agent issue can happen when the app using the agent is under fairly high load and/or the APM server is slow or unresponsive. If possible, you could try the trentm/blocking-behavior branch of elastic/apm-agent-nodejs on GitHub to see if that helps.

I don't know apm-server that well, so I can't rule out a server-side issue here as well. Seeing "queue is full" in the apm-server logs suggests that there might be one.
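To illustrate why "queue is full" points at the server side, here is a minimal TypeScript sketch of a bounded in-memory queue sitting between an intake handler (producer) and an output (consumer). It is a simplified model, not APM Server's actual code: `BoundedQueue` and `simulate` are hypothetical names, and rejecting immediately on a full queue (rather than waiting) is an assumption made to keep the sketch short.

```typescript
// Hypothetical, simplified model of a bounded queue between intake and output.
// This is NOT APM Server's implementation; it only shows the general mechanism.
class BoundedQueue<T> {
  private readonly items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Assumption: reject instead of blocking when full, like a "queue is full" error.
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) {
      return false;
    }
    this.items.push(item);
    return true;
  }

  // Remove up to `max` items, as an output would per bulk request.
  drain(max: number): T[] {
    return this.items.splice(0, max);
  }
}

// Run `ticks` rounds: the intake offers `inPerTick` events, then the output
// drains at most `outPerTick`. Returns how many events were rejected.
function simulate(capacity: number, inPerTick: number, outPerTick: number, ticks: number): number {
  const queue = new BoundedQueue<number>(capacity);
  let rejected = 0;
  let next = 0;
  for (let t = 0; t < ticks; t++) {
    for (let i = 0; i < inPerTick; i++) {
      if (!queue.offer(next++)) {
        rejected++;
      }
    }
    queue.drain(outPerTick);
  }
  return rejected;
}
```

With a capacity of 100 and 50 events arriving per round, `simulate(100, 50, 50, 20)` rejects nothing, while `simulate(100, 50, 40, 20)` starts rejecting in the seventh round and then rejects 10 events per round (140 in total). A queue only absorbs short bursts; a sustained gap between intake rate and output rate eventually surfaces as "queue is full".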

Those requests taking 10s may well be normal behaviour. They are long-running requests that an APM agent keeps open while streaming data up using chunked transfer encoding. The agent's apiRequestTime config option (see "Configuration options" in the APM Node.js Agent Reference [3.x]) defaults to 10s. That is the time after which the agent closes an intake request to the APM server and starts a new one.
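The timing described above can be sketched with a simplified TypeScript model. It assumes, hypothetically, that the agent opens an intake request when an event arrives and none is open, streams every later event into that same request, and ends it exactly apiRequestTime after it opened. The real agent may also end requests for other reasons, so treat this as an illustration only; `groupIntoRequests` and `IntakeRequest` are hypothetical names.

```typescript
// Hypothetical, simplified model of how events map onto long-lived intake
// requests. Only the apiRequestTime limit is modelled.
interface IntakeRequest {
  openedAt: number; // ms
  closedAt: number; // ms
  events: number;
}

function groupIntoRequests(eventTimesMs: number[], apiRequestTimeMs: number): IntakeRequest[] {
  const requests: IntakeRequest[] = [];
  let current: IntakeRequest | undefined = undefined;
  for (const t of eventTimesMs.slice().sort((a, b) => a - b)) {
    // The open request has reached apiRequestTime: it was closed before this event.
    if (current !== undefined && t >= current.closedAt) {
      requests.push(current);
      current = undefined;
    }
    // No request open: start a new one that will be closed after apiRequestTime.
    if (current === undefined) {
      current = { openedAt: t, closedAt: t + apiRequestTimeMs, events: 0 };
    }
    current.events++;
  }
  if (current !== undefined) {
    requests.push(current);
  }
  return requests;
}
```

For one event per second over 25 seconds and an apiRequestTime of 10s, this model yields three requests carrying 10, 10 and 5 events, each staying open for 10 000 ms from the server's point of view. That is why POST /intake/v2/events durations clustering around 10s are expected rather than a sign of slowness.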

Oh, OK, I didn't know about the long-lived connections.

So the Node.js agent's serverTimeout setting must be greater than apiRequestTime. OK, got it!
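A minimal TypeScript sketch of that rule as a configuration sanity check. `parseDurationMs` and `serverTimeoutExceedsRequestTime` are hypothetical helpers, and the accepted units (ms, s, m) are an assumption about the duration format, not a copy of the agent's own parser.

```typescript
// Hypothetical helper: parse duration strings like "500ms", "10s" or "1m"
// into milliseconds. Supported units (ms, s, m) are an assumption.
function parseDurationMs(value: string): number {
  const match = /^(\d+(?:\.\d+)?)(ms|s|m)$/.exec(value.trim());
  if (match === null) {
    throw new Error(`Unsupported duration: ${value}`);
  }
  const amount = Number(match[1]);
  const unit = match[2];
  const factor = unit === "ms" ? 1 : unit === "s" ? 1000 : 60000;
  return amount * factor;
}

// Hypothetical check for the rule of thumb discussed in this thread:
// serverTimeout should be larger than apiRequestTime.
function serverTimeoutExceedsRequestTime(config: { serverTimeout: string; apiRequestTime: string }): boolean {
  return parseDurationMs(config.serverTimeout) > parseDurationMs(config.apiRequestTime);
}
```

For example, `serverTimeout: "30s"` with `apiRequestTime: "10s"` passes, while `serverTimeout: "5s"` with the same apiRequestTime does not. In a real application, both values would be passed as options to the agent's start() call.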

Yes, that is the best practice. It is slightly more subtle, though: serverTimeout is a timeout on socket inactivity, so it gets reset whenever any data is sent through (such as APM transactions or spans, or the agent's regular metricset reports). Also, the agent gzips the data it sends, so some buffering happens there as well.
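Because serverTimeout is an inactivity timeout, what matters is the longest silent gap on the socket, not the total length of the request. The following TypeScript sketch models that; `idleTimeoutFires` is a hypothetical name, and the write times stand for data actually leaving the socket, which, because of gzip buffering, can be less frequent than the events themselves.

```typescript
// Hypothetical, simplified model of an inactivity (idle) timeout: the timer
// restarts on every socket write, so it only fires when a single silent gap
// (including the gap before the request ends) exceeds timeoutMs.
function idleTimeoutFires(startMs: number, writeTimesMs: number[], endMs: number, timeoutMs: number): boolean {
  let lastActivity = startMs;
  const points = writeTimesMs
    .filter((t) => t >= startMs && t <= endMs)
    .sort((a, b) => a - b);
  points.push(endMs); // the request ending stops the timer
  for (const t of points) {
    if (t - lastActivity > timeoutMs) {
      return true; // silence lasted longer than the timeout
    }
    lastActivity = t; // any data resets the inactivity timer
  }
  return false;
}
```

A 60s request with a write every 5s never trips a 30s idle timeout, even though it lasts twice as long as the timeout, whereas a request whose writes are 39s apart does.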