Use nodejs lambda layer, npm package, or both?

I've been seeing a lot of noisy logging from Elastic APM in the CloudWatch logs for my node.js Lambda, despite setting ELASTIC_APM_LOG_LEVEL=error. Hoping that there are some bug fixes available, I was looking into updating the lambda layers, and I notice there's two layers, one for the APM extension and another for the APM agent. I don't know if it's always been that way, but before I was only using the first of those.

If we use that APM agent layer, does that replace the need for installing a dependency on elastic-apm-node, or does requiring elastic-apm-node/start do the same thing as that layer? Or are we supposed to use both the lambda layer and the agent package? If we're supposed to use just the layer, how do we go about calling the agent API, like for setting custom transaction names when the elastic-apm-node package is no-longer a dependency?

Hi @apoco,

It has almost always been that way. In early days before the Lambda support was GA we didn't always have a layer published for the agent.

The setup that is documented is to use both the APM extension layer (required) and the APM agent layer (optional, but typically more convenient) and to not have elastic-apm-node as a dependency in your "package.json". In what we understand to be the common case -- where the Lambda function code is not calling the APM agent's API -- this setup means there user doesn't need any code or package.json changes to their Lambda function.

All the APM agent Lambda layer does is make the elastic-apm-node package and its dependencies available in the Lambda function VM, and puts that directory on the NODE_PATH so that --require "elastic-apm-node/start" works and/or require('elastic-apm-node') works. This is the equivalent to your installing elastic-apm-node yourself in your function code's "package.json" and node_modules/.... In other words: either way will work.

Some notes on pros/cons of either approach for installing the APM agent:

  • If you are using the Agent API, then it might be nicer for your tooling (e.g. for linting, or for auto-complete in your editor/IDE) if you have elastic-apm-node in your package.json.
  • If you are including the APM agent it in your package.json and hence in the .zip file that is uploaded to AWS for new versions of your function, then updates to your function code will take a bit longer, because the update will re-upload the full node_modules/... everytime.
  • Beware that if you do both -- use the APM agent layer and install it in your node_modules/... -- then if they are different APM agent versions there might be some version conflict if features differ between versions.

Can you quote exactly the log messages you mean? Some log messages are from the APM agent and some are from the APM extension. We can then follow up on the appropriate repos. I agree that there is more log output/noise than there should be.

Thanks for the clarifications. Just to be 100% sure, you're saying we don't need to start the agent if using the layer, correct?

Here's an example of the type of output that shows up in our logs. It says it's level warn but it still shows up:

{
    "log.level": "warn",
    "@timestamp": "2022-06-23T23:12:47.121Z",
    "log.origin": {
        "file.name": "extension/process_env.go",
        "file.line": 70
    },
    "message": "Could not read ELASTIC_APM_DATA_RECEIVER_TIMEOUT_SECONDS, defaulting to 15: strconv.Atoi: parsing \"\": invalid syntax",
    "ecs.version": "1.6.0"
}

FYI I did try for my self to remove my NODE_OPTIONS="-r elastic-apm-node/start" environment variable, and now I'm not seeing any APM data, so maybe we do still need to use that?

Also of note my lambda stopped getting the mysterious timeouts now that the agent appears to be no-longer running. Are there any known gotchas for nodeJS lambdas we should be aware of, because we really do want instrumentation, but in CloudWatch it looks like it's working way better without it.

How well has ELASTIC_APM_ASYNC_HOOKS been working as opposed to the node module patching? Does that work more reliably? I will play with it and report back, but any foresight would be appreciated.

Latest on my investigation shows that if I do start the agent but turn off instrumentation we don't have the lambda timeout issues any longer, so there's some kind of bug with instrumentation.

Sorry, I wasn't clear. The APM agent needs to be started somehow, and for a Lambda function invocation the agent needs to be started before the module holding the Lambda function is loaded. That means that using require('elastic-apm-node').start(...) in your Lambda function handler code doesn't work: it ends up being too late for the agent to automatically instrument the handler function. So using NODE_OPTIONS="-r elastic-apm-node/start" to start the APM agent as soon as the node process starts the Lambda Runtime is necessary.

For the APM agent and extension to report APM data it needs to spend some time after the function invocation to send that data on. If your function time tends to run very close to your Lambda function's configured timeout, then the extra time for APM to send on the data might be hitting that timeout. I seem to recall that the default Lambda timeout is quite short.

Can you post some of the logs from CloudWatch showing these timeouts? That might help confirm it is the timeout I'm guessing it might be.

If so, one thing you could try is configuring the APM extension's "send strategy" to only try to send on APM data during the next invocation of your function. That would mean APM data can be a little delayed, but generally that sending APM data won't add extra time at the end of your function invocations. See https://www.elastic.co/guide/en/apm/guide/8.2/aws-lambda-config-options.html#_elastic_apm_send_strategy

I would tend to avoid ELASTIC_APM_ASYNC_HOOKS=false. It cannot properly track async/await usage. Also its usage is much less common and less well tested.

Oh. That is definitely possible. Can you let us know what packages and versions of packages that you are using? Can you isolate it to a particular instrumented module by selectively disabling instrumentation of some modules via ELASTIC_APM_DISABLE_INSTRUMENTATIONS=... (Configuration options | APM Node.js Agent Reference [4.x] | Elastic)?

Another thing that would probably help me a lot in diagnosing issues if if you can get a trace-level log (set ELASTIC_APM_LOG_LEVEL=trace) and show me the logs from an invocation that hits the timeout.

1 Like

Can you post some of the logs from CloudWatch showing these timeouts? That might help confirm it is the timeout I'm guessing it might be.

So good news is I think I may have gotten a false read. What I had done was bump the extension layer and add the agent layer but accidentally disabled starting the agent. After doing so I didn't see those timeout issues. Then I turned off instrumentation and started the agent and the timeout issues were still gone, but what I didn't do is try starting the agent without turning off instrumentation.

Looking back at the logs during the time period where we were getting errors, before the layer updates, I do see some crashes relating to sending things to the APM server; maybe that alone was the cause of our problems, and maybe that has been fixed. Here's a snippet showing a succeeded Lambda execution followed by one that failed:

2022/06/23 19:30:11 Received event: {
    "eventType": "INVOKE",
    "deadlineMs": 1656012670571,
    "requestId": "7985d602-29dc-5601-8ccc-8d2ce8a2cd87",
    "invokedFunctionArn": "arn:aws:lambda:us-west-2:931316656723:function:vnext-event-processor-high-priority-lambda",
    "tracing": {
        "type": "X-Amzn-Trace-Id",
        "value": "Root=1-62b4bf3e-1aebf3d9124186c32dc64b99;Parent=404c7b1f3ea53b6c;Sampled=0"
    }
}
2022/06/23 19:30:11 Checking for agent data
2022/06/23 19:30:11 No agent data on buffer
2022/06/23 19:30:11 Received log event platform.logsSubscription
2022/06/23 19:30:12 Received log event platform.start
2022/06/23 19:30:12 Received log event platform.extension

(logs specific to our handler)

2022/06/23 19:30:19 Adding agent data to buffer to be sent to apm server
2022/06/23 19:30:19 Received agent done signal
2022/06/23 19:30:20 Received log event platform.runtimeDone
2022/06/23 19:30:20 Received runtimeDone event for this function invocation
2022/06/23 19:30:20 APM server response body: 
2022/06/23 19:30:20 APM server response status code: 202
2022/06/23 19:30:20 Checking for agent data
2022/06/23 19:30:20 No agent data on buffer
2022/06/23 19:30:20 Waiting for next event...
END RequestId: 7985d602-29dc-5601-8ccc-8d2ce8a2cd87
REPORT RequestId: 7985d602-29dc-5601-8ccc-8d2ce8a2cd87    Duration: 9561.71 ms    Billed Duration: 9562 ms    Memory Size: 128 MB    Max Memory Used: 128 MB    Init Duration: 3452.04 ms    
START RequestId: 6a988652-920b-5dcd-b676-dd7c5376b1a4 Version: $LATEST
2022/06/23 19:30:20 funcDone signal received, not processing any more agent data
panic: 
send on closed channel

goroutine 16 [running]:
main.main.func3()
    /var/lib/jenkins/workspace/ibrary_apm-aws-lambda-mbp_PR-120/src/github.com/elastic/apm-aws-lambda/apm-lambda-extension/main.go:170 +0x209
created by main.main
    /var/lib/jenkins/workspace/ibrary_apm-aws-lambda-mbp_PR-120/src/github.com/elastic/apm-aws-lambda/apm-lambda-extension/main.go:157 +0x6af
END RequestId: 6a988652-920b-5dcd-b676-dd7c5376b1a4
REPORT RequestId: 6a988652-920b-5dcd-b676-dd7c5376b1a4    Duration: 85.04 ms    Billed Duration: 86 ms    Memory Size: 128 MB    Max Memory Used: 128 MB    
"RequestId: 6a988652-920b-5dcd-b676-dd7c5376b1a4 Error: exit status 2
Extension.Crash"

I'll try re-enabling instrumentation; it might just be that the layer update was the fix, not turning that off.

I'll try re-enabling instrumentation; it might just be that the layer update was the fix, not turning that off.

Yes, things look fine after re-enabling instrumentation. Sorry for that red herring, and thanks again for answering the question about starting the agent.

Whoa, the extension should never crash. If you are able to reproduce that with the current latest extension version (current latest is v1.0.2, Releases · elastic/apm-aws-lambda · GitHub) then we'd really appreciate a bug report (Issues · elastic/apm-aws-lambda · GitHub) and a chance to track that down.

I think we're good. I had done a major upgrade from:

arn:aws:lambda:us-west-2:267093732750:layer:elastic-apm-extension-ver0-0-3-x86_64:3

...to:

arn:aws:lambda:us-west-2:267093732750:layer:elastic-apm-extension-ver-1-0-2-x86_64:1

...and so far there there doesn't seem to be any crashes. I do see in your Github issue tracker that crashes like the one I noticed had been reported and fixed before.

Sweet. Thanks for getting back.

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.