APM with AWS Lambda Java

Hi,

I am trying to use apm-java-attach on AWS Lambda (OpenJDK 8), but I am getting "Attaching the Elastic APM Java agent is not possible on this JVM".

It looks like the APM attach does not work on the AWS Lambda OpenJDK 8?

Welcome to the forum and thanks for your question!
Probably this is because the Java runtime used to run your Lambda is a JRE, rather than a full JDK. In any case, it seems it does not include a tools.jar (at least not a valid AttachProvider).
Is this something you can control?

Hi , thanks for replying.

I've sorted out the issue already.

Yup, so by default, the OpenJDK runtime provided by AWS Lambda doesn't include tools.jar on the classpath. I have to provide it myself and then set the property "co.elastic.apm.attach.bytebuddy.agent.toolsjar" to the path of the tools.jar.
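For anyone following along, here is a minimal sketch of that setup. The property name is the one quoted above; the class name, the helper method, and the /var/task/lib/tools.jar location are hypothetical and depend on how you package the jar. The attach call itself is left as a comment so the sketch runs without the agent on the classpath.

```java
import java.io.File;

public class ToolsJarSetup {
    // Property name as quoted in this thread.
    static final String TOOLS_JAR_PROPERTY = "co.elastic.apm.attach.bytebuddy.agent.toolsjar";

    /**
     * Points the attacher at a bundled tools.jar.
     * Returns true if the file exists and the property was set, false otherwise.
     */
    static boolean configureToolsJar(String toolsJarPath) {
        File toolsJar = new File(toolsJarPath);
        if (!toolsJar.isFile()) {
            return false;
        }
        System.setProperty(TOOLS_JAR_PROPERTY, toolsJar.getAbsolutePath());
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical location: wherever your deployment package puts the jar.
        String path = args.length > 0 ? args[0] : "/var/task/lib/tools.jar";
        if (configureToolsJar(path)) {
            // With the apm-agent-attach dependency available, the attach call
            // would go here, after the property is set:
            // co.elastic.apm.attach.ElasticApmAttacher.attach();
            System.out.println("tools.jar configured: " + System.getProperty(TOOLS_JAR_PROPERTY));
        } else {
            System.out.println("tools.jar not found at " + path);
        }
    }
}
```

The property has to be set before the attach call, otherwise the attacher has no way to find the jar.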

It all works fine now, however, I am having small issues with the reporting.

I need to flush all the message / buffers in the agent to the APM server before the Lambda Handler method returns.

Are you aware of any way in the agent code that allows me to manually flush all the transaction/span to the APM server?

Yeah, that's another issue we need to see how we tackle with regards to AWS Lambda and the like...
Due to the costs of compression and transmission, we send batched events to APM server in the background. In addition, we attach common metadata once per batch, which is another optimization. So the optimal solution would be to find a way to utilize the existing communication mechanisms, but that depends on the extent of control you have over the JVM.

There is no way to flush events manually, but one thing you can try is set the api_request_size configuration option to a very low value and wait some time after ending the traced transaction before letting the Lambda Handler return. This configuration sets the maximum time we allow to accumulate events before sending them. However, note that this may incur considerable overhead, mostly on the JVM but also on the APM server.

In general, we have made a lot of effort to make sure we don't do any heavy blocking actions on the request-handling thread, and this defies that principle.
Just some questions out of curiosity:

  • Do you intend to do this in a service deployed at production? What volumes are we looking at?
  • Do you intend to do this blocking or async in any way?
  • Did you use the attach API as well within the Lambda or as an initiation thing?
  • What kind of control do you have over the JVM used by AWS to run your Lambdas (eg shared JVM resources, JVM configuration, JVM initiation capabilities)?

Let me answer your questions first:-

  • Yes. It will be used in a production environment. We are looking at a potential load of 1k requests per second.
  • I am currently using the internal reporter config (report_sync), as this is the only way I know of to have the values reported back to the APM server reliably. I am not sure how I can optimise it to work asynchronously, because as soon as the handler method returns in the Lambda context, the JVM is suspended and no threads run until the next request comes in.
  • I am currently using the attach() mode as I can't use the javaagent on AWS Lambda.
  • Not much. It is very limited. I was lucky to figure out that there is an option I can use on the agent to set the tools.jar path (co.elastic.apm.attach.bytebuddy.agent.toolsjar), otherwise I wouldn't be able to use APM at all on AWS Lambda.

As I mentioned earlier, I am currently using the "report_sync" config and it seems to do the job of reporting the values; however, it will certainly impact the response time of the APIs (from the client's perspective).

I will give api_request_size a go and see how well it can do. Let me know if you have a better idea.

Right- this is blocking the response.

I put the right link, but with the wrong config option, I meant api_request_time!
But note that this will help you only if you can release the response back to the client and then wait within the handler. Otherwise this is essentially blocking as well.

But are you using it each time the Lambda is invoked, or as some kind of JVM initiation?

The way AWS Lambda works is that the code initialises once and stays in memory. So the attach() will only happen once per JVM instance.

OK. I will use api_request_time and give it some time before the method ends and see how it goes.
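To illustrate the once-per-JVM behaviour, here is a rough sketch that puts the one-time work in a static initializer. The class and method names are hypothetical, and the counter stands in for the real attach call, which is shown only as a comment.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OncePerJvmHandler {
    // Counts how often the one-time initialisation ran (stand-in for the real attach).
    static final AtomicInteger INIT_COUNT = new AtomicInteger();

    static {
        // A static initializer runs once, when the class is first loaded,
        // i.e. on a cold start. The attach would go here, for example:
        // co.elastic.apm.attach.ElasticApmAttacher.attach();
        INIT_COUNT.incrementAndGet();
    }

    // Hypothetical handler method; a real one would follow the Lambda handler signature.
    public String handleRequest(String input) {
        return "handled " + input;
    }

    public static void main(String[] args) {
        // Several invocations, possibly on several handler instances, share one initialisation.
        System.out.println(new OncePerJvmHandler().handleRequest("first"));
        System.out.println(new OncePerJvmHandler().handleRequest("second"));
        System.out.println("initialisations: " + INIT_COUNT.get());
    }
}
```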

Thanks for your great help in this matter. Very very much appreciated.

My guess is that if the APM reporter allowed manually flushing the messages to the server (similar to a logger's flush method), that would solve my issue. I might look at the agent's reporter source code and see if it is doable.

Thanks for the feedback! If that's a valid option (ie can be called without blocking the response) with acceptable overhead - we can certainly consider it (or something of that sort).
This may still be used for batching as well, for example- if you invoke your service synthetically once every 10 seconds with a marked request that tells the service to invoke blocking flushes.

I think we can't assume that the program is able to flush the report on its own thread in the Lambda world. Even with the 'api_request_time' approach, there is no 100% guarantee that the thread is able to stay alive to do that task.

My preferred approach is to have the reports flush (in blocking) manually if needed. We can do whatever cache/batching we need as usual, but the flush method can just flush all the span/transaction reports in memory to the APM server at will.

I certainly think the Lambda world will require some creativity and adjustments :slight_smile:

That's what I tried to suggest. I am looking for a way to use the "manual flush" for batching. If you have a way to configure scheduled executions (maybe using CloudWatch events?) - that's great.

Otherwise, you will need to cache and flush periodically on a Lambda-executing thread. For that, I can (currently) see two options, both require some logic in the Lambda code:

  1. Flush if the decided flush-interval has elapsed since the last flush
  2. Use periodic fake requests to your service that will be used for the flush, for example, a request that contains some query parameter telling it to flush and do nothing else

The first is easier to implement, but it has no guarantee about flush timings. The second requires, in addition to changes in the service code, something that creates periodic requests. Luckily, we have Uptime :slight_smile:.
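As a rough sketch of the first option, assuming some blocking flush hook is available: the class below is hypothetical, and the flush action is passed in as a plain Runnable rather than a real agent call. It flushes only when the chosen interval has elapsed since the last flush, which also shows why there is no timing guarantee: nothing happens unless an invocation arrives.

```java
import java.util.function.LongSupplier;

public class IntervalFlusher {
    private final long intervalMillis;
    private final Runnable flushAction; // hypothetical hook, e.g. a blocking flush
    private final LongSupplier clock;   // injectable time source, in milliseconds
    private long lastFlushMillis;

    public IntervalFlusher(long intervalMillis, Runnable flushAction, LongSupplier clock) {
        this.intervalMillis = intervalMillis;
        this.flushAction = flushAction;
        this.clock = clock;
        this.lastFlushMillis = clock.getAsLong();
    }

    /**
     * Call at the end of each invocation, on the Lambda-executing thread.
     * Runs the flush only if the interval has elapsed since the last flush.
     * Returns true if a flush was performed.
     */
    public synchronized boolean flushIfDue() {
        long now = clock.getAsLong();
        if (now - lastFlushMillis < intervalMillis) {
            return false;
        }
        flushAction.run();
        lastFlushMillis = now;
        return true;
    }

    public static void main(String[] args) {
        IntervalFlusher flusher = new IntervalFlusher(
                10_000, () -> System.out.println("flushing"), System::currentTimeMillis);
        System.out.println("flushed: " + flusher.flushIfDue());
    }
}
```

Events buffered after the last invocation stay in memory until the next request triggers a flush, so a quiet period can delay or lose them if the instance is recycled.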

And maybe- manually flushing after each execution will not be that bad even at 1k requests per second.

I will build a quick (and naive) implementation for a blocking manual flush and let's see how it works out.

Yup. Awaiting your glorious return (with the code change). :smiley:

Hi again :slight_smile:

Please download this API snapshot jar and this agent snapshot build and try them out.
I added a co.elastic.apm.api.ElasticApm#blockingFlush API.
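For illustration, here is one way the handler side could be structured. This is only a sketch: the wrapper class is hypothetical, and the flush is passed in as a Runnable so the example runs without the snapshot jars. With the snapshot API on the classpath, the Runnable would presumably just call the new blockingFlush method.

```java
import java.util.function.Function;

public class FlushingHandler<I, O> {
    private final Function<I, O> delegate; // the actual business logic
    private final Runnable flush;          // hypothetical hook for the blocking flush

    public FlushingHandler(Function<I, O> delegate, Runnable flush) {
        this.delegate = delegate;
        this.flush = flush;
    }

    /**
     * Runs the delegate and then flushes before returning, even if the
     * delegate throws. The delegate should end its transaction before it
     * returns, so that the flush has a completed transaction to send.
     */
    public O handle(I input) {
        try {
            return delegate.apply(input);
        } finally {
            flush.run();
        }
    }

    public static void main(String[] args) {
        FlushingHandler<String, String> handler = new FlushingHandler<>(
                name -> "Hello, " + name,
                () -> System.out.println("flushing buffered events"));
        System.out.println(handler.handle("Lambda"));
    }
}
```

Because the flush runs on the request-handling thread, its duration adds directly to the response time seen by the client.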

Note that this is just a quick test build, not implemented the way we would eventually want it, but it should be good enough to get some feedback.

Will be waiting for your feedback on that.
Thanks,
Eyal.

wow... you are super quick!...

I will give it a go tomorrow and see how it goes. :slight_smile: and report it back. stay tuned


Hey!
Any updates? Any feedback on this would be very useful for us - basic tracing functionality, overhead, other functionality (eg how does the Metrics tab look and does it make sense to you?).
Thanks,
Eyal.

Hey Eyal,

Thanks for keeping an eye on this. I have been caught up with a production issue for the past few days and haven't had a chance to use the snapshot jar. But I will definitely do it over the weekend and let you know how it goes.

@f136989c Hey!
Did you get a chance to test this?

Hey @Eyal_Koren, I finally got a chance to test this using the snapshot. It seems to be working fine with the blocking flush.

What's the plan now? I am happy to use the snapshot for now.

Hi again Johnny!

Great, so please do for now. The only limitations are that you won't be able to upgrade until we implement that officially and once we do, you may need to adjust your code to use the new API.

I plan to open a GitHub issue that will cover the "official" blocking flush implementation and API. However, I would like this issue to also include documentation of what's needed in order to properly trace AWS Lambda. For that, if you can spend the time to provide the following info, that would be incredible:

  • Anything special you were required to do in order to make this work- eg anything you did to get the tools.jar available to the agent
  • The exact settings you used, for example- where and how you added the attach jar and the code to make the attach.
  • Anything you noticed not working as expected? Specifically- do the metrics make sense?
  • Anything you think we need to do differently for Lambda- eg hide specific metrics, report different metrics etc.
  • Anything you can say about the overhead of using the agent?

The more you elaborate on those, the more you can make this useful for future users following your steps.

Thanks a lot for your great feedback!!
Eyal.
