100% sampling vs. 10% sampling

APM Java Agent language All versions

Has anyone else stress tested the APM Java agent at extreme loads and measured the CPU and memory resource hit to the service? We have, and 100% sampling is possible with a small footprint if you follow the methodology below.

Results for Test #1

  • 100% sampling including un-redacting and sanitizing the post body for PWD, UID, SSN, etc. (event is sanitized before leaving the container).
  • Running 3 times production load, the agent overhead is less than 2% with an efficient buffer of 1000 and no loss of events.

Results for Test #2

  • 10% sampling including un-redacting and sanitizing the post body for PWD, UID, SSN, etc. (event is sanitized before leaving the container).
  • Running 3 times production load, the agent overhead is between 5% and 8% with an efficient buffer of 1000 and no loss of events.
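
For readers who want to picture the in-container sanitization step used in both tests, here is a minimal Python sketch. It is not the Java agent's implementation; it assumes a name-value-pair post body, and the key list, the [REDACTED] placeholder, and the sanitize_body name are illustrative assumptions.

```python
import re

# Keys to mask; PWD, UID and SSN come from the post, the rest is assumed.
SENSITIVE_KEYS = ("pwd", "uid", "ssn")

# Match key=value where the key is one of the sensitive names (case-insensitive).
# The \b keeps a key such as "guid" from being treated as "uid".
_PATTERN = re.compile(r"(?i)\b(" + "|".join(SENSITIVE_KEYS) + r")=([^&]*)")


def sanitize_body(body: str) -> str:
    """Return the name-value-pair body with sensitive values masked."""
    return _PATTERN.sub(lambda m: m.group(1) + "=[REDACTED]", body)


if __name__ == "__main__":
    print(sanitize_body("action=Login&UID=bob&PWD=secret"))
```

The point is only that the values are masked before the event leaves the container, while keys such as action stay available for later extraction.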

In summary, 100% sampling is more efficient (less than 2% agent overhead versus 5% to 8% in our tests), and you can do custom aggregations and transaction separations in the APM Pipeline using the fields extracted from the Post body. Once the fields are extracted from the Post body, drop the Post body as the last step in the APM Pipeline and do NOT store it on the transaction events in production. You can leave it on in preprod so you can validate that the fields selected for sanitization are sanitized in the container and not passed to the APM Pipeline.

Extracting the fields in the APM Pipeline and naming them with a labels.myco_ prefix, such as labels.myco_action, extends the Elastic schema so you can easily add the business perspective to a technology- and framework-based monitoring tool. You need one agent config for all services in the ecosystem, since enrichments and aggregations are performed in the APM Pipeline. That is much easier for one team to audit and control across the enterprise ecosystem. Also, the summarizations and aggregations (i.e., sampling or aggregation) are now controlled by you in the Pipeline.

The extraction is done against the Post soapAction or the Post action for the HTTP interface and is common across the enterprise. Create an NVP, SOAP, XML, etc. section in your pipeline and call it based on checking the serviceName on the event. You get the idea: the interface can be XML, name-value pair, etc., and you can extract the fields you need for identification. You can even copy the Transaction field to the labels.myco_action field so one dashboard works for all of your services. We have about 8 standard dashboards for all debugging and analytics, and they are common to the services that use our methodology. The messaging interface can be handled the same way.
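
To make the pipeline flow above concrete, here is a rough Python sketch of the logic only. A real implementation would live in the APM ingest pipeline, not in Python; the service names, the flat event layout (service_name, transaction_name, request_body), and the helper names are all hypothetical and chosen just for illustration.

```python
import re
from urllib.parse import parse_qs

# Hypothetical mapping of service name -> interface format.
SERVICE_FORMATS = {"orders-nvp": "nvp", "billing-soap": "soap"}


def extract_nvp(body):
    """Return the 'action' parameter of a name-value-pair body, if any."""
    values = parse_qs(body).get("action")
    return values[0] if values else None


def extract_soap(body):
    """Return the local name of the first element inside the SOAP Body, if any."""
    match = re.search(r"<(?:\w+:)?Body[^>]*>\s*<(?:\w+:)?(\w+)", body)
    return match.group(1) if match else None


EXTRACTORS = {"nvp": extract_nvp, "soap": extract_soap}


def process(event):
    """Enrich one transaction event (a dict) and drop the post body last."""
    enriched = dict(event)
    labels = dict(event.get("labels", {}))
    fmt = SERVICE_FORMATS.get(event.get("service_name"))
    body = event.get("request_body")
    action = EXTRACTORS[fmt](body) if fmt and body else None
    # Fall back to the transaction name so one dashboard works for all services.
    labels["myco_action"] = action or event.get("transaction_name")
    enriched["labels"] = labels
    # Last step: never store the post body on the event.
    enriched.pop("request_body", None)
    return enriched
```

Routing on the service name keeps each interface format in its own section, and dropping the body as the final step means every extractor still sees it.
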
The alternative of splitting transactions and aggregating the metrics in the agent is very expensive for any agent. Streaming raw events to the Pipeline adds a little load to the APM Pipeline, but Pipeline processing is by far more efficient than the heavy agent-side regex and string manipulation done by all sample-based systems. The gzip-compressed format used to send the event buffer to the Pipeline is efficient for sending 100% of the raw events.

Once you create the summary or aggregated data in your summary processors, you can drop the original event to save storage. You can aggregate events based on certain business transactions or on the labels.myco_action field I mentioned earlier. Create multiple aggregation processors and call the one you want for the specific transaction of the specific service, or let 100% go into Elastic for some transactions while aggregating others. I do not recommend dropping events without an aggregation, in case the dropped transaction is the one causing a memory leak or JVM issue.

Summarizing on ingestion is powerful: you can create monitoring dashboards against the summary data while using Kibana drilldowns to link to a debug dashboard that uses the raw events. This technique optimizes the queries against the Elastic cluster. Monitoring dashboards use event fields containing the aggregated counts, response times, and 90th percentile, and they run against the summary index, which keeps event fields to a minimum: no host name, user agent, geo location, IP, error counts, etc. exist on summary events. For example, keep only the fields you will use in the monitoring dashboard, such as the summary event for each labels.myco_action transaction, and use the drilldown feature to drill down from the error event count to the raw events containing the error.
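
As an illustration of what a summary event could carry, here is a small Python sketch that groups raw events by labels.myco_action and keeps only a count, an average, and a 90th percentile. The field names (duration_ms, count, avg_ms, p90_ms) and the nearest-rank percentile method are my assumptions for the example, not what any Elastic processor produces.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def summarize(events):
    """Emit one slim summary event per labels.myco_action value."""
    durations = defaultdict(list)
    for event in events:
        durations[event["labels"]["myco_action"]].append(event["duration_ms"])

    summaries = []
    for action, values in sorted(durations.items()):
        summaries.append({
            "labels": {"myco_action": action},
            "count": len(values),
            "avg_ms": sum(values) / len(values),
            "p90_ms": percentile(values, 90),
        })
    return summaries
```

Everything not needed by the monitoring dashboard is simply never copied onto the summary event, which is what keeps the summary index small.
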

Once this is done, you can query Elastic for "labels.myco_*" and see all the schema extensions across the ecosystem if you use federated clusters. Using this approach, you can create a Data Dictionary for the custom fields you use throughout the enterprise and enrich the events with an enterprise field name so you can quickly search by the enterprise alias. I recommend keeping the original name on the event and creating an enterprise alias on the event, so the application owner does not lose the names they use to deal with vendors, their developers, support, or anyone else. If you follow this approach, you will have clean, correlated events, and your Data Scientists and Analytics teams will be far more efficient than they are with log files. Elastic log correlation that ingests only the fields you need is awesome!

The last thing I need is access to the response body on the transaction event and the request and response objects of selected span events. Then I ask, why do I need log file analytics and monitoring? Keep log files for regulatory and audit purposes only, or for double coverage in the event of a dropped event in either system. Build a full-fledged BTPM (Business Transaction Performance Monitoring) solution instead of a standard aggregated APM (Application Performance Monitoring) tool. Help me push 100% Distributed Tracing to the max using Elastic, where I own the schema and not a vendor. We can share Kibana dashboards across the industry for debugging, monitoring, business transaction analytics, backend analytics, etc.

We store 100% of log messages, and it is a pain to correlate log messages without trace IDs. (We are injecting trace.id values on log events where possible, but it can break the JSON format.) We keep messages for dependency calls in logs, so why not use the same capture patterns for span events that we use for logging span calls? Showing the developers Transaction Analytics or Backend Analytics on a Kibana dashboard is powerful.
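
The Data Dictionary and enterprise alias idea from this post can be pictured with a short Python sketch. The dictionary entries and field names below are invented for the example; the only point is that the original field stays on the event and the alias is added next to it.

```python
# Hypothetical data dictionary: application-owned field -> enterprise alias.
DATA_DICTIONARY = {
    "labels.myco_acct_no": "labels.myco_account_id",
    "labels.myco_txn_type": "labels.myco_action",
}


def add_enterprise_aliases(event, dictionary=DATA_DICTIONARY):
    """Return a copy of a flat event with alias fields added beside the originals."""
    enriched = dict(event)
    for original, alias in dictionary.items():
        # Only add the alias when the original exists and the alias is not already set.
        if original in event and alias not in event:
            enriched[alias] = event[original]
    return enriched
```

Because nothing is renamed, the application owner can still search by the original name while everyone else searches by the enterprise alias.
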
A feature change that lets us compare fixed windows at different times and from different clusters would allow us to compare a 30-minute stress test to a peak day, or to an error event in production, and see within minutes how production load differs from test. We could then generate a coverage report, or check it live, for various peak windows or peak days such as Fridays, the last day of the month, or the first days of the month, or just compare stress test windows to a 30-minute window in real time.

I am open to discussion on this approach, which needs only a handful of people to onboard all your applications across the enterprise. Make sure you test every agent version so as not to drop a single event. We need resiliency built in to make sure we do not drop events, even if that means writing zipped buffers to disk and ingesting them later when the system recovers.
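
To show what I mean by sending zipped buffers to disk and ingesting later, here is a minimal Python sketch of the idea. It is not an existing agent or APM Server feature; the file naming, the NDJSON layout, and the spool and replay function names are assumptions made for the example.

```python
import gzip
import json
import os
import tempfile


def spool(events, directory):
    """Write one buffer of events as a gzipped NDJSON file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".ndjson.gz", dir=directory)
    with os.fdopen(fd, "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
        for event in events:
            gz.write((json.dumps(event) + "\n").encode("utf-8"))
    return path


def replay(directory, send):
    """Send every spooled buffer; delete a file only after send() succeeds."""
    sent = 0
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".ndjson.gz"):
            continue
        path = os.path.join(directory, name)
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            events = [json.loads(line) for line in handle]
        send(events)  # if this raises, the file stays on disk for the next attempt
        os.remove(path)
        sent += len(events)
    return sent
```

Deleting the file only after a successful send is what keeps an outage from turning into dropped events.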

