Dec 13th, 2019: [EN][APM] How to diagnose web app flu

'Tis the season for festivities, overindulgence and occasional illness. We could suggest season-appropriate decoration (at least if your app is hosted in the Northern hemisphere). However, today we will focus on apps which are feeling unwell and have become sluggish - and how to use APM to diagnose them.

We've all used text logs to diagnose an app. CPU, RAM, disk, and network-use metrics are also not difficult to obtain. An ideal expansion of these tools would be the ability to X-Ray an app, seeing its components and potential bottlenecks all in one place. That's precisely what an APM system does, but despite over a decade of development in this space, a surprising number of app maintainers don't use one. We think the free APM we make at Elastic is pretty good and a breeze to use, so let's take a look at an ill Python app:

As you can see, this is an app which allows users to crowdsource information about shops serving hot chocolate around the world. But, it's having problems. For a while, admins have been complaining that the app feels sluggish, and now similar reports are to coming in on the app's community forums.

Let's add Elastic APM instrumentation to the app. It only takes a few lines:

Restart the app on the servers and we're off! Even without time for more advanced testing practices, we can simply test in production. Soon, a particularly annoyed and detail-oriented user files a report:

We received an email from a user complaining of poor performance on the site. They tried several times but the site seems unresponsive and laggy. Details below:

Username: janedoe

Browser: IE (no version provided)

Time of Usage: Between roughly 02:05AM and 02:17AM GMT

Alright, let's open the APM system:

There are some important numbers here: the average response time, of course. But notice how it's not showing much of an anomaly. That's why the 95th and 99th percentile metrics are important - they refer to our data collection attempts to understand what the experience will be like for 95% or 99% of our users respectively. A large number of fast but irrelevant transactions may make the average response look fast to us, but our users are still annoyed.

Then, looking further down, we can see that activities within the app ("transactions") are ordered by impact - the impact of each transaction on server resources and the app's general ability to respond.

It seems at first glance that listing hot chocolate establishments is having the most impact. However, the transactions are not especially slow - looks like normal app operation.

Let's zoom in on the timeframe from that user report.

Now if we look at the transactions ordered by impact, we see that updating the record of a hot chocolate store is having the most impact. 4.2 seconds on average and 13 seconds for 95% of users is pretty bad. Let's take a closer look by clicking the update_store transaction.

This is a pretty typical page for an APM system. A helpful feature here is Transactions duration distribution - the ability to see how many of this kind of transaction there are and grouping them by performance. Normally this route is fast, but occasionally it is really slow. Scrolling down further, we see what the transaction trace of a typical, fast transaction looks like for this route.

That colourful display is at the heart of what makes an APM system so uniquely useful. We can see how long the app spends in different parts of the code, what database queries are made and how long they take. We can even track the request as it's passed between services over HTTP - the different colours on the bar chart represent different services, a feature called "Distributed Tracing". In this case the request is tracked from the front-end (client Javascript can be monitored) through the back-end all the way to the data storage layer.

Now that we know what a typical transaction looks like, let's look at some problematic ones by selecting the bucket of transactions that took 12-13 seconds.

Interesting, looks like some kind of error is happening during zip code validation. If we click on the deepest part of the stack where the error first occurs, we'll see a huge trove of information automatically captured by the APM system for this request.

http.response.headers.errors is particularly interesting - the APM has captured the transaction's inputs. Looks like an overly long invalid zip code is creating this specific problem. Having transactions run for 12 seconds will soon use up all available workers in a typical Python web app deployment. It'll make all transactions look slower ... but using an APM we've tracked down the original culrprit.

If we look at the code, we can see the issue - a regex (of course!). We can reproduce this in the app itself or by running the piece of code flagged by Elastic APM in a python interpreter.

We hope you've enjoyed this example of diagnosing a sluggish app! You can read more about the free Elastic APM here. If you're itching to deploy it to your app, start on the install instructions here. So far we focussed on the app and its problem, but if you're curious how Elastic APM collects its data, this graph shows the setup clearly: