If you have ever ventured into SRE territory, there is a very good chance you have already met some of my favourite characters: the SLO, the SLA and the SLI. They are very similar, yet they wear very different hats. With only one letter to tell them apart, who could blame you for mixing them up?
My Christmas bonus to you in this article is to guide you through this never-ending SRE adventure and clarify who’s who. Taking them in order of appearance, SLOs, or Service-Level Objectives, tell us what our target for system availability is. If you are an SRE, you probably should not define these on your own; you will want other stakeholders in your company to help you select what is advisable and specific to your organisation. The bottom line is that you have to have them, and they have to be chosen in relation to the SLAs your company has committed to. The SLA, or Service-Level Agreement, promises that if your service falls below the level of availability you consented to, some kind of penalty will be paid. Basically, the client gets their money back if availability drops below a certain threshold.
Last but not least, SLIs, or Service-Level Indicators, come in handy because they measure how your system is actually doing. They observe your service’s behaviour and indicate whether your system has been running within its SLO for a given interval of time.
Now, ain’t this a fancy explanation? Let’s dive into something more concrete.
The SLA is the agreement your company makes with its clients.
SLOs are the goals your team must hit to meet the SLAs.
SLIs are the real numbers showing how your system actually performs.
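To make the relationship concrete, here is a minimal sketch, assuming a simple request-based availability SLI (good events over total events) and an illustrative 99.9% target; none of the names or numbers come from a real system.

```python
# Minimal sketch: an availability SLI checked against an SLO target.
# All numbers and names here are illustrative assumptions.

SLO_TARGET = 0.999  # 99.9% of requests must succeed

def sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    return good_events / total_events if total_events else 1.0

def error_budget_remaining(good_events: int, total_events: int) -> float:
    """Share of the error budget left: 1.0 = untouched, 0.0 = exhausted."""
    budget = 1.0 - SLO_TARGET                      # allowed failure rate
    burned = 1.0 - sli(good_events, total_events)  # actual failure rate
    return 1.0 - burned / budget

# Example: 1,000,000 requests, 400 of them failed.
good, total = 999_600, 1_000_000
print(f"SLI: {sli(good, total):.4%}")                   # 99.9600%
print(f"Within SLO: {sli(good, total) >= SLO_TARGET}")  # True
print(f"Error budget left: {error_budget_remaining(good, total):.0%}")  # 60%
```

The error budget is simply the inverse of the target: a 99.9% SLO leaves you 0.1% of requests to burn before you are out of budget.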
If you have bad SLOs right out of the box, then your SRE practices, customer experience and DevOps framework will basically shatter. You need well-defined SLOs to honour the SLA your company committed to, and once you do have your SLOs defined, you will also need well-defined SLIs to measure your system’s performance.
Enough theory; let’s see what a bad SLO and a good SLO look like and how they behave.
BAD SLO’s Objective:
Ensure the application’s uptime is satisfactory.
BAD SLO’s Metric:
Availability
BAD SLO’s Threshold:
Try to keep the system up as much as possible.
BAD SLO’s Observation Window:
Continuous monitoring with no specified timeframe.
BAD SLO’s Rationale:
We want the system to be available most of the time.
Now, turning to a good SLO (with a short sketch of how you might measure it right after the list).
GOOD SLO’s Objective:
Ensure API Response Time meets user expectations.
GOOD SLO’s Metric:
Response time of API endpoints.
GOOD SLO’s Threshold:
Keep the response time under 200 milliseconds for 95% of API requests.
GOOD SLO’s Observation Window:
Continuous monitoring over a rolling 7-day period.
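Before we pick them apart, here is a minimal sketch of how the SLI behind this good SLO could be computed, assuming you have per-request latency samples with timestamps; the data, names and window handling are illustrative, not how any particular tool does it.

```python
# Sketch: compute the SLI for "95% of API requests under 200 ms,
# over a rolling 7-day window". Names and data are illustrative.
from datetime import datetime, timedelta, timezone

THRESHOLD_MS = 200.0
TARGET = 0.95
WINDOW = timedelta(days=7)

def latency_sli(samples: list[tuple[datetime, float]],
                now: datetime) -> float:
    """Fraction of requests in the last 7 days faster than the threshold."""
    window_start = now - WINDOW
    recent = [ms for ts, ms in samples if ts >= window_start]
    if not recent:
        return 1.0  # no traffic, nothing violated
    good = sum(1 for ms in recent if ms <= THRESHOLD_MS)
    return good / len(recent)

now = datetime.now(timezone.utc)
samples = [
    (now - timedelta(hours=1), 120.0),
    (now - timedelta(hours=2), 340.0),   # one slow request
    (now - timedelta(days=2), 95.0),
    (now - timedelta(days=10), 900.0),   # outside the window, ignored
]
sli = latency_sli(samples, now)
print(f"SLI: {sli:.2%}, within SLO: {sli >= TARGET}")  # 66.67%, False
```

Notice how every part of the SLO definition (metric, threshold, observation window) maps directly to a line of this check; that is exactly what the bad SLO above makes impossible.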
The BAD SLO is bad because it’s vague and subjective, it lacks quantifiable metrics, its threshold is undefined and it has no observation window.
What makes an SLO a good one: it’s specific and measurable, it’s user-centric, it’s quantifiable and achievable, and it has a defined timeframe.
You recognise a well-defined SLO because it focuses on a crucial aspect of service quality and provides clarity, measurability and alignment with user expectations, all of which are essential for effective monitoring and evaluation of service reliability.
You are probably wondering where Elastic stands on good and bad SLOs. How are we doing on that? With the 8.11 release of Elasticsearch, we are promoting SLOs right in the UI, organised as a dashboard list that gives the user a quick summary of what’s happening in each SLO.
In the detailed view of an SLO, the user gets the SLI history, error budget burn-down charts and the currently active alerts. SLOs are also easily configurable via Kibana, under Stack Management.
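If you would rather automate this than click through the UI, Kibana also exposes an SLO API. The sketch below is my best understanding of its shape as of 8.11; the endpoint payload fields, index name and credentials are assumptions you should check against the Kibana docs for your version.

```python
# Hedged sketch: creating an SLO through Kibana's SLO API.
# Payload shape follows the Kibana 8.11 SLO API as I understand it;
# verify field names against the docs for your release.
import requests

KIBANA_URL = "https://localhost:5601"  # assumption: your Kibana address

slo = {
    "name": "API latency",
    "description": "95% of API requests under 200 ms over 7 days",
    "indicator": {
        "type": "sli.kql.custom",
        "params": {
            "index": "my-api-logs-*",      # assumption: your index pattern
            "good": "latency_ms <= 200",   # what counts as a good event
            "total": "latency_ms : *",     # what counts as any event
            "timestampField": "@timestamp",
        },
    },
    "timeWindow": {"duration": "7d", "type": "rolling"},
    "budgetingMethod": "occurrences",
    "objective": {"target": 0.95},
}

resp = requests.post(
    f"{KIBANA_URL}/api/observability/slos",
    json=slo,
    headers={"kbn-xsrf": "true"},   # Kibana requires this header
    auth=("elastic", "changeme"),   # assumption: your credentials
)
resp.raise_for_status()
print("Created SLO:", resp.json())
```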
I hope I’ve made you curious enough to have a go at creating your own SLOs and SLIs in Elasticsearch and fully honour your SLAs.
I hope you have a very enjoyable Christmas!