While many people are singing "It's the Most Wonderful Time of the Year...", I catch myself making a small adjustment:
"It's the Most Exciting Time of the Year..."
I live in the Netherlands, and mid-November to mid-December is just so exciting! From more traditional occasions like Sinterklaas (which we celebrate today) to more global ones like Black Friday, I know this is the moment when the systems I am working on will be challenged the most!
Honestly, I feel this is a great challenge for me as an engineer too. Our industry has switched from denying failure to embracing it. Things can go wrong, and usually something will. Keeping this mantra in mind, we not only build systems that work well and can handle the increased load of these days, but systems that can tolerate failures and bounce back.
As wonderful a challenge as this is, there is also reality: time constraints and other trade-offs we have to make throughout the year. As a result, our systems might not be able to bounce back from every failure on their own; they might need a bit of help. Our help.
That is what this post is all about: we would like to bring our troubleshooting guides to your attention. Let's see how they can be useful.
Once upon a time without troubleshooting guides, at a place not that far away, an engineer was enjoying their evening when their phone started buzzing. Indices are missing data, and the Elasticsearch /_cluster/health API reports red; that's pretty serious. They check the servers and see that the Grinch has pulled the power cables of half of them and locked the doors, so those servers are as good as gone.
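To picture the moment: checking the cluster health from Kibana Dev Tools (or curl) looks roughly like this, with the response trimmed down to the fields that matter for the story (the cluster name and shard counts here are made up):

```
GET /_cluster/health
```

```
{
  "cluster_name": "my-cluster",
  "status": "red",
  "active_shards": 42,
  "unassigned_shards": 7
}
```

A red status means at least one primary shard is unassigned, so some data is currently unavailable.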
But fear not: the engineer knows they can restore the data from a snapshot; the only problem is that they have never done this during an incident before. With or without previous experience, the problem needs to be addressed, so the engineer begins:
Analysis
First comes the analysis: which indices are missing data, and how can we fix them?
- The engineer goes to the cat indices API guide to see how to find all the indices missing data (see the sketch right after this list). There are two indices missing data: a simple index and the write backing index of a data stream. This complicates things a bit.
- After running the request and collecting the names of the indices, they go to the restore a snapshot guide. There is a lot of information there covering many cases, and the engineer has to go through all of them to find the one that fits their situation best, which is not easy. They have both an index and a data stream, and it is not clear what the best way forward is, so they decide to follow what feels like the easiest option: delete & restore.
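For reference, the cat indices API can filter on health, so the affected indices can be listed in a single request. The index names below are made up for this story:

```
GET /_cat/indices?v&health=red&h=index,health,status,pri,rep
```

```
index                                health status pri rep
my-index                             red    open   1   1
.ds-my-data-stream-2099.12.05-000042 red    open   1   1
```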
Action
Then comes the execution: actually deleting and restoring. The engineer follows the steps in the guide (roughly sketched below) and now they wait... This operation takes a while because the data stream contained a lot of data, but at least soon everything will be fine.
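In request form, the delete & restore path boils down to something like this sketch. The repository, snapshot, index, and data stream names are placeholders; in a real incident they come from your own snapshot repository:

```
# Delete the red index and the affected data stream first;
# a restore cannot overwrite an existing open index.
DELETE /my-index
DELETE /_data_stream/my-data-stream

# Restore both from the most recent healthy snapshot
POST /_snapshot/my_repository/snapshot_2099.12.05/_restore
{
  "indices": "my-index,my-data-stream"
}
```

While waiting, the progress of the restore can be followed with GET _cat/recovery?v.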
Issue resolution
And soon enough everything is.
But what if there had been a troubleshooting guide? Was this the best approach? Almost... The engineer did the best they could under the circumstances, but there was a quicker way to resolve the issue. Steps 7 & 8 in the "Restore from snapshot" troubleshooting guide offer an alternative route (sketched after the list below) with the following benefits:
- No extra data is deleted, which reduces the restore time.
- The data stream can keep ingesting data while the snapshot is being restored, which will likely keep the rest of the ecosystem running more smoothly.
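I won't reproduce the guide here, but the idea behind that alternative route, as far as the data stream is concerned, is roughly this: roll the data stream over so it gets a fresh, healthy write index, then delete and restore only the red backing index. A sketch with placeholder names (the guide itself has the exact steps):

```
# Roll the data stream over: a new backing index becomes the write
# index, so ingestion can continue immediately
POST /my-data-stream/_rollover

# The red backing index is no longer the write index, so it can be
# deleted on its own...
DELETE /.ds-my-data-stream-2099.12.05-000042

# ...and restored from the snapshot without touching the rest of
# the data stream
POST /_snapshot/my_repository/snapshot_2099.12.05/_restore
{
  "indices": ".ds-my-data-stream-2099.12.05-000042"
}
```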
The troubleshooting guides are specialized and, to the best of our efforts, concise and complete. Our hope is that they will guide you from the analysis to the resolution of an issue without assuming prior knowledge or making you look elsewhere for information. You will find guides for many common issues, such as shard allocation issues, disk space issues, and many others.
Have a great season!