Being on-call 24/7 for 18 months straight was one of the most anxiety-inducing responsibilities I've ever had. I had to make sure that I always had my phone within reach, my laptop accessible, and my mind ready to fix any critical system failure. I sometimes jolted up in the middle of the night, believing there was a page on my phone when there wasn't. It wasn't until months later, when my partner pointed out how irritable I'd become, that it dawned on me I had a problem.
If there's one piece of advice I would give to fellow software developers hesitant about on-call duty, it is this: don't do it! Don't take a job that requires it if you're feeling anxious about being on-call. We're blessed with a hot IT labour market right now, so I'm sure you have plenty of options available. However, if you're proceeding with on-call anyway (there are some good reasons to) and want to minimize the drain, continue reading.
What is it?
I define on-call as being on pager duty for a period of time, expected to respond to urgent system issues. Typically, developers would be on rotation for 7 consecutive days of on-call duty out of every 4-6 weeks. On-call is a common practice with software teams that practice DevOps such that "you build it, you run it" is the expectation. In such teams, on-call duty is part of the development process, as opposed to a traditional separation of development team versus operations team.
The weird thing about my anxiety is that I find myself less anxious when there are more alerts happening regularly. One of the worst times was when I was on a family vacation for the first time in a long while, and there hadn't been any alert for weeks prior.
Anxiety is distress or uneasiness of mind caused by a fear of danger or misfortune. (Wikipedia)
According to the American Psychological Association, anxiety differs from stress in that anxiety is defined by a persistent worry that won't go away, whereas stress is typically caused by an external trigger. I am anxious about the prospect of our system going down. But once it's down, I get stressed about debugging the failed system while our customers are pounding on our support team. In fact, I gave a talk about the mental stress of debugging production applications.
Motiva A.I. orchestrates digital marketing campaigns for our enterprise clients. Imagine a scenario where if you sign up for a newsletter at Verizon for Business (one of our customers), you'd expect to receive an email from them. Well, not if Motiva is down. Establishing an on-call practice ensures that we can immediately respond to any service interruptions.
"I can't wait to spend my weekend waiting for a system outage!", said nobody.
Keep in mind that on-call is a last line of defense to keep things running. If high-availability is a business requirement, then that consideration should be an integral part of the product development process right from the start.
For the individual software developer, on-call can be a helpful practice to hone your debugging and software engineering skills. Bugs happen when an unexpected event pushes your known system into an unknown state. Debugging in such an environment can be a helpful process to push your understanding of your systems and your tools. Moreover, I personally adhere to a pain-driven development mentality. Developers that feel the pain caused by the system they built will build better systems.
Dealing with on-call anxiety
As a developer
Seek professional help
If you're feeling anxious about your on-call duty, my first advice is to seek a workplace therapist. Second to that, I recommend working through the Anxiety & Worry Workbook by Clark and Beck.
What ultimately helped me mentally is accepting the fact that the worst that can happen if I miss an alarm or two isn't so bad after all. We might lose a customer or two, which is pretty bad for an enterprise startup with only a handful of customers. But what's worse is getting burnt out and not being able to function anymore. Growing a startup is a marathon, and I shouldn't lose the race over any single bump. Having said that, I understand that the ability to step back is easier said than done. That's why I recommend seeking a therapist or working through that anxiety workbook to ground yourself first. It took me months of actively working on easing my mind before finding my peace.
While I worked on my mental health, here are some small adjustments that I made to improve my well-being.
- changed my pager tone to something more relaxing
- took it easy when I got paged: resolved issues when I could and didn't fret about jumping onto my laptop immediately
- blocked out certain hours in our on-call scheduling system during expected slower weeknights so that I could rest easy. Getting woken up in the middle of the night was my biggest anxiety. This gave me peace of mind.
- this wasn't an option for me, but I wish I could have opted out of on-call duties for a couple months when I was going crazy
Operationally, it would suck to keep getting paged for the same issue. Make sure that every incident is followed by a retrospective with actions to mitigate the problem in the future. We write a one-page incident report for each incident, highlighting who it affected, why it happened, and suggestions for future prevention, with actual tickets scheduled to be worked on. This gives the product team visibility into delivery expectations.
A second operational tip: during an incident, don't try to do too much. Focus on getting the system back up, or at least its mission critical parts, even if they end up limping along. Leave the fixing for the team during work hours. For example, a couple of times I simply paused our affected workers rather than spend any more time debugging. It's easier to explain to our customers that the system stopped working than to explain that it did the wrong thing. This is a debugging tip, but it also helped with my anxiety because it substantially lowered the bar for what needed to be done: from fixing the system to pausing it gracefully.
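To make the "pause instead of fix" idea concrete, here is a minimal sketch of a pause switch that workers consult before taking work. The names (`PauseSwitch`, `run_worker`, the worker group) are illustrative, not Motiva's actual API; in production the flag would live in shared state such as a Redis key or a feature-flag service rather than in memory.

```python
class PauseSwitch:
    """In-memory stand-in for a shared pause flag."""

    def __init__(self):
        self._paused = set()

    def pause(self, group: str) -> None:
        self._paused.add(group)

    def resume(self, group: str) -> None:
        self._paused.discard(group)

    def is_paused(self, group: str) -> bool:
        return group in self._paused


def run_worker(switch: PauseSwitch, group: str, jobs: list):
    """Process jobs unless the group is paused; paused jobs stay queued."""
    done, pending = [], []
    for job in jobs:
        if switch.is_paused(group):
            pending.append(job)  # doing nothing beats doing the wrong thing
        else:
            done.append(f"processed:{job}")
    return done, pending
```

The design choice that matters is that pausing never loses work: jobs stay queued until someone resumes the group during work hours.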
As a manager
Be wary of your bus factor for mission critical components
As a serial technical co-founder, I often find myself taking on-call responsibility for months at a time, simply because there's no one else available. What is different this time is that we had been around for 4 years already when this long on-call stretch started unexpectedly. It happened because our only other senior backend developer left us in mid-2020.
People come and go. It was my fault for letting myself get stuck in this situation. We have other developers on the team, but they've been working on more interesting and valuable parts of our product. I let myself become the only one responsible for our legacy, but still mission critical, components.
the more mission critical a component is, the more eyes should be on it
We categorize our application workflows as i) mission critical, ii) essential, or iii) everything else. Only a couple mission critical workflows on our system are allowed to raise on-call alarms outside of work hours. This reduces support fatigue and focuses our attention on what matters.
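The tiers above can drive paging policy directly: only mission critical workflows may page outside work hours. The tier names mirror the ones in this post, but the work-hours window and the routing function itself are assumptions for the sake of illustration.

```python
from datetime import datetime

WORK_HOURS = range(9, 18)  # assumed 09:00-17:59 local time, Monday-Friday


def should_page(tier: str, now: datetime) -> bool:
    """Decide whether an alert for the given workflow tier pages someone."""
    in_work_hours = now.weekday() < 5 and now.hour in WORK_HOURS
    if tier == "mission_critical":
        return True            # always page, day or night
    if tier == "essential":
        return in_work_hours   # page only during the workday
    return False               # "everything else" goes to a ticket queue
```

Most alerting stacks let you express this same routing declaratively; the point is that the decision of what can wake someone up is made once, ahead of time, not during the incident.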
Motiva has been a remote-first company from the start. One thing we did well early on is to hire across time zones for our engineering team. My previous senior engineer was on the other side of the world from me (not deliberately to such an extreme; suitable partners are just hard to find). That worked wonders for our on-call and support needs. However, that obviously comes with its challenges as we had to rely heavily on asynchronous collaboration methods for actual development work.
Cover each other
I left myself behind in my case. But having lived through this trauma, I wouldn't put this on anyone else on my team. In addition to sharing knowledge across critical components, nobody should be on-call alone. I make sure that we have a clear escalation procedure in place so that whoever is on-call feels supported.
In reality, at a startup there are only so many people who can play musical chairs in a team of ten. That's why we built semi-automated diagnostic tools so that non-developers on our team can help out. The more issues our customer support team can resolve, the fewer tickets need to be escalated to our devs.
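The shape of such a diagnostic tool can be very simple: support staff run a fixed list of health checks and get a concrete next step, so only inconclusive cases reach the on-call developer. The check names and runbook wording below are made up for illustration; they are not Motiva's actual tooling.

```python
def diagnose(checks: dict) -> str:
    """checks maps a check name to a callable returning True when healthy."""
    for name, is_healthy in checks.items():
        if not is_healthy():
            return f"FAIL {name}: follow the '{name}' runbook before escalating"
    return "all checks passed: escalate to the on-call developer"
```

In practice each check might wrap a query against monitoring or a queue-depth threshold; what matters is that every failure maps to a runbook a non-developer can follow.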
On the other end of the escalation ladder, I make sure that our team knows that I'm always available to whoever is on-call. That's the unfortunate responsibility of being the tech lead at a small startup. Luckily, my team only had to call me once outside of my on-call schedule this year.
Set clear expectations
Just like any team process, we make sure that a new engineering team member taking on on-call duty knows they are not expected to fix the world, that it's ok to take it easy, and that they have our support. If anything, we encourage them to err on the side of escalating early for their first few incidents so that we can pair on the incident with them.
Early on in this post, I mentioned that I wouldn't recommend taking a job requiring on-call. If you don't think that on-call is for you, then don't do it. However, this is an unavoidable responsibility if you want to work at a small startup. Something mission critical will eventually fail, and somebody needs to fix it. For such times, I'd rather we have a clear on-call schedule with clear expectations than scramble to fight the fire.
Build robust and maintainable systems, but we don't have time!
The lasting remedy for my on-call anxiety was building confidence in our systems so that I know we'll be fine if I'm unavailable. We dug ourselves out of that technical debt. For two quarters in 2021, we directed all engineering resources to rebuilding the most vulnerable parts of our system as high-availability workflows.
Developers that feel the pain caused by the system they built will build better systems.
Hold on a second! Did I mention that I got stuck with this endless on-call stint because one of our two backend developers left? Between scaling our system to handle the 400% increase in system load this year, hiring for a replacement, building new features, reviewing code, mentoring, ... how did we find time to address this rotting technical debt?
Engineering management in a startup setting is still something I'm learning even after 10 years. In this case, I couldn't take it anymore and had to convince the rest of the company to stop everything and focus on technical debt that had been known for years. The fact that it took my health being at stake before I prioritized it is a failure on my part. Having said that, I've learned a lot over these past few years about maintaining an enterprise system and engineering management. I've been heads-down building Motiva for the past 6 years. If you find this post useful, take a look at my blog. I plan to write regularly in 2022 to capture some of my learnings.