Start with On-Call

This is my second post on running a standup; you can read my first post, about walking the board, here.

Being on-call is exhausting

Being on-call can be an isolating, exhausting and thankless task. I’ve had on-call shifts where I was woken up multiple times throughout the night, and by morning I was wiped out and cranky. Worse still, the next night was likely to be more of the same. In these situations the last thing I want to do is spend standup talking about how much progress the team has made on the current projects. I want to talk about why I’ve been getting paged, and how the team can help make this waking nightmare stop.

Start the day discussing the previous on-call shift

Standup should already be one of the first meetings of the day. It helps to get everyone aligned on the ongoing tasks and brings up to speed anyone who may have been on vacation. But before discussing any projects you should start off asking the question:

Did the person on-call get paged, and if so, why?

It is critically important that this question is asked first. If the on-call shift was bad enough, it may require halting all project work and diverting everyone to resolving an ongoing issue. Or maybe the previous shift was so bad that, after they give a hand-off, you want to excuse the person on-call so they can log out and catch up on sleep. Either way, it’s the most important question you can ask for the health of the team, so it should be your first question.

A common pushback is the concern that reviewing the pages may end up taking the entire standup. If this is happening, it is either because you are allowing runaway conversations (which would be better left for post-standup discussion), or because your on-call is in such dire straits that you should probably be canceling all project work and focusing on making it better.

Have an answer for every page

Every pageable event needs to fall into one of the following categories:

This is a known issue for which the team is already working on a solution

These pages can be mostly ignored as the team is already working to resolve the underlying issue. Maybe the services are underprovisioned and the team is working on scaling things up. Or maybe this is a known bug that the team is working on a fix for. Either way, the expectation is that over the next couple of days the underlying issue will be resolved. It would be ideal if these pages could be muted, but that is not always possible as it may hide other issues. Just be honest with yourself and resist the urge to group together pages which are not actually related.

This is a once-in-a-lifetime-event and can be ignored

These should be pretty rare, and what the team defines as a lifetime should increase in duration over time. Initially lifetime can mean “once a month” but this should eventually stretch out to “once a quarter” and then “once a year”. If you are getting paged weekly for anything, that is not a once-in-a-lifetime event.

This is a new issue which needs to be investigated

This is really everything else. Sometimes these are once-in-a-lifetime issues that suddenly start happening more often. Regardless, every new issue should have an investigation ticket created for it.
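To make the three categories concrete, here is a rough sketch of how a page might be sorted during standup. Everything in it is hypothetical: the `Page` record, the `known_issues` set, the `last_seen` lookup and the `lifetime` window are illustrative names, not part of any real paging tool. It also assumes that an alert the team has never seen before counts as a new issue, which is a judgment call your team may make differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Dict, Set


class Category(Enum):
    KNOWN_ISSUE = "known issue, fix already in progress"
    ONCE_IN_A_LIFETIME = "once-in-a-lifetime event, can be ignored"
    NEEDS_INVESTIGATION = "new issue, open an investigation"


@dataclass
class Page:
    alert: str            # hypothetical alert name
    fired_at: datetime    # when the page went out


def categorize(
    page: Page,
    known_issues: Set[str],
    last_seen: Dict[str, datetime],
    lifetime: timedelta,
) -> Category:
    """Sort a page into one of the three standup categories."""
    # The team is already working on a fix for this alert.
    if page.alert in known_issues:
        return Category.KNOWN_ISSUE

    # Seen before, but not within the team's current "lifetime" window.
    previous = last_seen.get(page.alert)
    if previous is not None and page.fired_at - previous >= lifetime:
        return Category.ONCE_IN_A_LIFETIME

    # Assumption: never seen before, or recurring too soon, means investigate.
    return Category.NEEDS_INVESTIGATION
```

The `lifetime` argument is where the stretching happens: start it at roughly a month and widen it to a quarter, then a year, as the pager quiets down.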

Keep small problems small

The investigation of all new issues should take top priority for the on-call person, and possibly for the secondary as well. Failing to properly investigate an issue not only means that the on-call person is likely to get woken up by it again, but it can also be a missed opportunity to deal with an issue before it becomes a major incident. As the backpacking mantra goes, “keep small problems small.”

Offer the person on-call a break

Getting woken up multiple nights in a row can be highly disruptive and ultimately lead to health issues. If the on-call person was woken up between 9pm and 9am, I strongly recommend the primary and secondary rotations switch for the following 9pm to 9am shift. Just knowing that the following night they will be able to get a full night’s sleep can make a world of difference to the person getting woken up every hour.

It also offers a motivation for the secondary to help out as they are going to be on the hook for the next night.
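If you want to make the swap rule mechanical rather than a judgment call, it could be sketched along these lines. The function names are made up for illustration, the timestamps are assumed to be in the on-call person's local time, and treating 9pm as inside the window and 9am as outside it is my own choice of boundary.

```python
from datetime import datetime, time
from typing import Iterable

NIGHT_START = time(21, 0)  # 9pm
NIGHT_END = time(9, 0)     # 9am


def is_overnight(fired_at: datetime) -> bool:
    """True if the page landed in the 9pm to 9am window (it wraps past midnight)."""
    t = fired_at.time()
    return t >= NIGHT_START or t < NIGHT_END


def should_swap_rotations(page_times: Iterable[datetime]) -> bool:
    """Recommend a primary/secondary swap if any page woke the on-call overnight."""
    return any(is_overnight(ts) for ts in page_times)
```

A single overnight page is enough to trigger the swap in this sketch; a team could just as reasonably require two or more.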

Closing Comments

At many organizations the manager for a team is not in the primary on-call rotation, and this level of separation can already cause some resentment in the team. Especially with a high pager volume, the team may feel like the manager is not aware of, or doesn’t care about, the struggles they are going through. Starting the day by focusing on the on-call rotation helps to assure the team that even after a bad night their team is going to be there for them in the morning.

Joshua Gerth
Engineering Manager
Distributed Systems Engineer
Systems Architect

My research interests include big data, language parsing and ray tracing.