There have been a few new hiccups with firefighting coverage lately - so far nothing too bad in terms of consequences (no client server was harmed in the process!), but there have been instances where the pager escalated a lot, sometimes without getting picked up. We have gaps during weekends, in some timezones, and when people in the escalation queue are on vacation.
To make sure this doesn’t end up translating into a client outage left unattended, it’s worth tackling the issue properly. We have started discussing it with @tikr and @kaustav on MNG-3213 and came up with a few ideas, but before drafting a handbook PR, it’s worth discussing it as a larger team here.
Tim came up with a nice idea that could help a lot with coverage, with very little change to the current approach:
Perhaps we need to redesign Bebop’s FF rotations so that there is at least one person from each of these timezones on FF duty at all times? For the European and South American timezones, that would mean needing to be on FF duty more frequently than team members located in Indian timezones. But if I’m not mistaken there should be a way to set this up so that nobody would have to be on FF duty for consecutive sprints
| Sprint | India/Pakistan | Europe | South America |
|---|---|---|---|
| A | Team member 1 | Team member 1 | Team member 1 |
| B | Team member 2 | Team member 2 | Team member 2 |
| C | Team member 3 | Team member 3 | Team member 1 |
| D | Team member 4 | Team member 1 | Team member 2 |
| E | Team member 5 | Team member 2 | Team member 1 |
| F | Team member 6 | Team member 3 | Team member 2 |
| | = 1 sprint each | = 2 sprints each | = 3 sprints each |
- Indian timezones: 6 people
- European timezones: 3 people
- South American timezones: 2 people
That should help expand the coverage around the clock, at least during the week.
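To double-check the idea, here is a minimal sketch of the round-robin assignment from the table above. The team sizes and sprint labels come from this post; the names `TEAM_SIZES`, `build_rotation` and `has_consecutive_duty` are made up for illustration and don't correspond to any existing tooling.

```python
from itertools import cycle, islice

# Team sizes per timezone group, taken from the list above.
TEAM_SIZES = {"India/Pakistan": 6, "Europe": 3, "South America": 2}
SPRINTS = ["A", "B", "C", "D", "E", "F"]


def build_rotation(team_sizes, sprints):
    """Assign one member per timezone group to each sprint, round-robin."""
    rotation = {}
    for group, size in team_sizes.items():
        members = [f"Team member {i}" for i in range(1, size + 1)]
        # Cycle through the group's members until every sprint is covered.
        rotation[group] = list(islice(cycle(members), len(sprints)))
    return rotation


def has_consecutive_duty(assignments):
    """Return True if the same person is on duty two sprints in a row."""
    return any(a == b for a, b in zip(assignments, assignments[1:]))


rotation = build_rotation(TEAM_SIZES, SPRINTS)
for i, sprint in enumerate(SPRINTS):
    print(sprint, *(rotation[group][i] for group in TEAM_SIZES), sep=" | ")
```

With these team sizes, nobody ends up on duty for consecutive sprints. That only holds while every group has at least two people, so a group that shrinks to one person would need a different arrangement.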
While we increase coverage, we might also want to consider reducing the overall volume of firefighter hours - @kaustav mentioned that they are currently a bit high for the amount of firefighting Bebop has to do. So rotations might come up more often, but with less work and fewer alerts to handle per firefighter each time.
As a gentle reminder, when on firefighting duty and planning for vacation:
- Make sure you have a backup who will step in while you are away. Even if the timezone is already covered, there is safety in numbers: it helps ensure that an alert is more likely to be picked up early, or that a firefighter will be available at that time.
- Make sure backups are clearly indicated on the rotations calendar, so we all know quickly who to contact during alert escalations.
- To also help with coverage, try to get a backup from the timezone with the least coverage during that sprint – though again if that’s not possible, redundancy will still be helpful.
There are already reminders about firefighting backups in the vacation checklists, so the plan is only to add some details to the item from the sprint manager role, to ensure consistency - this will work best if everyone keeps these points in mind when looking for rotation backups!
To help keep a closer eye on firefighting, and to better manage alerts that escalate or come up outside of the hours covered by firefighters, we need to consider implementing the firefighting manager role we had discussed earlier. See the discovery document.
Note that this is mainly a coordination role, similar to what @braden and I have been doing - ensuring that firefighting is handled properly, and dealing with escalations that reach us by finding someone to handle the alert rather than actually firefighting it. The only exception would be when nobody can be found to handle a truly critical alert, but in practice that has almost never happened (proof being that I have been doing that role, and I haven’t had to go fix one in years - I wouldn’t even know how anymore!).
Is there anyone who would be willing to handle this? Ideally we would have two firefighting managers, in timezones far apart, as this allows splitting the hours during which escalations need managing, so that they mostly fall during each manager’s day - again a bit like what @braden and I have been doing. Also note that @braden and I would still be available as backups, to increase reliability and coverage further.
A suggestion in our discussion thread was to try to assign these roles to @falcon. This would keep things a bit more fair for Bebop and Serenity, who already handle alerts, while further increasing coverage and redundancy, and still taking into account that Falcon members generally don’t have expertise in projects outside of the cell. It would take advantage of the fact that the role is more about coordination than the actual handling of alerts.
There might be other ways or better ideas though - don’t hesitate to suggest alternatives or additional approaches. Comments welcome!