New firefighting manager role

Hey lovely team,

After some discussion with Xavier, and following the “Further delegation of my work & scheduling requests” topic, we started to work on a new role that emerged from a routine Xavier has been doing for a long time.

This new role, named “Firefighting manager” (not a definitive name) is being defined in this discovery.

The first step is to read this discovery and discuss it so that everybody has a comprehensive idea about what it is.
After that, I will raise a pull request against our handbook to add it into our DNA. :slight_smile:

Ticket: FAL-1768

3 Likes

This seems like a solid idea to me. I take it that it would also mean that the automatic escalations would end up in the hands of this person instead of Xavier?

1 Like

@toxinu I left a few comments in the discovery document. I think that we can automate some of the scopes of the new role.

1 Like

It should not be Xavier anymore, for sure. I am just not sure how to address this kind of responsibility compared to every other OpenCraft member.

I mean, logging 25 minutes of work in the middle of the night after being woken up by an alert should not be the same thing as regular time logging. I am not sure about that; maybe that’s why it has always been Xavier who had this role. :sweat_smile:

1 Like

Agreed.

Similarly, having to “be on call” during normally non-working hours raises some personal questions, like “Can I have another drink?”, or “Can I be offline at all?” We’re contractors, so this isn’t straightforward, see e.g. Do You Get Paid for Being on Call?

If anyone on the team is going to take over non-working-hours on-call duty, then there should be some compensation for the imposition, regardless of whether any calls are actually taken. I don’t know what the laws are in various people’s locations, but this bears discussion.

4 Likes

I think we should very narrowly define what this role is to do. I don’t think we should expect this person to be sober all of the time, as we wouldn’t expect them to actually take care of the escalation. They just need to try to find someone to take point on it. We’re international, so someone’s always online. If someone messaged something like:

@fox olease take ffast lok at tshing

I’d look up, see the error flashing, let them know I understood what they were asking me to do and then when they sobered up in the morning they’d be expected to figure out why they had to be roused at the pub.

I’d also be fine with there being a ‘minimum billable’ for this kind of response since it requires interruption of one’s off-time.

:+1: Yes, this would help.

Can OpsGenie do this for us? Its job is to escalate issues. If we all have our working hours in OpsGenie, then the firefighting manager could just arrange the escalation order to put the SFs at the top. Couldn’t we then configure OpsGenie to escalate sensibly through the available people?

2 Likes

@toxinu’s discovery currently says:

“This new role is not about being a new firefighter, the person assigned does not have to be on-call like other firefighters.”

+1 to this idea. But then who’s at the top of the escalation? I like @jill’s idea:

In other words, the FF manager should make sure that at every sprint all hours are covered by somebody. We’ll soon have 4 cells - that means 8 FFs at any given time. It should be doable.

But that still leaves the question:

Currently, an FF is expected to, roughly speaking, be on-call during their work hours. It is set arbitrarily by the firefighter at the beginning of the sprint to whatever they feel comfortable with. I think it should be the FF manager’s job to keep a closer eye on this every sprint, so that:

  • All FFs have actually set their on call times
  • After this is done, all hours are covered
  • The top escalation is a firefighter
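The first two checks could be sketched roughly as follows. This is only an illustration, not something from the discovery document: the names `covered_hours` and `sprint_report`, the placeholder firefighter names, and the idea of expressing on-call windows as UTC hour ranges are all assumptions, and it ignores weekdays versus weekends. The third check (who sits at the top of the escalation) is not modelled here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical format: each firefighter lists on-call windows as
# (start_hour, end_hour) in UTC, end exclusive. A window that wraps past
# midnight is written with start > end, e.g. (22, 6).
Window = Tuple[int, int]


def covered_hours(windows: List[Window]) -> Set[int]:
    """Return the set of UTC hours (0-23) covered by the given windows."""
    hours: Set[int] = set()
    for start, end in windows:
        if start <= end:
            hours.update(range(start, end))
        else:  # window wraps around midnight
            hours.update(range(start, 24))
            hours.update(range(0, end))
    return hours


def sprint_report(on_call: Dict[str, List[Window]]) -> Tuple[List[str], List[int]]:
    """Return (firefighters with no on-call times set, uncovered UTC hours)."""
    missing = sorted(name for name, windows in on_call.items() if not windows)
    covered: Set[int] = set()
    for windows in on_call.values():
        covered |= covered_hours(windows)
    gaps = sorted(set(range(24)) - covered)
    return missing, gaps


# Placeholder data, not real schedules.
missing, gaps = sprint_report({
    "ff_a": [(8, 16)],
    "ff_b": [(14, 22)],
    "ff_c": [],
})
print(missing)  # firefighters who still need to set their hours
print(gaps)     # UTC hours nobody has claimed
```

With the placeholder data, `ff_c` is reported as missing and the uncovered hours are the night hours that neither `ff_a` nor `ff_b` claimed.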

But what about gaps and weekends? (The handbook even explicitly discourages weekend work.) If I were discussing this with a generic consulting customer, I’d simply provide an SLA and charge monthly. Maybe that could be adapted to the OpenCraft Way: people that volunteer for firefighting beyond their normal availability get paid proportionately. :shrug:

I agree that SFs should be at the top of the escalation order, but we can’t ask every SF (6 people right now) to be on-call, even during the weekend, based on an arbitrary rotation.

Today, if an alert pages Xavier or Braden, you usually don’t fix the problem right after the alert. I think most of the time you are paging the SFs again, notifying clients that we are working on the problem, and sometimes starting the investigation.

Why not have, inside the SF rotation, another rotation of at least 2 of the 6 SFs who are at the top of the escalation order and on-call even during the weekend? These “special SFs” should then get special compensation for being on-call outside of their work hours. This means that the role still rotates, but we can still accommodate our personal needs (as we sometimes already do for the FF role).

For example:

- FAL : Geoffrey
- FAL : Samuel     <-- Special FF
- BE  : Giovanni
- BE  : Josh
- SE  : Adolfo     <-- Special FF
- SE  : Daniel

I don’t really know about the legal side. Should we update our recruitment requirements to notify people that they may need to be on-call even during the weekend, or should it be voluntary?

Since we are not building cells based on timezone and people are free to move, we can’t really confirm that this is doable. And what does “the FF manager should make sure” mean? If it is not the case, what should the FF manager do?

+1

I don’t think we can, or really want to, do that. I am happy today, even as an FF, to be able to sometimes treat one alert as less urgent than another in order to finish what I am doing. And also, the FF manager can’t really check that.

1 Like

I agree it would be tough to cover all hours by the regular FF rotation. So +1 to the “Special FF” idea, as long as:

  1. It is voluntary, and not a duty. In other words, not rotated. It should be treated as a task like any other. The FF manager is the epic owner that needs to find an assignee for every sprint.
  2. Firefighters that are on regular call are the top of the escalation. Special FFs are only for covering gaps. Otherwise it’ll be too easy to just rely on the special FFs for everything.
2 Likes

Sorry that it took me some time to look into this. I’ve reviewed the discovery document and it looks good to me. :+1:

I think the only outstanding point is the issue of the top of the escalation chain, right? As correctly noted, in 99% of the cases this isn’t about handling the escalation itself technically, but more about figuring out who can take care of it. I.e., checking whether it’s an alert which needs immediate investigation (if it’s already closed, the investigation can usually be delayed until the next time a firefighter becomes available); if so, pinging the firefighters and/or other people online; and following up until it’s solved (for example when the investigation is passed between people due to availability/timezones). It’s extremely rare to have to do an actual investigation: I think in the past couple of years I only had to do this once or twice. It’s also fine to just live normally during that time (including having fun or drinking). The main thing is to keep a phone charged with OpsGenie, and be willing to jump on the chat if it rings.

Since it’s more of a management role than actual technical firefighting, and with responsibilities that match the firefighting manager role, imho it should be part of that role. Firefighters are already included in the rotation, we would likely not gain much by rotating this role - we would likely end up with firefighters on rotation who forget about it, or don’t know how to handle situations they would encounter very rarely, with very little experience gain over time. Generally, the firefighting manager would better know what to do and who to contact by having continuity on the responsibility. And part of the role during “normal work hours” would be to minimize the chances of an uncaught escalation (for example, by ensuring that the firefighters are on the escalation path, that alerts are acknowledged by them during normal work hours, etc.).

What can help make the few escalations that do get through manageable, though, is to split the 24h shift in two, like we do with @braden: there could be a firefighting manager backup, in a different timezone (the furthest away, the better!), and each would take 12h, thus ensuring that most of the alerts would arrive during the day rather than at night.
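As a rough illustration of the 12h split, a helper could decide who holds the top-of-escalation slot at a given hour. Everything here is an assumption for the sake of the example: the function name `on_duty`, the labels `"manager"` and `"backup"`, and the default start hour, which the two people would agree on themselves.

```python
def on_duty(utc_hour: int, manager_start: int = 6) -> str:
    """Return who covers the given UTC hour under a 12h/12h split.

    manager_start is a hypothetical value: the UTC hour at which the
    firefighting manager's 12h window begins. The backup takes the rest.
    """
    if not 0 <= utc_hour < 24:
        raise ValueError("utc_hour must be in the range 0-23")
    # Hours elapsed since the manager's window opened, wrapping at midnight.
    offset = (utc_hour - manager_start) % 24
    return "manager" if offset < 12 else "backup"


print(on_duty(9))   # inside the manager's window
print(on_duty(21))  # inside the backup's window
```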

For the compensation, I agree that it would make sense to compensate for the fact that, unlike the firefighters who set their own work hours (and thus escalation hours) freely, the firefighting manager and their backup wouldn’t choose when to deal with the alerts, aside from agreeing on their own 12h window. To keep things simple, maybe the firefighting manager and their backup could be logging their time x1.5 or x2 when they have to deal with an escalation reaching their level at the top of the queue?

5 Likes

:+1: to this :slight_smile:

This matches the kind of compensation for overtime offered in Australia, so works for me too :+1:

Thanks for taking these issues into consideration @antoviaque !

2 Likes

Thanks a lot for your replies @antoviaque and @jill . :slight_smile:

Seems fair to me. Can we decide on that value?

We also still need to redefine the escalation policy on OpsGenie. Currently it is:

  • 5m: Firefighters and volunteers
  • 10m: Management (Xavier & Braden)
  • 20m: Management - backup (Xavier & Braden)
  • 30m: Ops review - backup (Tim)

What about this instead?

  • 5m: Firefighters and volunteers
  • 10m: Management (Firefighter manager)
  • 20m: Management - backup (Firefighter manager & Firefighter manager backup)
  • 30m: Ops review - backup (@tikr do you still want to be here? Maybe @antoviaque and @braden, you want to be here too?)
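To make the proposed levels easier to reason about, here is a small sketch that models them as plain data and answers “who has been paged after N minutes without acknowledgement?”. It is not OpsGenie configuration and does not use the OpsGenie API; the names `PROPOSED_POLICY` and `notified_by` are made up for this example, and the last level is labelled generically since who sits there was still being asked.

```python
from typing import List, Tuple

# (minutes without acknowledgement, who gets paged at that point)
PROPOSED_POLICY: List[Tuple[int, List[str]]] = [
    (5, ["firefighters", "volunteers"]),
    (10, ["firefighting manager"]),
    (20, ["firefighting manager", "firefighting manager backup"]),
    (30, ["ops review backup"]),
]


def notified_by(minutes_unacknowledged: int,
                policy: List[Tuple[int, List[str]]] = PROPOSED_POLICY) -> List[str]:
    """List everyone paged so far, in escalation order, without duplicates."""
    notified: List[str] = []
    for delay, recipients in policy:
        if minutes_unacknowledged >= delay:
            for recipient in recipients:
                if recipient not in notified:
                    notified.append(recipient)
    return notified


print(notified_by(12))  # firefighters, volunteers and the firefighting manager
```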

Once we have decided on these last two questions, I will start the handbook pull request and then ask if anybody in the team wants to take this new role and the backup one. Sounds good?

1 Like

Yes, you can leave me on the last level of escalation in case the first levels are unavailable.

@toxinu Yup, you can also leave me at the last level. :+1: And for the time logging for the firefighting manager, let’s put x2.
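For clarity, the x2 rule could be read as the sketch below. The function name and the `reached_manager_level` flag are assumptions; the only things taken from this thread are the multiplier and the condition that it applies to escalations reaching the firefighting manager’s (or backup’s) level.

```python
def logged_minutes(actual_minutes: float,
                   reached_manager_level: bool,
                   multiplier: float = 2.0) -> float:
    """Return the time to log for handling an alert.

    The multiplier only applies when the escalation reached the firefighting
    manager (or their backup); other work is logged as-is.
    """
    if actual_minutes < 0:
        raise ValueError("actual_minutes cannot be negative")
    if reached_manager_level:
        return actual_minutes * multiplier
    return float(actual_minutes)


# The 25-minute night-time example from earlier in the thread.
print(logged_minutes(25, reached_manager_level=True))  # 50.0
```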

@toxinu You can leave me at the last level of escalation as well for now. Thanks for checking :slight_smile:

Once we’ve tested this new process a bit and are happy with how it works, we can consider updating the ops reviewer role and removing any overlap with the firefighting manager role (including being a part of the escalation levels that we have defined in OpsGenie).