Sorry that it took me some time to look into this. I’ve reviewed the discovery document and it looks good to me.
I think the only outstanding point is the top of the escalation chain, right? As correctly noted, in 99% of cases this isn’t about technically handling the escalation itself, but about figuring out who can take care of it. That is: checking whether the alert needs immediate investigation (if it’s already closed, the investigation can usually wait until a firefighter next becomes available); if it does, pinging the firefighters and/or other people who are online; and following up until it’s solved (for example when the investigation is handed over between people due to availability or timezones). It’s extremely rare to have to do an actual investigation yourself. I think I’ve only had to do this once or twice in the past couple of years. It’s also fine to just live normally during that time (including having fun or drinking). The main thing is to keep a phone charged with OpsGenie on it, and to be willing to jump on the chat if it rings.
Since it’s more of a management role than actual technical firefighting, with responsibilities that match the firefighting manager role, imho it should be part of that role. Firefighters are already included in the rotation; we would likely not gain much by rotating this role. We would likely end up with firefighters on rotation who forget about it, or who don’t know how to handle situations they encounter very rarely, with very little experience gained over time. Generally, having continuity in the responsibility means the firefighting manager would know better what to do and who to contact. And part of the role during “normal work hours” would be to minimize the chances of an uncaught escalation (for example, by ensuring that the firefighters are on the escalation path, that alerts are acknowledged by them during normal work hours, etc.).
What can help make the few escalations that do get through manageable, though, is to split the 24h shift in two, like we do with @braden: there could be a firefighting manager backup in a different timezone (the further away, the better!), and each would take 12h, so that most of the alerts would come during the day for whoever is covering, rather than at night.
For the compensation, I agree that it would make sense to compensate for the fact that, unlike the firefighters, who set their own work hours (and thus escalation hours) freely, the firefighting manager and their backup wouldn’t choose when to deal with the alerts, aside from agreeing on their own 12h window. To keep things simple, maybe the firefighting manager and their backup could log their time at 1.5x or 2x whenever they have to deal with an escalation that reaches their level at the top of the queue?