Firefighting issues - Coverage, escalation & backups

I agree - based on the FF experience in Bebop for the last year or so, I would say 5 hrs x 3 FFs should be sufficient. Adding 1 hour to everyone’s sprint also seems reasonable to me, in the context of the other proposal to ping everyone when FFs are not available. That way, anyone could at least do a basic diagnosis, check logs, etc. before handing over to the FFs/client owners when they come online.

+1 from me on this idea of the chain of escalations with multiple people being pinged to plug any gaps. This second option looks better to me.

Thanks @kshitij @maxim and @tikr for proposing this idea.

Escalation path & firefighters revamp

Alright! The two escalation paths you suggested, @tikr, both have pros and cons, but let’s start with this one:

This is the one that is most in line with having the firefighting manager be more of a coordination role - we want the escalations to go up to that person only once the firefighters, volunteers and self-management/assignment have failed, rather than having that person be the permanent “third firefighter”.

We do need to make sure that we have a way to only ping Bebop members during actual work hours, which can be achieved by having everyone set their hours on a new escalation step, after the initial firefighters/volunteers step.

And to account for this in the sprints, we would plan 1h for it in every Bebop member’s sprint - is there a way to automatically create these tasks for all Bebop members each sprint? Maybe with the exception of firefighters?

Firefighting manager assignment

@rpenido Thank you! That’s great - being in a later timezone, that gives us a good first half of the coverage.

Would anyone from an earlier timezone in @falcon be willing to be the second firefighting manager? @jill @yusuf, would either of you want to take it? Your work hours look like they would complement @rpenido’s hours well.

I think we should limit the number of managers to 2, to avoid dilution of responsibility for that role, which is precisely meant to help with the general dilution effect we see with firefighting. Though the rest of the Falcon members could help by providing backup for the role during vacations?

Firefighting manager billing

Btw, we will also need to figure out how to implement the higher billing rate for the time the firefighting managers would spend answering the pager. Again, it should be rare, as work on the role such as prior coordination, checking the Opsgenie roster every sprint, etc. can be done async during normal work hours - but that part of the role can still happen, so we will need to have it in place from the start. A no-code approach would be to log that time twice: once in the normal task (to ensure it’s also billed when appropriate), and a second time in a specific internal task, which we can also use to track how often firefighting managers are being paged, and the corresponding budget?

Next steps

In an upcoming sprint I will create an MR for the new points discussed here, including:

  • New escalation path
  • Firefighting managers
  • Bebop rotations refactoring
  • Reducing Bebop firefighting hours

And we still need to figure out:

  • Payment of pager time for firefighting managers
  • Task to automatically create 1h tasks for all Bebop non-firefighters each sprint

Missing reviews

@jill @pooja @mtyaka @demid @Cef @rafay @braden Unless I’ve missed it, I haven’t seen your comments in the current thread - could you post them now?

Serenity-specific

@gabor Good to have in mind! Something to put in a new “pages reduction” epic, to schedule after the sandboxes and the Harmony migration are completed? It could also be something to work on iteratively during firefighter hours, when the corresponding alerts come up?

3 Likes

Yes, it is possible to do this with some tweaks to the existing automations. Do we really want to create new tasks for every sprint, though? This will result in automatic spillovers during vacations. We could take one of the following approaches instead:

  1. Create a recurring task for everyone in Bebop. Then, we’ll just reduce the FF ticket estimate by 1h.
  2. Subtract 1h from the remaining hours in SprintCraft. We’re already doing this - SprintCraft takes 1h from the remaining time to block it for the sprint management. Again, we would still reduce the FF ticket estimate by 1h.
2 Likes

Sure, I can be a firefighter manager, that makes sense.

Sorry, no, we don’t have an FF training course. And I understand that this is more of a “coordination” role, but I think it’s worth a small Training task for @rpenido to work with someone on Serenity and/or Bebop to learn how to access our infrastructure and client services, and diagnose issues. I could use a refresher too; a lot has changed since I was FF!

The FF manager will need a procedure for escalating incidents, including what to do if no one else is awake/working right now.

Agreed. Escalation rules need to be clear, otherwise it ends up being “no one’s responsibility” :)

Since we can’t log time to the FF tickets, they don’t really serve any purpose, so if SprintCraft can block out the hours, then that’s better.

1 Like

I don’t have a particular preference for the order of the escalation. As long as we have the role of the FF manager clearly defined, I think it would work with either option.

I totally agree with this. Moving our tooling to K8s and letting DigitalOcean manage our databases significantly reduced the alerting. It’s bliss. P&T clients and beta-test Ocim instances were generating many alerts too, so discontinuing that offering made things even better.

However I agree with this too:

Finding out how to connect to, for example, the SprintCraft shell is definitely not easy if one has never interacted with https://gitlab.com/opencraft/ops/infrastructure and doesn’t know that this repo exists.

Is it possible, or does it make sense, to automate the “Update your pager rotation” sprint checklist item? Or make Crafty ping people who forget to set the pager? I can imagine someone not filling in the checklist in time and consequently not setting the pager in time, which may result in escalation. Of course, ideally this shouldn’t happen, but we can add a bit of extra protection by automating this and reducing our checklist a bit.
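
To make the idea a bit more concrete, here is a rough sketch of what such a Crafty check could look like. It assumes the Opsgenie “who is on call” endpoint and a generic chat webhook; the schedule name, webhook URL, API key and sprint dates are placeholders rather than our actual setup, and the exact response format would need to be verified:

```python
# Rough sketch (not a finished Crafty feature): sample the Opsgenie schedule
# over the sprint and flag any hour where nobody is on call, so Crafty can
# ping people who forgot to set their pager rotation.
from datetime import datetime, timedelta, timezone
import requests

OPSGENIE_KEY = "..."                                    # placeholder API key
SCHEDULE = "Firefighters"                               # placeholder schedule name
WEBHOOK_URL = "https://chat.example.com/hooks/crafty"   # placeholder webhook


def on_call_at(when: datetime) -> list[str]:
    """Return the participants on call at `when`, per Opsgenie."""
    resp = requests.get(
        f"https://api.opsgenie.com/v2/schedules/{SCHEDULE}/on-calls",
        params={"scheduleIdentifierType": "name", "date": when.isoformat()},
        headers={"Authorization": f"GenieKey {OPSGENIE_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [p["name"] for p in resp.json()["data"]["onCallParticipants"]]


def find_gaps(start: datetime, end: datetime, step_hours: int = 1) -> list[datetime]:
    """Sample the schedule every `step_hours` and collect times with nobody on call."""
    gaps, when = [], start
    while when < end:
        if not on_call_at(when):
            gaps.append(when)
        when += timedelta(hours=step_hours)
    return gaps


if __name__ == "__main__":
    sprint_start = datetime.now(timezone.utc)
    sprint_end = sprint_start + timedelta(days=14)
    gaps = find_gaps(sprint_start, sprint_end)
    if gaps:
        # Crafty (or any bot) could post this to the devops channel.
        requests.post(WEBHOOK_URL, json={
            "text": f"Pager rotation has {len(gaps)} uncovered hour(s) this sprint - "
                    "please check your Opsgenie rotation!",
        }, timeout=10)
```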

1 Like

I’m so happy to hear this, @gabor and @demid :) And yeah, I’ve noticed the same.

Yes, I just want to emphasize this and make sure everyone is aware of it. Sometimes the incident log is not filled out consistently. But the idea is: most issues we see have happened before, so you can fix them much more easily by scanning the incident log and reading about how we solved it the last time it happened. For that reason, when you fill out the incident log for anything totally new, please include lots of info about the root cause and how you solved it.

Is this something that the firefighting managers could help to check? (Make sure at the end of the sprint that the incident log was filled out?) (Or is someone already checking that?)

Yep - we have the incidents documentation and (elsewhere) specific instructions on how to fix the mailing list server when it’s blocked by OVH, because that happens so often, as well as for troubleshooting issues with the load balancers (is that still relevant with Grove?). I think these private docs are the best place to put such info for any recurring issues.

Sorry, I’ve been passively reading this thread and appreciating the discussion. I think the plan you’ve proposed sounds good!

3 Likes

Good idea :slight_smile: I don’t think we have anyone checking this.

Sorry about that. I have been following the thread but I had nothing in particular to add.

Big +1 on this. Having handy commands on how to start a diagnosis, a simple restart command, or a link to an exact portion of the documentation can be extremely helpful, especially for newcomers or people not familiar with the impacted component. In other words, a cheat sheet of some sort (I love the K8s one).

I agree that having a FF manager will be helpful :+1:

K8s has quite a learning curve, but once you’re familiar with it, it’s actually not that hard to debug. And moving most of our infra to K8s has greatly reduced the amount of time spent on FF, as others have noticed.

I do think we need to do something about our mail server. It’s a bit ridiculous how it periodically gets blocked by OVH and we need to ask them to unblock it.

Unfortunately, migrating to K8s takes a little bit of preparation, so it is not something we can do on a “side track” while handling an issue :confused: However, putting this on our priority list would be worth it for sure, as that would (hopefully) reduce the issues with the mail server as well.

Sorry, I didn’t have anything in particular to add and have been watching the thread to gain insight, since I have never been a firefighter. However, another approach that I can think of for firefighting is, instead of dividing rotations by day, dividing them by hours within a day - that could increase availability and also cover different timezones.

Handbook merge request

Here is the handbook merge request to implement the points discussed here:

Who would like to review? This can be planned as a proper task for the upcoming sprint.

Remaining tasks

The remaining TODO items to implement the changes from the handbook MR will be the following:

  • @jill & @rpenido to decide on their exact assigned hours for their pager escalation level, and add them to the pager
  • Create an onboarding firefighting task for @rpenido
  • Update Bebop rotations (assignments and hours)
  • Block 1h in every Bebop member’s sprint
  • Lower the firefighting hours for Bebop (aside from the previous point) from 30h to 25h per sprint
  • Add all Bebop members to a new escalation layer in the pager, and get everyone to add their working hours to it
  • Ensure everyone has an “Update your pager rotation” / contact availability sprint checklist item
  • Create recurring “Sprint manager escalation” tasks for sprint managers, to allow logging the time spent dealing with escalations a second time

@rpenido @jill Can I ask you to set this all up, as part of your new role? No rush though, it can be planned calmly for next sprint.

Also @gabor @mtyaka could you create the task about updating your sprint checklist that was discussed above?

Discussions

@jill Thank you! :) Hopefully it won’t be too disruptive for you and @rpenido - I appreciate that you are both willing to take this on. Don’t hesitate to provide feedback on how it goes, and whether any refinements are needed.

+1 - @rpenido would you like to create that task for the upcoming sprint?

Currently, when it escalates to me, the procedure is more or less the following (@braden might have some differences in his approach, which we can integrate into the procedure):

  • Go to the devops channel, mention that the page has escalated, ask if anyone online is available to look at it, and explicitly ping the firefighters who might have missed the page on their phone
  • If nobody answers immediately, check if the page needs urgent attention, or if it’s something that can wait for the firefighters to come back online:
    • if it can wait until then, snooze until the time when the next firefighter rotation starts in Opsgenie
    • if it can’t wait, snooze for 5, 10 or 30 minutes depending on the alert urgency, to give time for someone to see either the chat pings or the pager alert
  • If the alert re-escalates after the snooze, repeat the previous step, but this time also ping @here or @channel depending on the issue’s importance and urgency, to widen the circle of people being pinged
  • If there are still no answers in the chat, it re-escalates, and it’s an important and urgent issue, then it’s time to either start looking at solving the problem personally, or use the contacts spreadsheet to find the right person to call for help.
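
Just to make the branching in the second and third bullets explicit, here is the same snooze decision written out as a tiny, purely illustrative sketch - the urgency labels are my own shorthand, and in practice this stays a judgment call rather than something we would automate:

```python
# Purely illustrative: the snooze-timing branching from the procedure above.
def snooze_minutes(can_wait_for_next_rotation: bool,
                   minutes_until_next_rotation: int,
                   urgency: str) -> int:
    """How long to snooze an escalated page, in minutes."""
    if can_wait_for_next_rotation:
        # Hand the page back to the next firefighter rotation in Opsgenie.
        return minutes_until_next_rotation
    # Otherwise, give people time to notice the chat pings or the pager alert.
    return {"high": 5, "medium": 10, "low": 30}[urgency]
```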

How does that sound? I’ve used that in the PR, but we can revise it as needed, and iterate based on experience.

We could automate it, yes - maybe using the contact spreadsheet availability as the source of truth, and then having a script update the pager duty times based on the rotations? Not sure if the work would be worth it though, as there might be plenty of edge cases to consider, like availability on specific days and vacations, differences between roles and cells, etc. It might be worth starting with a manual version, checked by the firefighting managers to avoid forgetting, and then maybe looking at automating it once we have solidified the process?
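
To illustrate what the manual-first version could start from, here is a rough sketch that only reads a hypothetical CSV export of the contact spreadsheet and prints the per-person restriction the pager should have. The column names are made up, and the actual pager update is deliberately left as a TODO, since that is exactly where the edge cases live:

```python
# Rough sketch of the "spreadsheet as source of truth" idea: compute the daily
# time restriction each member's pager escalation step should use, from a CSV
# export of the contacts/availability spreadsheet (hypothetical column names).
import csv


def desired_restrictions(path: str) -> dict[str, tuple[str, str]]:
    """Map each member to their (start, end) working hours in UTC."""
    restrictions = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Hypothetical columns: "member", "work_start_utc", "work_end_utc"
            restrictions[row["member"]] = (row["work_start_utc"], row["work_end_utc"])
    return restrictions


if __name__ == "__main__":
    for member, (start, end) in desired_restrictions("availability.csv").items():
        # TODO: push this to the pager (e.g. via the Opsgenie schedule/rotation
        # API) instead of printing; for a manual first version, the firefighting
        # managers could simply compare this output against the pager config.
        print(f"{member}: restrict pages to {start}-{end} UTC")
```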

Yup, that is indeed part of the approach :slight_smile: See for example what has been mentioned above about reorganizing firefighting rotations in Bebop around timezone coverage, or having two firefighting managers, each covering half of the timezones.

Serenity-specific

@Fox @gabor @mtyaka For the task about updating your checklist, you might already be aware of this, but note that there is already a specific checklist for your cell at Cell-Specific Rules - OpenCraft Handbook. It would be worth starting with a PR to update that page of the handbook, before applying the resulting specific checklist in Listaflow?

2 Likes

Sure, it is done and scheduled for the next sprint! Ref1, Ref2

2 Likes

Sure! I will look into that next week!

2 Likes

We can review your MR as part of FAL-3574 Implement Firefighter Manager role.

I’ve created FAL-3575 for this. Can someone in Bebop volunteer to review and help mentor @rpenido on firefighting (blocking out time with Crafty)?

1 Like

Did you mean firefighting managers and not sprint managers?

That link points to the checklist for epic planning and sustainability management :slight_smile:

Serenity’s cell-specific definitions of the sprint planning manager and sprint manager roles are here and here. They currently don’t include any checklists.

CC @Fox @mtyaka @gabor

@tikr True, I missed that this wasn’t for all the roles - and that this one is actually more your checklist, since you took the epic planning & sustainability role for Serenity :) Though @mtyaka is your backup, so it’s also still potentially part of his checklist. In any case, if we are doing custom checklists for Serenity, it’s worth having a unified approach/place to store them.

@tikr & @antoviaque Could you please confirm that the new escalation policy’s “On call users in Firefighters, if not acknowledged” step does not apply to Serenity during the night (read: literally during the night, as outside of business hours is a different story)? I’m asking because:

At the moment, all Serenity members are on a 365-day rotation:

Due to its small size Serenity does not explicitly designate two firefighters each sprint. Instead, each member of the cell allocates a small amount of time for firefighting each sprint, relative to their weekly commitments (in hours): The number of firefighting hours that each cell member allocates should (roughly) match 7.5% of their committed hours.

If we take the new escalation policy enforcement into consideration, it means that Serenity members’ phones could ring 24/7, 365 days a year. This would be a bit overwhelming. Also, Serenity members would have to wake up for “false-positive” alerts, as:

  • Serenity is solely responsible for maintaining OpenCraft’s internal infrastructure and must handle all fires affecting that infrastructure.
  • Firefighting responsibilities for Serenity do not include addressing issues affecting client instances, unless there is an underlying infrastructure problem that is causing these issues and affecting multiple instances at once.

Therefore, even if there is a real fire during the night, if the root cause is not the infrastructure, it is not in Serenity’s scope to handle it.


If I recall correctly, we agreed on this as a company around the time the members of Serenity were moving to Bebop, but I couldn’t find the conversation, so I would need your confirmation again.