Firefighting issues - Coverage, escalation & backups

There have been a few new hiccups with firefighting coverage lately - so far nothing too bad in terms of consequences (no client server was harmed in the process!), but there have been instances where the pager escalated a lot, sometimes without getting picked up. We have gaps during weekends, in some timezones, and during vacations of people in the escalation queue.

To make sure this doesn’t end up translating into a client outage left unattended, it’s worth tackling the issue properly. We have started discussing it with @tikr and @kaustav on MNG-3213 and came up with a few ideas, but before drafting a handbook PR, it’s worth discussing it as a larger team here.

Bebop rotations refactoring

Tim came up with a nice idea that could help a lot with coverage, with very little change to the current approach:

Perhaps we need to redesign Bebop’s FF rotations so that there is at least one person from each of these timezones on FF duty at all times? For the European and South American timezones, that would mean being on FF duty more frequently than team members located in Indian timezones. But if I’m not mistaken, there should be a way to set this up so that nobody would have to be on FF duty for consecutive sprints.

Sprint   India/Pakistan     Europe             South America
A        Team member 1      Team member 1      Team member 1
B        Team member 2      Team member 2      Team member 2
C        Team member 3      Team member 3      Team member 1
D        Team member 4      Team member 1      Team member 2
E        Team member 5      Team member 2      Team member 1
F        Team member 6      Team member 3      Team member 2
         = 1 sprint each    = 2 sprints each   = 3 sprints each
  • Indian timezones: 6 people
  • European timezones: 3 people
  • South American timezones: 2 people

That should help expand the coverage around the clock, at least during the week.
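
To make the scheduling idea a bit more concrete, here is a quick sketch (purely illustrative - the names and pool sizes below just mirror the table above, and this is not how our rotations are actually configured): a simple per-timezone round-robin already guarantees that nobody ends up on FF duty for two consecutive sprints, as long as each pool has at least two people.

    from itertools import cycle

    # Hypothetical timezone pools, matching the sizes from the table above.
    pools = {
        "India/Pakistan": [f"IN-{i}" for i in range(1, 7)],  # 6 people
        "Europe": [f"EU-{i}" for i in range(1, 4)],          # 3 people
        "South America": [f"SA-{i}" for i in range(1, 3)],   # 2 people
    }

    def build_rotation(pools, sprints):
        """Assign one firefighter per timezone for each sprint, round-robin."""
        iterators = {tz: cycle(members) for tz, members in pools.items()}
        return [{tz: next(it) for tz, it in iterators.items()} for _ in range(sprints)]

    rotation = build_rotation(pools, sprints=6)
    for number, assignment in enumerate(rotation, start=1):
        print(f"Sprint {number}: {assignment}")

    # Since every pool has at least 2 people, round-robin never puts the same
    # person on FF duty for two consecutive sprints within a timezone.
    for previous, current in zip(rotation, rotation[1:]):
        assert all(previous[tz] != current[tz] for tz in pools)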

Reduction of firefighting hours?

While we increase coverage, we might also want to consider reducing the overall volume of firefighter hours - @kaustav mentioned that they are currently a bit high for the amount of firefighting Bebop has to do. So rotations would come around more often, but with less work and fewer alerts to handle per firefighter each time.

Backups during vacation

As a gentle reminder, when on firefighting duty and planning for vacation:

  • Make sure to have a backup who will step in while you are away. Even if the timezone is already covered, there is safety in numbers: it makes it more likely that an alert will be picked up early, and that a firefighter will actually be available at that time.
  • Make sure backups are clearly indicated on the rotations calendar, so we all know quickly who to contact during alert escalations.
  • To also help with coverage, try to get a backup from the timezone with the least coverage during that sprint – though again if that’s not possible, redundancy will still be helpful.

There are already reminders about firefighting backups in the vacation checklists, so the plan is only to add some details to the corresponding item in the sprint manager role, to ensure consistency - this will work best if everyone keeps these points in mind when looking for rotation backups!

Firefighting manager

To help keep a closer eye on firefighting, and to better manage alerts that escalate or come up outside of the hours covered by firefighters, we need to consider implementing the firefighting manager role we had discussed earlier. See the discovery document.

Note that this is mainly a coordination role, similar to what @braden and I have been doing - ensuring that firefighting is handled properly, and dealing with escalations that reach us by finding someone to handle the alert rather than actually firefighting it ourselves. The only exception would be when nobody can be found to handle a truly critical alert, but in practice that almost never happens (proof being that I have been doing that role, and I haven’t had to go fix one in years - I wouldn’t even know how anymore!).

Is there anyone who would be willing to handle this? Ideally we would have two firefighting managers, in timezones far apart, as this allows splitting the hours during which each one manages escalations, keeping them mostly within each manager’s day - again, a bit like what @braden and I have been doing. Also note that @braden and I would still be available as backups, to increase reliability and coverage further.

A suggestion in our discussion thread was to try to assign this role to @falcon. This would keep things a bit fairer for Bebop and Serenity, who already handle alerts, while further increasing coverage and redundancy, and it still takes into account that Falcon members generally don’t have expertise in projects outside of the cell. It would take advantage of the fact that the role is more about coordination than actually handling alerts.

Other suggestions?

There might be other ways or better ideas though - don’t hesitate to suggest alternatives or additional approaches. Comments welcome!

[ Ticket to log time ]

3 Likes

That makes a lot of sense. To be fair, I’m already FF every three sprints or so, so this change wouldn’t make much of a difference for me.

LGTM :+1:

This will be a problem for Americas- and Europe-based people, as it means FF duty for consecutive sprints for whoever takes the backup position during longer vacations.

I remember it’s stated in the handbook somewhere, but Serenity is a bit of a special team from the backups and on-call point of view.

I assume this is mainly meant for Falcon and Bebop. In Serenity, both of us are in the rotation, every day. This means that we are not able to act every time (since we are more or less always on call and we cannot bring our notebooks everywhere), but we try our best.

One way I can see of increasing coverage is to automatically make everyone a firefighter, i.e. add 1 hr to everyone’s sprint for firefighting. This gives us roughly 10-12 hrs of FF time per sprint, plus we can retain the existing firefighters but with fewer hours, around 5 each. This would give us around 20 hours of FF time in total.

The pros are:

  • Increased coverage, with someone in every timezone
  • Everyone can use their FF time to do a bit of FF management, i.e. use their time to ping the main FFs or find someone to take on the task
  • The short 1 hr period might not be enough for most FF tasks, but one can quickly dig into the issue and at least post about it so someone else can take over

The cons are:

  • Reduces the FF role somewhat; if everyone is a FF, no one is focused on it
  • One hour isn’t really enough to do much at all
  • Everyone needs to be on the pager rotation, so more people are pinged or disrupted than necessary. If the pager is set up properly, then at least the core FFs will be the first to be pinged and others will be fallbacks
2 Likes

This idea is worth considering.

The way I see it working is that basically everyone has to set up an OpsGenie rotation. Then we would add a new tier to the escalation chain, so that people are pinged in the following order:

  1. FFs and volunteers
  2. Anyone else in Bebop who is currently on rotation
  3. Management

This will achieve the following:

  • Pings will escalate to Xavier or Braden much more rarely, because in theory the combined rotations of Bebop members should cover 24h, so there will always be someone who should respond (the only exception is if there is only one person in a certain time range, and they are on vacation)
  • There is always a responder who can evaluate the situation and do one of the following:
    • If the issue is not critical (i.e. can wait 6-8 hours), which is 99% of the cases - snooze it and ping relevant people who have the context and/or the FFs
    • If the issue is critical, which is rare (I can’t think of any that have happened in the last year; I’m not saying they don’t happen at all) - attempt to solve it and start waking relevant people up, because if it’s so urgent, it’s probably warranted
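
To make that concrete, here is roughly what the chain could look like as an Opsgenie escalation policy created through the REST API. This is just a sketch: the schedule names and delays are made up, and the exact payload fields should be double-checked against the Opsgenie Escalation API docs before we rely on it.

    import requests

    OPSGENIE_API_KEY = "..."  # placeholder, e.g. pulled from Vault

    # Hypothetical escalation policy mirroring the proposed order:
    # 1. FFs and volunteers, 2. anyone else in Bebop on rotation, 3. management.
    escalation = {
        "name": "Bebop firefighting escalation",
        "rules": [
            {
                "condition": "if-not-acked",
                "notifyType": "default",
                "delay": {"timeAmount": 0, "timeUnit": "minutes"},
                "recipient": {"type": "schedule", "name": "bebop-firefighters"},
            },
            {
                "condition": "if-not-acked",
                "notifyType": "default",
                "delay": {"timeAmount": 10, "timeUnit": "minutes"},
                "recipient": {"type": "schedule", "name": "bebop-rotation"},
            },
            {
                "condition": "if-not-acked",
                "notifyType": "default",
                "delay": {"timeAmount": 30, "timeUnit": "minutes"},
                "recipient": {"type": "schedule", "name": "management"},
            },
        ],
    }

    response = requests.post(
        "https://api.opsgenie.com/v2/escalations",
        json=escalation,
        headers={"Authorization": f"GenieKey {OPSGENIE_API_KEY}"},
    )
    response.raise_for_status()

A nice side effect of expressing it this way is that adding or reordering tiers later (for example, slotting in a firefighting manager tier) would just be a matter of editing the rules list.
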
6 Likes

@paulo On this, I didn’t mean that the backup had to be from a specific timezone - picking a backup firefighter from the timezone with the least coverage should be preferred, but if that doesn’t work out it’s ok to pick a backup from any timezone. It’s still better to have extra firefighters even in well-covered timezones, to get extra redundancy in numbers.

@gabor It’s meant as a company-wide discussion, so it’s useful to be reminded of Serenity’s specific situation, which is important to factor into the discussion. The responsibility for ensuring that alerts are properly addressed falls on the whole team, so we can also consider Serenity’s role in the discussion. How is it going for you and @mtyaka? Any issues noticed with the types of alerts you’re responsible for? Any changes that would be useful?

@kshitij Interesting suggestion! It would certainly help to improve coverage. The challenges might be the ones you mention, especially the dilution of responsibility and the difficulty of doing much in 1h. Maybe something in-between, with one “main” firefighter who takes on larger tasks and coordinates the smaller tasks or alerts that come up while they are off, and the rest of the cell putting in 1h when needed to handle emergencies? In any case, we could always do a test and see how we like it in practice.

Was ‘OVH is on fire’ within the last year or was it the year before? Then again I’m not sure it actually affected anything in production, only on stage. :slight_smile:

This makes sense to me. I haven’t ended up on pager rotation since I went to Deathstar soon after rejoining - so bear with me if I don’t quite understand how it works. But if each team member’s normal waking hours are registered in there, then if the firefighter can’t get to it, someone else can. The fact that everyone knows that others have gotten the ping might still result in that ‘diluted responsibility’ thing.

Is there a way to configure things so that a random person within the window is selected and is considered the ‘point person’, falling over to the next one within a few more minutes? Or would that be too many people to go through? The point being that if you receive the notification, you know that you’re the current person in line to fix/route the problem, and shouldn’t let it fall through, whereas if it pings everyone at once, each person might think someone else ‘has it’.

1 Like

I did a quick check on Opsgenie, and it seems we can have a random person notified. However, I don’t know if it takes availability into account. The alerting rules are “notify on-call users in schedule” or “notify random member of team”. If this random user selection doesn’t take into account whether the user is “on-call”, then it defeats the point somewhat, since that user might not be available, and it can be irritating if that user is sleeping or not in a position to be disturbed. A quick search didn’t reveal much, but we can probably do a test of this.
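
If the built-in random notification doesn’t filter by on-call status, another option might be a small script or integration that first asks the API who is on call right now and only then picks someone at random. Rough sketch below - the schedule name is made up, and the exact response shape should be verified against the Opsgenie “Who is on call” API docs.

    import random
    import requests

    OPSGENIE_API_KEY = "..."  # placeholder

    # Ask Opsgenie who is currently on call for a (hypothetical) schedule,
    # then pick a random person from that list instead of from the whole team.
    response = requests.get(
        "https://api.opsgenie.com/v2/schedules/bebop-rotation/on-calls",
        params={"scheduleIdentifierType": "name", "flat": "true"},
        headers={"Authorization": f"GenieKey {OPSGENIE_API_KEY}"},
    )
    response.raise_for_status()
    on_call = response.json()["data"]["onCallRecipients"]  # e.g. ["alice", "bob"]

    if on_call:
        print("Point person for this alert:", random.choice(on_call))
    else:
        print("Nobody currently on call - escalate to the next tier")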

tl;dr: I think we should finish the infrastructure migration and maybe change the mail archiver, to resolve the vast majority of our alerts. However, the situation is not as bad as it has felt lately.

I was thinking about this lately, and maybe discussed it with someone as well - though I cannot recall who it was, maybe @maxim? At least the way it feels to me, the on-call alerting situation is way better than it was. To elaborate on this:

  • The (still ongoing, but paused) migration from VM-based hosting of our own tooling (forum, mattermost, vault, etc.) to Kubernetes was a great move. I don’t even know if we have gotten any alerts from these services since they were migrated (more than a year ago!).
  • The migration from RabbitMQ to Redis was also a wise move as RabbitMQ used to trigger alerts a lot. Now everything is quiet on that front.
  • Although getting an instance deployed can be a struggle sometimes (I’m currently doing a discovery on eliminating that issue), once an instance is deployed, it is safe and stable. We had issues in the past with the Redis configuration within the clusters, but that story was resolved a long time ago. As a rule of thumb: “If an instance is deployed, it is running safely until we stop it”. Yes, that is the promise of Kubernetes, but still, we are not facing random OVH network issues or similar.

These three items reduced the on-call alerting a lot. However, what usually ends up on our plate are the mail archiver and mailing-related issues. There are probably other “frequent” alerts, but those are the ones that come to mind. Other than that, when we get alerts from our hosting infrastructure, in 99% of cases it is DigitalOcean acting up and causing issues we cannot even fix. There have been 2 incidents since we started using them, so I wouldn’t say that is frequent either.

In my opinion, the reason we feel we have many alerts is the mail server. It reports issues frequently, and they are usually cumbersome to resolve.

I’m trying to keep an eye on client-related alerts as well, and maybe my observations are wrong, but the alerts come from the old infra in 90% of the cases. Also, NewRelic is flapping a lot these days without a reason. Searching for status: closed here shows the recent issues and correlates with what I observed.

2 Likes

I remember having to resolve an issue for SprintCraft, but it didn’t cause an alert, because it’s monitored by Sentry. One issue I had while debugging it was that it seemed like the workers were being killed by the manager, either because of a timeout or for exceeding resources, but I could not find any logs other than standard exit codes. Perhaps it’s my lack of knowledge of k8s.

+1

There is an issue with the workers getting stuck after the deployment of a new edx instance; I don’t know if it was fixed. MM ref. In theory, healthchecks should fail, triggering an alert. For the client I’m managing we haven’t set up monitoring yet, because not all things are configured yet. Idk about the other clients. cc @paulo
Being fixed:

Yeah. There are two types of alerts that I can remember:

  1. Port 25 getting blocked by the hosting provider
  2. Mailman 3 web UI becoming unresponsive

I’ve put experimental changes onto the mail server, which in theory should fix #2.

We don’t have a solution for #1. It doesn’t happen too often, but when it does, it can take days to resolve, and unless someone turns off or reduces the frequency of alerts by tuning the monitoring, it can ping every 15-30 minutes.

@maxim It’s being fixed via BB-8063 :slightly_smiling_face:

CC @gabor

I think that could be where the new firefighting manager role comes in:

  1. FFs and volunteers
  2. Anyone else in Bebop who is currently on rotation
  3. Firefighting manager(s)
  4. Management

With this setup, if neither the main firefighters nor anyone else on rotation responds to an alert [1], it would go to a firefighting manager before escalating to Braden and/or Xavier.

[1] With three FFs and at least 7 additional cell members on rotation, it seems like that would be a rare occurrence, even in situations where some people choose to rely on others to catch the alert.

Another option would be to put firefighting managers second in the chain of escalations:

  1. FFs and volunteers
  2. Firefighting manager(s)
  3. Anyone else in Bebop who is currently on rotation
  4. Management

With this setup, alerts would probably escalate to firefighting managers a bit more often than in the previous scenario. However, they’d still have the advantage of having a lot of options for who to ping for follow-up, and they could easily skip people who wouldn’t be available at the time based on where they’re located. And we wouldn’t necessarily need to find

If neither the main firefighters nor the firefighting managers catch an alert [2], it would probably make sense for it to escalate to multiple people simultaneously.

[2] With three FFs and up to 5 firefighting managers (the original idea was to have Falcon handle this role), it doesn’t seem like that situation would come up very often.

3 Likes

I like the role of a Firefighter Manager, who can play the role that “Management” currently does, but is closer to the action.

This seems to reduce the number of instances in which a whole group of people would be pinged. I like this better than the FFM being in position 3.

Exactly.

It will depend on the number of FF managers. If we go with the 2 proposed (in different timezones), they will always be on duty, without rotation. I don’t think this is good in the long term.

As a Falcon member in an Americas timezone, I volunteer as FF Manager. It is worth mentioning that I have little to no experience with Opencraft infrastructure. Even if the role is more about coordination, I think the FF Manager should be able to diagnose the most common/straightforward cases to avoid escalation (false positives, known occurrences that a simple server restart solves, etc.). Does this make sense?

What is the best way to get this experience? We have a specific FF training course, right? Should the FF Manager shadow an FF on duty to learn from incidents?

2 Likes

From what I understood, the role of the firefighting manager is to be more deliberate about who they ping based on who is awake/available at that time, as opposed to the more “passive” automated pinging with OpsGenie? In that case, I think that makes sense; it gets more people involved, and fewer issues would fall through the cracks. I see this working more for issues that are not critical-must-fix-now, since there can be some time buffer between the alert and when the actual issue is resolved. However, I’m not so sure it would be as helpful for the critical-must-fix-now issues, especially since the FF manager would most likely not be familiar with the infrastructure in place. Though based on what was mentioned in this thread, it seems like those issues are quite rare (thankfully).

Agree. For instance, this sprint we did not have any fires, thanks to our new infrastructure (grove).

+1 This seems like a balanced approach. We could have someone from falcon take on the role of FF manager(s) as suggested, and reduce the number of hours for the main FFs.

I do not have anything to add as I have no experience with the firefighter role in practice yet, but I think @kshitij’s idea sounds good.

I too like this idea. An assigned FF manager might be more ready to respond to alerts than the whole team (members who supposedly already have sprints planned and might not be available to take on unexpected work)

I agree with the Firefighter Manager role :+1:

I think this would make more sense. I also share @rpenido’s concern:

and I think that for the role to work, there should not be many people with that role (at most one per cell)

:+1: to this. Not many alerts are unique, so there is a high chance that the incidents log or Mattermost channel already contains a description of a similar problem (and a solution). Having an initial filter for these cases would be very helpful, as searching for similar errors is usually the first step of the initial diagnosis anyway.

It could also be helpful to prepare a separate document with more detailed descriptions of how to perform an initial diagnosis in different environments (e.g., where to find the configs or logs for a specific service in Open edX or k8s) and how to resolve some common issues. Mattermost debugging threads are usually long, so going through them to compile all the steps during an incident takes a while.