Firefighting issues - Coverage, escalation & backups

@paulo On this, I didn’t mean that the backup had to be from a specific timezone - picking a backup firefighter from the timezone with the least coverage should be preferred, but if that doesn’t work out, it’s OK to pick a backup from any timezone. It’s still better to have extra firefighters even in well-covered timezones, to keep some extra redundancy in numbers.

@gabor It’s meant as a company-wide discussion, so it’s useful to be reminded of Serenity’s specifics, which are important to factor into the discussion. The responsibility for ensuring that alerts are properly addressed falls on the whole team, so we can also consider Serenity’s role in the discussion. How is it going for you and @mtyaka? Any issues noticed with the types of alerts you’re responsible for? Any changes that would be useful?

@kshitij Interesting suggestion! It would certainly help to improve coverage. The challenges might be the ones you mention, especially the dilution of responsibility and the difficulty of doing much in 1h. Maybe something in-between, with one “main” firefighter who takes on larger tasks and coordinates the smaller tasks or alerts that come up while they’re off, and the rest of the cell putting in 1h when needed to handle emergencies? In any case, we could always do a test and see how we like it in practice.

Was ‘OVH is on fire’ within the last year, or was it the year before? Then again, I’m not sure it actually affected anything in production, only stage. :slight_smile:

This makes sense to me. I haven’t been on the pager rotation since I moved to Deathstar soon after rejoining, so bear with me if I don’t quite understand how it works. But if each team member’s normal waking hours are registered in there, then if the firefighter can’t get to it, someone else can. The fact that everyone knows that others have gotten the ping might still result in that ‘diluted responsibility’ thing.

Is there a way to configure things so that a random person within the availability window is selected and is considered the ‘point person’, falling over to the next person after a few more minutes? Or would that be too many people to go through? The point being that if you receive the notification, you know that you’re the current person in line to fix/route the problem and shouldn’t let it fall through, whereas if it pings everyone at once, each person might think someone else ‘has it’.


I did a quick check on Opsgenie, and it seems we can have a random person be notified. However, I don’t know if it takes availability into account. The alerting rules are “Notify on-call users in schedule” or “Notify random member of team”. If this random user selection doesn’t take into account whether the user is on-call, then it defeats the point somewhat, since that user might not be available, and it can be irritating if that user is sleeping or otherwise not in a position to be disturbed. A quick search didn’t reveal much, but we can probably do a test of this.
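To make it concrete what we’d want to verify in such a test, here is a rough Python sketch of the selection behaviour we’re after - illustrative pseudologic, not Opsgenie’s actual algorithm, and all the names and hours below are made up: pick randomly only among people whose registered working hours cover the current time, and fall back to the scheduled on-call person otherwise.

```python
# Illustrative sketch only - not Opsgenie's actual selection logic.
# Pick a random responder among people whose registered hours cover "now",
# falling back to the scheduled on-call person if nobody is available.
import random
from datetime import datetime, timezone

# Hypothetical availability windows, expressed as UTC hours.
AVAILABILITY = {
    "alice": range(6, 14),   # 06:00-14:00 UTC
    "bob": range(13, 21),    # 13:00-21:00 UTC
    "carol": range(20, 24),  # 20:00-24:00 UTC
}

def pick_point_person(now: datetime, scheduled_on_call: str) -> str:
    """Return a random available member, or the scheduled on-call as a fallback."""
    hour = now.astimezone(timezone.utc).hour
    available = [name for name, hours in AVAILABILITY.items() if hour in hours]
    return random.choice(available) if available else scheduled_on_call

print(pick_point_person(datetime.now(timezone.utc), scheduled_on_call="alice"))
```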

tl;dr: I think we should finish the infrastructure migration and maybe change the mail archiver, to resolve the vast majority of our alerts. However, the situation is not as bad as it has felt lately.

I was thinking about this lately, and maybe discussed it with someone as well – though I cannot recall who it was, maybe @maxim? At least the way I feel it, the on-call alerting situation is way better than it used to be. To elaborate on this:

  • The (still ongoing, but paused) migration of our own tooling (forum, Mattermost, Vault, etc.) from VM-based hosting to Kubernetes was a great move. I don’t even know if we got any alerts from these services since they were migrated (more than a year ago!).
  • The migration from RabbitMQ to Redis was also a wise move as RabbitMQ used to trigger alerts a lot. Now everything is quiet on that front.
  • Although getting an instance deployed can be a struggle sometimes (I’m currently doing a discovery on eliminating that issue), once an instance is deployed, it is safe and stable. We had issues in the past with Redis configuration within the clusters, but that’s a story resolved a long time ago. As a rule of thumb: “If an instance is deployed, it is running safely until we stop it”. Yes, that is the promise of Kubernetes, but still, we are no longer facing random OVH network issues or similar.

These three items reduced the on-call alerting a lot. However, what’s usually on our plate is the mail archiver and mailing-related issues. There are probably other “frequent” alerts, but that’s what comes to mind for me. Other than that, when we get alerts from our hosting infrastructure, in 99% of cases it is DigitalOcean acting up and causing issues we cannot even fix. There have been 2 such incidents since we started using them, so I wouldn’t say it is frequent either.

In my opinion, the reason we feel we have many alerts is the mail server. It alerts frequently, and resolving the issue is usually cumbersome.

I’m trying to keep an eye on client-related alerts as well, and maybe my observations are wrong, but the alerts come from the old infra in 90% of the cases. Also, New Relic has been flapping a lot lately without a clear reason. Searching for status: closed here shows the recent issues and correlates with what I observed.


I remember having to resolve an issue for SprintCraft, but it didn’t cause an alert, because it’s monitored by Sentry. One issue I had while debugging it was that it seemed like workers were being killed by the manager, either because of a timeout or because they exceeded resource limits, but I could not find any logs other than standard exit codes. Perhaps it’s my lack of knowledge of k8s.
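For what it’s worth, the termination reason usually survives in the pod status even when the logs are gone. A sketch like the one below (the namespace and label selector are placeholders, not the actual SprintCraft deployment values) would show whether the workers were OOMKilled or exited with a plain error:

```python
# Sketch for checking why worker containers were terminated when the logs
# are gone. The namespace and label selector are placeholders, not the
# actual SprintCraft deployment values.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="sprintcraft",                   # placeholder
    label_selector="app=sprintcraft-worker",   # placeholder
)
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated
        if term:
            # term.reason is e.g. "OOMKilled" or "Error"; exit code 137 means SIGKILL.
            print(pod.metadata.name, cs.name, term.reason, term.exit_code)
```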

+1

There is an issue with the workers getting stuck after deployment of a new edX instance; I don’t know if it has been fixed. MM ref. In theory, healthchecks should fail, triggering an alert. For the client I’m managing we haven’t set up monitoring yet, because not everything is configured yet. I don’t know about the other clients. cc @paulo Being fixed:

Yeah. There are two types of alerts that I can remember:

  1. Port 25 getting blocked by the hosting provider
  2. Mailman 3 web UI becoming unresponsive

I’ve put experimental changes onto the mail server, which in theory should fix #2.

We don’t have a solution for #1. It doesn’t happen too often, but when it does, it can take days to resolve, and unless someone turns off or reduces the frequency of alerts by tuning the monitoring, it can ping every 15-30 minutes.
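For reference, the detection side is trivial - a sketch along these lines (the relay hostname is a placeholder, not our actual configuration) is all the check needs to do - so the real pain is the alert cadence and waiting on the provider, not finding the problem:

```python
# Minimal sketch of an outbound port 25 connectivity check; the relay
# hostname is a placeholder, not our actual configuration.
import socket

def port_25_open(host: str = "smtp-relay.example.com", timeout: float = 10.0) -> bool:
    """Return True if an outbound TCP connection to port 25 succeeds."""
    try:
        with socket.create_connection((host, 25), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("port 25 reachable:", port_25_open())
```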

@maxim It’s being fixed via BB-8063 :slightly_smiling_face:

CC @gabor

I think that could be where the new firefighting manager role comes in:

  1. FFs and volunteers
  2. Anyone else in Bebop who is currently on rotation
  3. Firefighting manager(s)
  4. Management

With this setup, if neither the main firefighters nor anyone else on rotation responds to an alert [1], it would go to a firefighting manager before escalating to Braden and/or Xavier.

[1] With three FFs and at least 7 additional cell members on rotation, it seems like that would be a rare occurrence, even in situations where some people choose to rely on others to catch the alert.

Another option would be to put firefighting managers second in the chain of escalations:

  1. FFs and volunteers
  2. Firefighting manager(s)
  3. Anyone else in Bebop who is currently on rotation
  4. Management

With this setup, alerts would probably escalate to firefighting managers a bit more often than in the previous scenario. However, they’d still have the advantage of having a lot of options for who to ping for follow-up, and they could easily skip people who wouldn’t be available at the time based on where they’re located. And we wouldn’t necessarily need to find a way to register everyone’s availability in the paging tool first.

If neither the main firefighters nor the firefighting managers catch an alert [2], it would probably make sense for it to escalate to multiple people simultaneously.

[2] With three FFs and up to 5 firefighting managers (the original idea was to have Falcon handle this role), it doesn’t seem like that situation would come up very often.
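To make the two orderings concrete, here is a minimal sketch of the escalation chains as data - the delays are made-up placeholders, not part of the proposal - showing that the only difference is whether the firefighting managers come before or after the rest of the Bebop rotation:

```python
# Minimal sketch of the two proposed escalation orderings; the delays (in
# minutes of an alert staying unacknowledged) are made-up placeholders.
OPTION_1 = [
    ("FFs and volunteers", 0),
    ("Anyone else in Bebop on rotation", 15),
    ("Firefighting manager(s)", 30),
    ("Management", 45),
]

OPTION_2 = [
    ("FFs and volunteers", 0),
    ("Firefighting manager(s)", 15),
    ("Anyone else in Bebop on rotation", 30),
    ("Management", 45),
]

def current_escalation_target(minutes_unacked: int, chain=OPTION_1) -> str:
    """Return who should currently be paged for an alert unacked this long."""
    target = chain[0][0]
    for who, delay in chain:
        if minutes_unacked >= delay:
            target = who
    return target

print(current_escalation_target(20, OPTION_1))  # "Anyone else in Bebop on rotation"
print(current_escalation_target(20, OPTION_2))  # "Firefighting manager(s)"
```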


I like the idea of a Firefighter Manager, who can play the role that “Management” currently does, but is closer to the action.

This seems to reduce the number of instances where a whole group of people would be pinged. I like this better than the FF manager being at step 3.

Exactly.

It will depend on the number of FF managers. If we go with the 2 proposed (with different timezones), they will always be on duty, without rotation. I don’t think this is good in the long term.

As a Falcon member in the Americas timezone, I volunteer as FF Manager. It is worth mentioning that I have little to no experience with the OpenCraft infrastructure. Even if the role is more about coordination, I think the FF Manager should be able to diagnose the most common/straightforward cases to avoid escalation (false positives, known occurrences that a simple server restart solves, etc.). Does this make sense?

What is the best way to get this experience? We have a specific FF training course, right? Should the FF Manager shadow an FF on duty to learn from incidents?


From what I understood, the role of the firefighting manager is to be more deliberate about who they ping based on who is awake/available at that time, as opposed to the more “passive” automated pinging with Opsgenie? In that case, I think that makes sense; it gets more people involved, and fewer issues would fall through the cracks. I see this working more for the non-critical issues, since there can be some time buffer between the alert and when the actual issue is resolved. However, I’m not so sure it would be as helpful for the critical-must-fix-now issues, especially since the FF manager would most likely not be familiar with the infrastructure in place. Though based on what was mentioned in this thread, it seems like those issues are quite rare (thankfully).

Agreed. For instance, this sprint we did not have any fires, thanks to our new infrastructure (Grove).

+1 This seems like a balanced approach. We could have someone from Falcon take the role of FF manager(s) as suggested and reduce the number of hours for the main FFs.

I do not have anything to add as I have no experience with the firefighter role in practice yet, but I think @kshitij’s idea sounds good.

I too like this idea. An assigned FF manager might be more ready to respond to alerts than the whole team (members who supposedly already have sprints planned and might not be available to take on unexpected work)

I agree with the Firefighter Manager role :+1:

I think this would make more sense. I also share @rpenido’s concern:

and I think that for the role to work, there should not be many people with that role (at most one per cell)

:+1: to this. Not many alerts are unique, so there is a high chance that the incidents log or Mattermost channel already contains a description of a similar problem (and a solution). Having an initial filter for these cases would be very helpful, as searching for similar errors is usually the first step of the initial diagnosis anyway.

It could also be helpful to prepare a separate document with more detailed descriptions of performing an initial diagnosis in different environments (e.g., where you can find the configs or logs for the specific service in Open edX or k8s) and for resolving some common issues. Mattermost debugging threads are usually long, so going through them to compile all steps during the incident takes a while.

I agree. Based on the FF experience in Bebop for the last year or so, I would say 5 hrs x 3 FFs should be sufficient. Adding 1 hour to everyone’s sprint also seems reasonable to me, in the context of the other proposal to ping everyone when the FFs are not available. That way, anyone could at least do a basic diagnosis, check logs, etc. before handing over to the FFs/client owners when they come online.

+1 from me on this idea of the chain of escalations with multiple people being pinged to plug any gaps. This second option looks better to me.

Thanks @kshitij @maxim and @tikr for proposing this idea.

Escalation path & firefighters revamp

Alright! The two escalation paths you suggested, @tikr, both have pros and cons, but let’s start with this one:

This is the one that is most in line with having the firefighting manager be more of a coordination role - we want escalations to go up to that person only once the firefighters, volunteers and self-management/self-assignment have failed, rather than having that person be the permanent “third firefighter”.

We do need to make sure that we have a way to only ping Bebop members during their actual work hours, which can be achieved by having everyone set their hours on a new escalation step, after the initial firefighters/volunteers step.

And to account for this in the sprints, we would plan 1h for it in every Bebop member’s sprint - is there a way to automatically create these tasks for all Bebop members each sprint? Maybe with the exception of the firefighters?
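If we go that route, it could be as simple as a small script hitting the Jira issue-creation endpoint at the start of each sprint - a rough sketch, assuming time tracking is enabled, with the base URL, credentials, project key, account IDs and sprint name all as placeholders rather than our real setup:

```python
# Rough sketch (not our existing automation) of bulk-creating the 1h
# escalation-buffer tasks via the Jira REST API. Base URL, credentials,
# project key, account IDs and sprint name are placeholders.
import requests

JIRA_URL = "https://example.atlassian.net"
AUTH = ("bot@example.com", "api-token")  # placeholder credentials
BEBOP_NON_FIREFIGHTERS = ["account-id-1", "account-id-2"]  # placeholder account IDs

def create_escalation_task(account_id: str, sprint_name: str) -> None:
    payload = {
        "fields": {
            "project": {"key": "BB"},  # placeholder project key
            "summary": f"Firefighting escalation buffer - {sprint_name}",
            "issuetype": {"name": "Task"},
            "assignee": {"accountId": account_id},
            "timetracking": {"originalEstimate": "1h"},
        }
    }
    response = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH)
    response.raise_for_status()

for member in BEBOP_NON_FIREFIGHTERS:
    create_escalation_task(member, sprint_name="Sprint placeholder")
```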

Firefighting manager assignment

@rpenido Thank you! That’s great - being in a later timezone, you provide a good first half of the coverage.

Would anyone from an earlier timezone in @falcon be willing to be the second firefighting manager? @jill @yusuf, would either of you want to take it? Your work hours look like they would complement @rpenido’s hours well.

I think we should limit the number of managers to 2, to avoid dilution of responsibility for that role, which is precisely meant to help with the general dilution effect we see with firefighting. Though the rest of the Falcon members could help by providing backup to the role during vacation?

Firefighting manager billing

Btw, we will also need to figure out how to implement the higher billing rate for the time the firefighting managers would spend answering the pager. Again, it should be rare, as work on the role such as prior coordination, checking the Opsgenie roster every sprint, etc. can be done async during normal work hours - but that part of the role can still happen, so we will need to have it in place from the start. A no-code approach would be to log that time twice: once in the normal task (to ensure it’s also billed when appropriate), and a second time in a specific internal task, which we could also use to track how much the firefighting managers are being paged, and the corresponding budget?

Next steps

In an upcoming sprint I will create an MR for the new points discussed here, including:

  • New escalation path
  • Firefighting managers
  • Bebop rotations refactoring
  • Reducing Bebop firefighting hours

And we still need to figure out:

  • Payment of pager time for firefighting managers
  • Task to automatically create 1h tasks for all Bebop non-firefighters each sprint

Missing reviews

@jill @pooja @mtyaka @demid @Cef @rafay @braden Unless I’ve missed it, I haven’t seen your comment in the current thread - could you post it now?

Serenity-specific

@gabor Good to have in mind! Something to put in a new “pages reduction” epic, to schedule after the sandboxes and the Harmony migration are completed? It could also be something to work on iteratively during firefighter hours, when the corresponding alerts come up?


Yes, it is possible to do this with some tweaks to the existing automations. Do we really want to create new tasks every sprint, though? That would result in automatic spillovers during vacations. We could take one of the following approaches instead:

  1. Create a recurring task for everyone in Bebop. Then, we’ll just reduce the FF ticket estimate by 1h.
  2. Subtract 1h from the remaining hours in SprintCraft. We’re already doing this - SprintCraft takes 1h from the remaining time to block it for sprint management (see the sketch after this list). Again, we would still reduce the FF ticket estimate by 1h.
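Roughly, option 2 would just extend the existing reservation arithmetic by one more hour - a sketch of the idea only, not SprintCraft’s actual code:

```python
# Sketch of option 2's arithmetic only - not SprintCraft's actual
# implementation: reserve one extra hour per non-firefighter on top of the
# existing sprint-management hour when computing remaining capacity.
def remaining_hours(capacity: float, committed: float, is_firefighter: bool) -> float:
    sprint_management_reserve = 1.0                       # already blocked today
    escalation_reserve = 0.0 if is_firefighter else 1.0   # the new 1h buffer
    return capacity - committed - sprint_management_reserve - escalation_reserve

print(remaining_hours(capacity=30.0, committed=24.0, is_firefighter=False))  # 4.0
```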

Sure, I can be a firefighter manager, that makes sense.

Sorry, no, we don’t have an FF training course. And I understand that this is more of a “coordination” role, but I think it’s worth a small training task for @rpenido to work with someone on Serenity and/or Bebop to learn how to access our infrastructure and client services and diagnose issues. I could use a refresher too; a lot has changed since I was an FF!

The FF Manager will need a procedure for escalating incidents, including what to do if no one else is awake/working at that moment.

Agreed. Escalation rules need to be clear, otherwise it ends up being “no one’s responsibility” :)

Since we can’t log time to the FF tickets, they don’t really serve any purpose, so if SprintCraft can block out the hours, then that’s better.


I don’t have a particular preference for the order of the escalation. As long as the role of the FF manager is clearly defined, I think it would work with either option.