DevOps Cell Firefighting

Hi @bebop and @serenity

With my imminent departure, Serenity will be without a backup firefighter. While in Bebop, I acted as a backup, mostly to assist with fires when Serenity folks were unavailable, either unexpectedly or while on holiday.

Since Serenity’s size is capped for the short term, we thought it’d be good to have a discussion around finding an alternative way of helping the cell out with firefighting.

What do y’all think is the best way to handle this going forward?

  1. Bebop FFs double as Serenity FFs, i.e. they will also deal with fires related to infra.
  2. Add another dedicated Serenity FF. This person can either be rotated per sprint or assigned permanently.
  3. Don’t have an extra firefighter at all? Although the implications here are dire.
  4. Increase the FF time for Serenity. Though this won’t help if @gabor goes on holiday/gets sick.
  5. Any other options?

Log time for this discussion on SE-5983.

I like options 1 and 2.

IIRC, we made a clear separation between which fires Serenity covers vs the rest of the cells for the sake of sustainability, so from that perspective option 2 is more favorable than 1. Additionally, I think many of the Bebop members lack context on how to triage and fix some of the issues with the new infrastructure. In the ideal scenario, it would be great if we all learned and got experience, as it would help with firefighting not only the internal infrastructure but also the clients’ infrastructure, but that might be unreasonable/unsustainable for now.

There was an additional benefit of having you, Keith, in Bebop, as you had a lot of context on what was happening in Serenity, especially in regards to Grove, and I feel like you’ve helped a lot with the clients’ setup, without hurting sustainability or stretching Gabor even thinner. So in that regard it would be beneficial to have someone like that going forward.

@keithgg From your personal experience, are there any downsides to that approach (option 2)?

@maxim I can’t think of any, besides being on rotation all the time: if something happens, you know it’s likely gonna be on you. When FFs are rotated, you can get away from that for a sprint or two.

Other than that, I share your preference for option 1 or 2.

Same here about option 1 vs 2 preferences. The 2nd option is more sustainable.

This can be mitigated by a rotating “Serenity backup FF” role, just like the regular FF role, as you mentioned. It would also give everyone an opportunity to gain some insight into infra.

Another vote for option 1 or 2.

In theory option 1 is better because it would allow more people to get familiar with the infra and be able to help, but I suggest we start with option 2 because it’s going to be more efficient in the short term until we get over our sustainability issues.

Thank you all for the input!

It sounds like everyone would be on board with options #1 and #2.

And yes, permanently assigning another dedicated Serenity FF as described in #2 would have a lower impact on sustainability, which makes it the preferable option (at least for now).

So essentially, we’d be continuing with the same firefighting setup as before – 3 rotating FFs with 10h/sprint of firefighting time, plus 1 permanent FF with 6h/sprint of firefighting time.

The permanent FF would be taken off the regular FF rotation schedule.
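
To put numbers on the setup described above, here is a minimal Python sketch that adds up the firefighting time per sprint. The figures (3 rotating FFs at 10h/sprint, 1 permanent FF at 6h/sprint) come from this post; the constant and function names are made up for illustration only.

```python
# Illustrative only: figures are taken from the post above,
# names are hypothetical and not part of any real tooling.
ROTATING_FF_COUNT = 3
ROTATING_FF_HOURS = 10  # hours per sprint, per rotating FF
PERMANENT_FF_COUNT = 1
PERMANENT_FF_HOURS = 6  # hours per sprint for the permanent FF


def total_ff_hours(
    rotating_count: int = ROTATING_FF_COUNT,
    rotating_hours: int = ROTATING_FF_HOURS,
    permanent_count: int = PERMANENT_FF_COUNT,
    permanent_hours: int = PERMANENT_FF_HOURS,
) -> int:
    """Return the total firefighting hours budgeted per sprint."""
    return rotating_count * rotating_hours + permanent_count * permanent_hours


if __name__ == "__main__":
    print(f"Total FF budget: {total_ff_hours()}h/sprint")
```

With the figures above, that works out to 3 × 10h + 1 × 6h = 36h of firefighting time per sprint.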

@bebop Who would like to take on the permanent FF role?

@bebop Friendly ping on this :arrow_up:

Could you please reply here by Wed EOD, so that we can get a ticket created and assigned for Sprint 303?

@tikr I was hesitant to volunteer, mainly because I don’t feel confident in DevOps. However, if no one else wants to take it, then I would take the role.

Two quick questions:

  • Would it make sense to create some sort of onboarding docs? I feel like they would also be helpful for other members, in case they need to fix issues at some point, and for the future, in case I want someone else to take over the role.
  • What would happen with my DD and FF rotations in Bebop? I feel like it would be even worse in the sprints where I have both of those at once (e.g. next one :D).

Thanks for the info @maxim! Since I asked for replies by the end of today, there’s still some time for others to chime in, so let’s see what happens.

In the meantime I’ll answer your questions:

I’m not sure about the need for dedicated onboarding docs. (It might just be me but no matter how good these docs would be, I wouldn’t be able to read over them once and memorize everything… I’d definitely have to go back and reference them again later on :sweat_smile:)

But knowing where to find relevant info in the day-to-day of working in this role would definitely be important. Maybe @gabor and/or @mtyaka could provide a list of relevant resources? We could then put that list into the recurring ticket for Bebop’s permanent FF.

For FF rotations: As mentioned above, the permanent FF would be taken off the regular FF rotation schedule.

For DD rotations, I think it would depend on the default criteria. I.e., if the permanent FF is full in terms of client and/or epic ownership, they’d get taken off the rotation, but if they’re not, they’d stay on (just like the other members of the cell).

@maxim OK, there were no additional comments so I went ahead and created SE-6003 for you for next sprint.

Thanks for taking this on! :rocket:

CC @bebop @serenity

Serenity’s time is mostly spent on Grove and our older and newer infrastructure (i.e. Ansible-based vs. Terraform-based). The Grove documentation has its own dedicated page (grove.opencraft.com); however, we have no additional sources for the infrastructure.

The new infrastructure is written in Terraform. All modules are documented in their respective readme, and we assign meaningful descriptions to (almost) every variable that is used. I hate to say things like this, but the Terraform code is really self-documenting in this case. Also, there are a bunch of comments in the code where they are needed. Like this comment.

The question is: What documentation are you interested in? We probably have answers for almost everything, but we have no giant knowledge base.

@tikr

@maxim OK, there were no additional comments so I went ahead and created SE-6003 for you for next sprint.

Thank you! I’ll move it to the recurring column. I will also remove myself from Bebop’s FF rotations and keep myself in the DD rotation.

Should I also add myself on perma rotation on OpsGenie? cc. @gabor

I think that’s a good idea, as that rings in case of a fire. I’m on the permanent rotation as well and it is not as disturbing as I first thought. Actually, I turned it off while I was on holiday to be able to rest, so I’ll have to turn it back on. PRO TIP: If you add yourself to the rotation, make sure not to set the rotation for the whole day, otherwise you will have light evenings sometimes.
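
As a rough illustration of the “don’t set the rotation for the whole day” tip, the Python sketch below checks whether a given time falls inside a restricted on-call window. This is not the OpsGenie API, just the underlying idea; the 09:00 to 18:00 window and all names are assumptions picked for the example.

```python
from datetime import time

# Hypothetical on-call window; pick whatever hours suit you.
ON_CALL_START = time(9, 0)
ON_CALL_END = time(18, 0)


def is_on_call(now: time, start: time = ON_CALL_START, end: time = ON_CALL_END) -> bool:
    """Return True if `now` falls inside the on-call window [start, end)."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight (e.g. 22:00 to 06:00).
    return now >= start or now < end
```

With a window like this, an alert at 22:00 would not reach you, whereas a whole-day rotation would.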

@gabor I’ll let @maxim continue that conversation with you. The info from your comment above will serve as a starting point for him, and what’s most relevant will probably become apparent as he starts gaining experience with the role in the coming weeks :slightly_smiling_face:

CC @mtyaka