Splitting up "DevOps" and my vision for next steps

braden · May 21, 2021, 7:16pm

Hey team,

During recent 121s, a couple people brought up with me the issue of lack of strong leadership around devops and the overall messy situation we have today. So I’ve been thinking about it this week and now I want to summarize some of the problems and a proposed solution.

Ticket for this post: MNG-2210.

Feel free to skip the challenges part if you already know all these things

OCIM Challenges

Ocim is used and developed by everyone but often not clearly owned by anyone, and though we occasionally have discoveries where we agree on the next steps of its evolution, it is currently a chimera with a lot of different ideas stitched together into a hacky system - albeit one that works.

The original idea (I think) was that Ocim would contain the code to automate our entire company, which is why the repo is still called opencraft, which can be confusing. Needless to say, this vision was never realized at all, and the code for automating our company is split into many different projects (Sprints, accounting, crafty 1, crafty 2, crafty 3, …).
Since we thought Ocim might deploy different things besides Open edX, there is a complicated (potentially confusing) abstraction layer that I wrote called Instance, which is subclassed by OpenEdXInstance. Every instance has two different IDs, its OpenEdXInstance ID and its InstanceReference ID, and you have to not confuse them.
Xavier’s original idea was for Ocim to be entirely driven by GitHub Pull Requests, and that every interaction with Ocim would be through comments and commands on GitHub pull requests. I then tried to take it in a different direction, with a more UI-first approach for launching instances and compartmentalizing the PR code into a separate pr_watcher module. Since then I don’t even know what direction has been taken, but needless to say these are very different approaches and they all leave some technical debt.
Ocim was designed to use OpenStack as an abstraction layer so we could deploy to “any OpenStack Cloud host” and not be tied to one provider (OVH). However, due to OpenStack having a lot of issues with complexity, maintenance, design, politics, etc. it turned out that there were virtually no other public OpenStack cloud providers besides OVH at a reasonable price point. So we’ve ended up tied to OVH in the end anyways, and they haven’t been very reliable. In fact some of our servers once went offline in a literal fire.
Ocim has two different UIs, https://console.opencraft.com/ and https://manage.opencraft.com/ which use totally different technology stacks.
Ocim has half-baked support for important things like multi-appserver deployments
Parts of Ocim’s infrastructure are very complex due to redundancy (e.g. the load balancers), but others have single points of failure (all servers for a service in the same data centre).
Ocim is not continuously deployed
Ocim relies on a single cluster of MySQL for all instances, instead of having 1 cluster per N instances.

Of course, I’m focusing on problems here, and any big project inevitably gets technical debt. Don’t forget that Ocim has done a lot of awesome things for us - letting us offer clients fully independent installations of edX (not very shared), letting us do complex rolling upgrades, and (very importantly) providing a reliable sandbox service that has given us an edge in our upstream PRs to edx-platform.

Infrastructure Challenges

Our infrastructure is complex. There are many repos that we use to manage it, and things are deployed in various providers (OVH, SoYouStart, AWS), accounts, data centres, and regions. Documentation is sometimes good, sometimes lacking, but also scattered and can be hard to find (I’ve started working on improving documentation). For security reasons I won’t go into too many details now (this thread is public), but suffice it to say that although things aren’t terrible, they’re a bit messy and confusing, and we can do better.

GitLub

This one is a bit minor but we currently use a mix of GitLab+GitHub and many of our projects are actually on both. It would be good to finish moving everything to GitLab if that’s what we want to do, and use GitHub only for upstream/external projects.

Proposed New Approach

To clean this situation up, we’re going to need leadership (among other things, also buy-in, budget, etc.). But “OpenCraft DevOps” is too big and varied for one person to lead (at least not without a lot of help), so I’d like to propose splitting our devops strategy into three big projects with a person leading each.

Hosting 2.0 Project

(ties to the Ocim Investment Priority)

The person in charge of this project (@shimulch? @giovannicimolin ?) should guide us as a team to:

Adopt & help establish new community standard for containerized deployments (likely Tutor)
Simplify the Ocim codebase as much as possible by leveraging existing projects like Tutor (for dealign with Open edX complexity/upgrades), Terraform (abstracting, defining, and deploying Kubernetes clusters, DNS records, and other infrastructure), and GitLab CI - as shown in my Tutorraform demo
- I’m not opposed to starting a new codebase actually, as long as we can re-use the existing React frontend code.
Make Ocim able to deploy Open edX into containers on any hosting provider that offers managed Kubernetes service - with initial support for DigitalOcean and AWS
Give Ocim the ability to deploy an entire “Open edX Cluster” (Kubernetes + MySQL + MongoDB + S3 + backups) on a chosen provider, so that big customers like HMS can have dedicated clusters, and smaller customers get better performance as we can assign N Open edX instances per cluster and spin up new clusters as we onboard more small hosting clients.
Update Ocim to match the stronger open source project requirements we have established
Make it easier to onboard developers to Ocim, to run it locally, etc.
Consider all our Ocim-related infrastructure (MySQL cluster, MongoDB cluster, RabbitMQ, load balancers, Consul, Prometheus, etc.) part of Ocim - actually part of Ocim’s legacy codebase.
- Ensure everything is documented and defined by Terraform, marked as part of Ocim-legacy, and separate from the rest of our infrastructure (see below).
- Plan to eventually remove all of these things once Ocim has the capability of easily spinning up new clusters that have all of these same services but running on AWS/DigitalOcean.

OpenCraft Tools Project

(ties to the Automation Investment Priority)

This project is focused on deployment, maintenance, and security of the tools that we need to function as a company, including but not limited to:

Jira
Sprints
Mattermost
Discourse Forum
Vault
Google Workspace
Accounting service
Hosting of www.opencraft.com Wordpress site
OpsGenie
logs.opencraft.com
Mailing Lists

The person in charge of this project (@toxinu?) should guide us as a team to:

Completely separate the deployment of these tools from Hosting infrastructure (improves security, redundancy, and management)
Create a single repository that contains clear documentation and terraform scripts that explain and define the deployment. Include clear examples of how to do things like restart Jira.
Finish moving services to OpenCraft’s infrastructure FAL-313 (Move MX from mail.plebia.org, move mailing lists)
Consider moving our infra to another host like DigitalOcean or Hetzner if OVH outages happen again

Custom Hosting Standards Project

This project is focused around our current large clients that use non-Ocim hosting (Campus, HMS, etc.). In practice this just means our AWS clients.

The person in charge of this project (@jill?) should guide us as a team to:

Set standards and documentation for AWS Hosting that all cells/clients can adopt
- And in particular to clearly separate these standards, processes, and infrastructure from the Ocim infrastructure
Convert all AWS deployments to be fully defined by Terraform, i.e. to own and develop our AWS Open edX Terraform Scripts
Get/keep all our AWS clients upgraded to the latest Open edX release and to eliminate requirements for forked versions of Open edX through upstreaming (and plugins / integrated apps/services if necessary)
Plan to move these clients from deploying on AWS VMs to deploying onto Ocim Kubernetes clusters, once Ocim support is ready.

Note that individual client/epic owners in various cells are still responsible for each separate AWS client/project; this project is about supporting them and making that easier/better.

For each of these projects, the idea is to have one person taking leadership, but not necessarily doing the work. So they should be reviewing each sub-epic and each discovery related to these things, but other people/epics/cells can be used to actually do the work, as long as the leader is following, guiding, and reviewing it.

Anyhow, as you can see, my main idea here is just to conceptually split what we currently call “devops” into three clearly separate projects. But I think that really helps us with clarity and direction, and separation of concerns. And this is not the first time we’ve done such a split.

What do you think?

farhaan · May 24, 2021, 9:08am

This was a much-needed breakdown and all the points and every challenge makes total sense. Something that is bothering me is I don’t see a roadmap of how we will achieve this. But probably when we put our heads together we can come up with a roadmap and checkpoints.

I also want to see how do we manage the internal projects that we decided to be working on, with these split and work coming up, how do we prioritize the work.

Both are equally necessary and interesting.

Thanks @braden for putting this out

jill · May 25, 2021, 2:07am

Yes, thank you @braden for laying out the issues here, and dividing them into (still admittedly large) but more manageable chunks. Our lack of direction and leadership on our operational infrastructure is a source of anxiety for me, but it’s also has so much potential for interesting work.

I don’t want to be negative, but I agree. Internal priorities have always been a struggle for us, which is a lot of why we’re at the this point @antoviaque 's investment priorities allow budget for much of this within our teams, but since we’re struggling to expanding our capacity, it doesn’t even feel possible to start planning for these steps yet, let alone setting timelines to action them. We’ve had to push back on new and existing client work, so how can we take on more?

I don’t feel comfortable taking on any more responsibilities at this point I’m at capacity for the number of clients and projects I’m managing, and don’t have brain space for more. If I could hand off some of the dev projects, then I could swap this in, but I can’t just keep adding on.

I’m so sorry… I wish I could say yes!

mtyaka · May 25, 2021, 8:35am

Thanks for putting down these notes and recommendations. The proposed approach sounds great to me!

I think that’s the best option. The move to containerized deployments will make a lot of code obsolete anyway, and it will be easier to use both the knowledge we gained and new open source tools that have emerged since Ocim’s inception if we start from a clean slate.

giovannicimolin · May 25, 2021, 12:55pm

Thanks for putting this down and splitting our DevOps things into distinct areas! Kudos

@jill @braden I’m interested in taking this up in the near future (once LTI and KKUx are delivered).

This addresses my concern about AWS deployments and Ocim

A big for this. I’d increase the scope of this point to also maintain a repo linking all of our plugins/themes/XBlocks/docs (something like GitHub - enaqx/awesome-react: A collection of awesome things regarding React ecosystem).

braden · May 25, 2021, 6:48pm

Perfect, @giovannicimolin! I know you’ve already done a bit of work on this already, so that makes sense Could you create an epic for that? (Or I can if you’d like.)

I know, and I hear you. For now I just want to get us aligned on the direction, and my hope is that we’ll be able to make capacity for these in the coming months. I think you can also see that my thoughts here don’t really involve any new projects or work, but are instead a change how we plan and think about the devops work that we already want to do.

For the capacity issues, since we already have a thread about it, let’s keep the discussion going there.

Also: I am going to be personally focused on two things for the next few weeks: core committers work, and improving our documentation. Once the docs project is wrapped up I can help out with any of these projects that need help.

jill · May 26, 2021, 5:13am

Yay! Glad to hear that @giovannicimolin , you’ll do a great job here.

antoviaque · May 28, 2021, 1:06pm

Thanks @braden – I agree that our DevOps infrastructure and management could use some improvement. And the points you have listed make sense; in particular, having specific people designated to manage and be responsible for well-defined areas seem like the critical point to improve. When something is everyone’s responsibility, it usually end up becoming nobody’s…

That said, the point of availability and priorities for the work you bring up @jill and @farhaan is definitely important to consider too – there will always be more than we want to do than time available to do it. It’s just the reality for any software company or project, and the important and difficult bit is to prioritize, otherwise we try to do everything, and end up completing nothing.

We have priorities for this year already, so I’d like to stick to those. Gladly, they include significant budget size and scope for some of the biggest items listed here – so that’s not incompatible. But I want to make sure we don’t grow that list, at least not until we have completed the work we already set on to do.

What we can do though, is to make sure that the work prioritized is handled according to the types of responsibilities and scope areas that you have defined above @braden – and basically, these new roles would be starting by working on their respective parts of the scope from the 2021 plan, split accordingly.

Does that make sense?

Also, once we know who handles what, it would make sense to update the relevant sections of the handbook, to describe and split the DevOps roles & responsibilities there.

braden · May 28, 2021, 5:27pm

@antoviaque Yes, that makes sense, and I already had that more or less in mind, which is why I made the link between these items and the 2021 plan in my post. It’s fine if we focus on the areas where there is the most overlap with that plan, as long as the overall direction is still taken into account.

The project about standardization of how we host clients on AWS doesn’t directly tie into the 2021 plan but does tie into our general corporate priorities (upstreaming) and technical standards (infrastructure as code, terraform moving forward). So I understand that we won’t have any dedicated budget for it, but I do hope we can continue to make incremental progress on it anytime we’re working on AWS clients and particularly if onboarding any new ones.