Hey team,
During recent 121s, a couple of people raised with me the lack of strong leadership around devops and the overall messy situation we have today. I’ve been thinking about it this week, and I want to summarize some of the problems and propose a solution.
Ticket for this post: MNG-2210.
Feel free to skip the challenges part if you already know all these things.
OCIM Challenges
Ocim is used and developed by everyone but often not clearly owned by anyone, and though we occasionally have discoveries where we agree on the next steps of its evolution, it is currently a chimera with a lot of different ideas stitched together into a hacky system - albeit one that works.
- The original idea (I think) was that Ocim would contain the code to automate our entire company, which is why the repo is still called `opencraft`, which can be confusing. Needless to say, this vision was never realized at all, and the code for automating our company is split into many different projects (Sprints, accounting, crafty 1, crafty 2, crafty 3, …).
- Since we thought Ocim might deploy different things besides Open edX, there is a complicated (potentially confusing) abstraction layer that I wrote called Instance, which is subclassed by `OpenEdXInstance`. Every instance has two different IDs, its `OpenEdXInstance` ID and its `InstanceReference` ID, and you have to not confuse them.
- Xavier’s original idea was for Ocim to be entirely driven by GitHub Pull Requests, and that every interaction with Ocim would be through comments and commands on GitHub pull requests. I then tried to take it in a different direction, with a more UI-first approach for launching instances and compartmentalizing the PR code into a separate `pr_watcher` module. Since then I don’t even know what direction has been taken, but needless to say these are very different approaches and they all leave some technical debt.
- Ocim was designed to use OpenStack as an abstraction layer so we could deploy to “any OpenStack Cloud host” and not be tied to one provider (OVH). However, due to OpenStack having a lot of issues with complexity, maintenance, design, politics, etc. it turned out that there were virtually no other public OpenStack cloud providers besides OVH at a reasonable price point. So we’ve ended up tied to OVH in the end anyways, and they haven’t been very reliable. In fact some of our servers once went offline in a literal fire.
- Ocim has two different UIs, https://console.opencraft.com/ and https://manage.opencraft.com/ which use totally different technology stacks.
- Ocim has half-baked support for important things like multi-appserver deployments
- Parts of Ocim’s infrastructure are very complex due to redundancy (e.g. the load balancers), but others have single points of failure (all servers for a service in the same data centre).
- Ocim is not continuously deployed
- Ocim relies on a single cluster of MySQL for all instances, instead of having 1 cluster per N instances.
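To illustrate the two-ID point above, here is a minimal sketch of how distinct wrapper types could make it harder to pass one kind of ID where the other is expected. This is not Ocim's actual code: the classes `InstanceReferenceId` and `OpenEdXInstanceId`, the `REF_TO_INSTANCE` mapping, and the `resolve` function are all hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceReferenceId:
    """Hypothetical wrapper for the generic InstanceReference ID."""
    value: int


@dataclass(frozen=True)
class OpenEdXInstanceId:
    """Hypothetical wrapper for the OpenEdXInstance-specific ID."""
    value: int


# Hypothetical lookup table: reference ID -> concrete instance ID.
# The numbers are made up; in a real system this would be a database query.
REF_TO_INSTANCE = {
    InstanceReferenceId(12): OpenEdXInstanceId(7),
}


def resolve(ref_id: InstanceReferenceId) -> OpenEdXInstanceId:
    """Return the concrete instance ID for a reference ID.

    Rejects any other type, so passing the wrong kind of ID fails loudly
    instead of silently fetching the wrong instance.
    """
    if not isinstance(ref_id, InstanceReferenceId):
        raise TypeError(
            f"expected InstanceReferenceId, got {type(ref_id).__name__}"
        )
    return REF_TO_INSTANCE[ref_id]
```

With plain integers, `12` and `7` are interchangeable and a mix-up goes unnoticed. With the wrappers, `resolve(OpenEdXInstanceId(12))` raises a `TypeError`, and a type checker can flag the mistake before the code runs.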
Of course, I’m focusing on problems here, and any big project inevitably gets technical debt. Don’t forget that Ocim has done a lot of awesome things for us - letting us offer clients fully independent installations of edX (not very shared), letting us do complex rolling upgrades, and (very importantly) providing a reliable sandbox service that has given us an edge in our upstream PRs to edx-platform.
Infrastructure Challenges
Our infrastructure is complex. There are many repos that we use to manage it, and things are deployed in various providers (OVH, SoYouStart, AWS), accounts, data centres, and regions. Documentation is sometimes good, sometimes lacking, but also scattered and can be hard to find (I’ve started working on improving documentation). For security reasons I won’t go into too many details now (this thread is public), but suffice it to say that although things aren’t terrible, they’re a bit messy and confusing, and we can do better.
GitLab / GitHub
This one is a bit minor but we currently use a mix of GitLab+GitHub and many of our projects are actually on both. It would be good to finish moving everything to GitLab if that’s what we want to do, and use GitHub only for upstream/external projects.
Proposed New Approach
To clean this situation up, we’re going to need leadership (and, among other things, buy-in and budget). But “OpenCraft DevOps” is too big and varied for one person to lead (at least not without a lot of help), so I’d like to propose splitting our devops strategy into three big projects with a person leading each.
Hosting 2.0 Project
(ties to the Ocim Investment Priority)
The person in charge of this project (@shimulch? @giovannicimolin ?) should guide us as a team to:
- Adopt & help establish a new community standard for containerized deployments (likely Tutor)
- Simplify the Ocim codebase as much as possible by leveraging existing projects like Tutor (for dealing with Open edX complexity/upgrades), Terraform (abstracting, defining, and deploying Kubernetes clusters, DNS records, and other infrastructure), and GitLab CI - as shown in my Tutorraform demo
- I’m not opposed to starting a new codebase actually, as long as we can re-use the existing React frontend code.
- Make Ocim able to deploy Open edX into containers on any hosting provider that offers managed Kubernetes service - with initial support for DigitalOcean and AWS
- Give Ocim the ability to deploy an entire “Open edX Cluster” (Kubernetes + MySQL + MongoDB + S3 + backups) on a chosen provider, so that big customers like HMS can have dedicated clusters, and smaller customers get better performance as we can assign N Open edX instances per cluster and spin up new clusters as we onboard more small hosting clients.
- Update Ocim to match the stronger open source project requirements we have established
- Make it easier to onboard developers to Ocim, to run it locally, etc.
- Consider all our Ocim-related infrastructure (MySQL cluster, MongoDB cluster, RabbitMQ, load balancers, Consul, Prometheus, etc.) part of Ocim - actually part of Ocim’s legacy codebase.
- Ensure everything is documented and defined by Terraform, marked as part of Ocim-legacy, and separate from the rest of our infrastructure (see below).
- Plan to eventually remove all of these things once Ocim has the capability of easily spinning up new clusters that have all of these same services but running on AWS/DigitalOcean.
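The "N Open edX instances per cluster" idea from the list above can be sketched as a simple placement rule: put each new instance on the first cluster that still has room, and start a new cluster when every existing one is full. This is only an illustration under assumptions. The `Cluster` class, the `place_instance` function, the `shared-…` naming, and the capacity of 3 are all hypothetical, and the sketch does not model the actual provisioning, which would go through Terraform and the hosting provider.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical record of one "Open edX Cluster" and what runs on it."""
    name: str
    capacity: int  # the "N" in "N instances per cluster" (assumed value)
    instances: list = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.instances) < self.capacity


def place_instance(clusters, instance_name, capacity=3):
    """Assign an instance to the first cluster with room.

    If every cluster is full, append a new cluster to `clusters` and place
    the instance there. Returns the cluster that received the instance.
    """
    for cluster in clusters:
        if cluster.has_room():
            cluster.instances.append(instance_name)
            return cluster
    # All clusters are full (or none exist yet): record a new one.
    new_cluster = Cluster(name=f"shared-{len(clusters) + 1}", capacity=capacity)
    new_cluster.instances.append(instance_name)
    clusters.append(new_cluster)
    return new_cluster


if __name__ == "__main__":
    clusters = []
    for i in range(4):
        chosen = place_instance(clusters, f"client-{i}")
        print(f"client-{i} -> {chosen.name}")
```

A dedicated cluster for a big customer fits the same model: it is just a cluster that is kept out of the shared list, so no other instance is ever placed on it.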
OpenCraft Tools Project
(ties to the Automation Investment Priority)
This project is focused on deployment, maintenance, and security of the tools that we need to function as a company, including but not limited to:
- Jira
- Sprints
- Mattermost
- Discourse Forum
- Vault
- Google Workspace
- Accounting service
- Hosting of www.opencraft.com WordPress site
- OpsGenie
- logs.opencraft.com
- Mailing Lists
The person in charge of this project (@toxinu?) should guide us as a team to:
- Completely separate the deployment of these tools from Hosting infrastructure (improves security, redundancy, and management)
- Create a single repository that contains clear documentation and Terraform scripts that explain and define the deployment. Include clear examples of how to do things like restart Jira.
- Finish moving services to OpenCraft’s infrastructure FAL-313 (Move MX from `mail.plebia.org`, move mailing lists)
- Consider moving our infra to another host like DigitalOcean or Hetzner if OVH outages happen again
Custom Hosting Standards Project
This project is focused around our current large clients that use non-Ocim hosting (Campus, HMS, etc.). In practice this just means our AWS clients.
The person in charge of this project (@jill?) should guide us as a team to:
- Set standards and documentation for AWS Hosting that all cells/clients can adopt
- And in particular to clearly separate these standards, processes, and infrastructure from the Ocim infrastructure
- Convert all AWS deployments to be fully defined by Terraform, i.e. to own and develop our AWS Open edX Terraform Scripts
- Get/keep all our AWS clients upgraded to the latest Open edX release, and eliminate the need for forked versions of Open edX through upstreaming (and plugins / integrated apps/services if necessary)
- Plan to move these clients from deploying on AWS VMs to deploying onto Ocim Kubernetes clusters, once Ocim support is ready.
Note that individual client/epic owners in various cells are still responsible for each separate AWS client/project; this project is about supporting them and making that easier/better.
For each of these projects, the idea is to have one person taking leadership, but not necessarily doing the work. So they should be reviewing each sub-epic and each discovery related to these things, but other people/epics/cells can be used to actually do the work, as long as the leader is following, guiding, and reviewing it.
Anyhow, as you can see, my main idea here is just to conceptually split what we currently call “devops” into three clearly separate projects. But I think that really helps us with clarity and direction, and separation of concerns. And this is not the first time we’ve done such a split.
What do you think?