Tech talk/demo: Deploying multiple Open edX instances onto a Kubernetes Cluster with Tutor

@braden thanks for all the work here. :slight_smile: I see one of the advantages of this setup is that we will be free from vendor lock-in. I am very keen to learn how we monitor health and how we can build fault tolerance.

The architecture of the whole package is also fascinating: how can we model the different IDAs to interact with each other, and perhaps increase the resiliency of the system?

I may be sounding naive here, so please pardon me for that. I will probably try to run the whole setup and play around with it to get a better idea.

@braden I actually ended up fitting in a very small task to look at this quickly this sprint. :) Kudos, you seem to have done a lot of the exploratory work to answer the remaining questions regarding using Tutor with Kubernetes, thanks for that!

I haven’t dug deep enough to be able to have a properly informed technical opinion, but it seems like an approach that is consistent with the general direction the project is taking – it heavily relies on existing projects and infrastructure to avoid reinventing the wheel, while providing a standardized way to deploy Open edX in production and potentially at scale.

A few questions remained after my superficial look – they would probably be answered by a deeper review and a better understanding of the different pieces, so some of these will be naive questions, but I thought it might be useful to answer them explicitly:

  • One core issue with Kubernetes is the amount of ongoing maintenance such a cluster is known to require when managed directly – I’m not clear on where that maintenance burden currently lies. Is it on us? On AWS/DigitalOcean? On the maintainer of the Terraform scripts (and would that be us, or an upstream? The upstream seems quite small.) What’s the best way to delegate this as much as possible?
  • What would be Ocim’s role, in this context? I like that it would make its scope smaller, but it would be useful to define it precisely, so it’s clear going forward.
  • Would it be worth posting this video/test on the official forums, as well as a link to the current thread? It could be good to get the opinion of others in the community – including edX who is also evaluating how to use the different pieces available in the community for their infrastructure, and Regis for Tutor.

In any case, kudos for this work, it looks like a great step forward.

Feel free to shut this down – I’m going to create my own to test your repo and docs.

Cool. BTW if you want to test DigitalOcean, perhaps you can create an official “OpenCraft” DigitalOcean account that we can all share for future tests? (And if we decide to start using it for hosting)

Sure, will do, and will put it in Vault once we’ve got the payment method switched off my credit card :slight_smile:


@antoviaque That’s great, thanks for checking it out!

Yup, that’s what I was going for here! This whole experiment is just a bunch of “glue” scripts to make other existing tools work well together.

Yes, Kubernetes can be quite complex, so we definitely want a managed service.

In my (limited) experience, DigitalOcean’s managed Kubernetes service is the nicest offering (and conveniently, it’s also one of the most affordable). DigitalOcean takes care of all the complexity of managing Kubernetes itself, as well as auto-scaling the cluster, running the Kubernetes Dashboard, etc. (They also maintain the official DigitalOcean Terraform provider, which I used here.) I think it’s a pretty nice separation of concerns, where we are responsible for our application + databases, and they are responsible for the Kubernetes cluster itself and everything it relies on.

Amazon EKS is an alternative, but it is much more expensive and still requires you to manage a lot yourself. We’ve been using it in production for several years now for LabXchange, so we have a good understanding of it. Perhaps @toxinu might want to weigh in on how much maintenance has been required lately? In the case of EKS, Amazon manages a much more minimal Kubernetes service, and you have to implement things like auto-scaling, the Kubernetes Dashboard, and other features yourself – so it’s more complicated. It also relies on “community supported” / semi-official Terraform modules like this.

So I think that using DigitalOcean would give us less to manage ourselves than we do today. If we used their managed MySQL service (which is also an option), we could have one less thing to manage.

Things we (and I’m including “Ocim” in this “we”) still have to manage are:

  • MySQL (or just use RDS/DigitalOcean MySQL)
  • MongoDB (currently Tutor deploys a separate MongoDB instance for each Open edX instance, and I don’t think it has automatic backups)
  • HTTPS certs (Tutor deploys a separate load balancer for each instance, which would get expensive; we can use a very similar approach but with one load balancer per cluster, which will be much cheaper)
  • DNS records
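To make the “managed MySQL” option concrete: a rough sketch of provisioning a shared managed MySQL cluster on DigitalOcean with Terraform might look like this. This is only an illustration – the resource name, size, and region are invented – using the official `digitalocean` provider:

```hcl
# Hypothetical sketch: one shared managed MySQL cluster per Kubernetes cluster.
resource "digitalocean_database_cluster" "shared_mysql" {
  name       = "openedx-shared-mysql" # made-up name
  engine     = "mysql"
  version    = "8"
  size       = "db-s-1vcpu-2gb"       # made-up size
  region     = "ams3"
  node_count = 1
}
```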

That’s part of what I’d like the reviewers of this project to comment on. So I don’t have a precise answer to that yet.

In some sense, this project acts like a “mini Ocim”, and provides some overlapping functionality (i.e. deploying and managing lots of Open edX instances) using just GitLab.

Here’s how I’m thinking of the major pieces:

  1. Customer portal: To me the most obvious thing that it lacks which Ocim provides is a secure customer portal that allows customers to deploy their own instances, manage things like theming / billing, etc. So I think that’s a clear role for Ocim going forward.
  2. Orchestration: As for orchestrating upgrades and running the various tools like Terraform or Tutor, we can choose to use GitLab CI to do that as I’ve shown here (with Ocim triggering it as needed using the GitLab API) or implement that in Ocim (which wouldn’t be too hard, as it mostly involves running external commands which Ocim already does). I think we need a bit more discovery/experience with GitLab CI to answer this.
  3. Managing Open edX: Some Ocim code is related to specific nuances of Open edX, like the upgrade process. But Tutor also includes similar code to deal with those things, and has more community usage. So I think we should see whether Tutor will cover our use cases and whether we can have an effective relationship with Tutor as a key upstream project; I think that’s a good goal, and based on my experiments so far it seems possible. If not, however, we’ll need to continue having Ocim handle that.

If we can say that Ocim will mostly focus on (1) and we’ll integrate it with GitLab CI or GoCD for (2), and Tutor for (3), I would say that’s a big win in terms of simplification and re-using existing tools. But I’m not yet clear on how much if any of (2) we can/should move out.
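For (2), the “Ocim triggering GitLab CI as needed using the GitLab API” idea could be sketched like this. This is a hypothetical illustration, not Ocim code – the project ID, token, and variable names are made up; it only builds the request for GitLab’s pipeline-trigger endpoint (`POST /projects/:id/trigger/pipeline`):

```python
# Hypothetical sketch: Ocim kicking off a GitLab CI redeployment pipeline.
from urllib.parse import urlencode

GITLAB_API = "https://gitlab.com/api/v4"

def build_trigger_request(project_id: int, trigger_token: str, ref: str,
                          variables: dict) -> tuple:
    """Return (url, form_body) for POSTing a pipeline trigger to GitLab."""
    url = f"{GITLAB_API}/projects/{project_id}/trigger/pipeline"
    form = {"token": trigger_token, "ref": ref}
    # GitLab expects CI variables as variables[NAME]=value form fields.
    for name, value in variables.items():
        form[f"variables[{name}]"] = value
    return url, urlencode(form)

# Example: redeploy one (hypothetical) instance by name.
url, body = build_trigger_request(
    1234, "s3cret", "main", {"INSTANCE_NAME": "demo-instance"})
```

Ocim would then POST `body` to `url` (e.g. with `requests.post`) and poll the pipeline status via the API.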

Yes! I’m just waiting for our internal reviewers first.


This is really exciting @braden! Both the containerized deployment mechanism, and DigitalOcean as a cloud provider, frankly.

I assume we can run multiple instances/deployments on a single DigitalOcean cluster?

So from what I can see from DigitalOcean, we could:

Alternatively, we can continue to run our haproxy servers and point at the droplets, like we do now.

Yep, need to share scalable instance(s) here, with proper backups. Managed hosting would be dreamy… And even better if we can actually shard our clients over multiple database hosts, so our backups aren’t so monolithic.

Ocim driving Gitlab or GoCD is interesting… We’d still need to link the Customer portal part in.

Would we then:

  • Keep a private configuration for each instance managed by Ocim (as branches of a single private repo, or multiple private repos, whatever)
  • Have Ocim push config changes to git, triggering redeployments
  • Get rid of the redeployment bottlenecks on the Ocim server

I’d rather improve Tutor to manage upgrades better than continue to maintain our own separate upgrade pathways. Then, more people can use our work.

@braden, thanks a lot for this, and in particular, your insights on how Ocim could make use of this going forward. I’m essentially on the same page: let Ocim be an awesome frontend, and delegate everything else.

We can point Open edX/Tutor instances to existing Mongo servers, just like for MySQL. The question, also like for MySQL, is one of provisioning and backups. Who’s going to do it? Ocim? Terraform? Gitlab CI?

This would be fabulous. It would solve one more problem, too: that of revision-controlling config changes. +100

Another +1. Plus, it’s the kind of contribution Regis would be very open to, I think.

Can you keep us posted on this, @jill? I’d like to play around with this next sprint, too. Thanks!

Yes, that’s one of the main goals here and is already working.

Yes, though we can also use DigitalOcean DNS. Either way, Terraform can be used to manage it by code with version control.

It can be simpler than that. The DigitalOcean load balancer essentially just provides an external IP for the cluster and routes inbound traffic to a service that you specify. In the case of Tutor, it routes traffic to Caddy which then provides automatic HTTPS (Let’s Encrypt) and forwards the traffic on to whichever service is appropriate (LMS, Studio, ecommerce, etc.).

The “problem” is that Tutor currently deploys a separate DigitalOcean load balancer ($10/mo each) and Caddy instance for each Open edX instance. We only need one DigitalOcean load balancer and one Caddy instance, and Caddy can then route traffic to every LMS/Studio/etc. as needed. We only need to update the Caddy configuration (or perhaps it can auto-detect, see below), and the DigitalOcean load balancer’s config never changes.

I’m not very familiar with Caddy, but I know it can handle this use case. I have previously used Traefik, which is very similar and has the nice feature that you don’t need a central config file – you just put tags/annotations on the various LMS/Studio/ecommerce/etc. containers that run on your cluster, and it auto-detects them and routes traffic to them. But we’ll probably go with Caddy here (for consistency with Tutor, and because the open source version of Traefik on Kubernetes doesn’t support High Availability when using Let’s Encrypt).
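For illustration, a single shared Caddy instance routing to multiple instances might use a Caddyfile along these lines – domains and service names are invented; Caddy obtains Let’s Encrypt certificates for each site block automatically:

```
# Hypothetical Caddyfile: one Caddy, many Open edX instances.
lms.client-a.example.com {
    reverse_proxy lms-client-a:8000
}
studio.client-a.example.com {
    reverse_proxy cms-client-a:8000
}
lms.client-b.example.com {
    reverse_proxy lms-client-b:8000
}
```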

As far as I know, DigitalOcean load balancers don’t support Let’s Encrypt when used with Kubernetes, so we have to use Caddy for that. But that’s fine.

Yep, it’s all just a single private repo (see how it works when you test this out). To do a change (add a new instance, change the version and re-deploy all instances, etc.) you just open a merge request.

The Tutorraform scripts that I wrote here can be extended to deploy a managed MySQL cluster for each Kubernetes cluster, and to deploy a couple of VMs that we can then deploy Mongo onto. So Terraform + GitLab CI + Ansible.


I have posted on the Open edX forum too.


I agree that the experience of managing Kubernetes clusters on AWS is not that smooth, even if we haven’t used the autoscaling much for LX. I have zero experience with DigitalOcean, but based on your feedback I am already excited to try Kubernetes on DigitalOcean! :smiley:

And I also agree that having to rely on a community-based module is not that great. The DO documentation for their managed Kubernetes service looks pretty sick too. :star_struck:

@braden I plan on deploying the services on my personal accounts as well. No need to keep them up.

I mistakenly put the review ticket in the wrong sprint, so I’ll only review the deployment next sprint :frowning:


Is this the full list of services that need to be pulled out and made shareable in order to de-risk this approach as a cost-effective, containerized solution for Ocim on DigitalOcean?

  • MySQL
  • Mongo
  • load balancer
  • Caddy
  • k8s cluster: can already be shared; just need to set K8S_NAMESPACE to something unique for each deployment.
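On that last point, a hedged illustration of what “set K8S_NAMESPACE to something unique” could look like in practice – the function name and fallback are invented; it just derives a valid namespace (a DNS-1123 label) from an instance name:

```python
import re

def k8s_namespace(instance_name: str) -> str:
    """Turn an instance name into a valid Kubernetes namespace.

    Namespaces must be DNS-1123 labels: lowercase alphanumerics and
    hyphens, at most 63 characters, starting and ending alphanumeric.
    """
    ns = instance_name.lower()
    ns = re.sub(r"[^a-z0-9-]+", "-", ns)  # collapse invalid characters
    ns = ns.strip("-")[:63].rstrip("-")   # enforce length and edge rules
    return ns or "default-instance"       # made-up fallback name
```

For example, `k8s_namespace("client_a.example.com")` yields a label that can be used directly as the namespace for that deployment.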

Credentials are in Vault, but be careful please :grinning_face_with_smiling_eyes: I haven’t shifted billing to OpenCraft yet.

I think so, yes. Perhaps also an S3-like bucket for each instance (likely easier to back up and more reliable than the other option of using Tutor’s MinIO plugin to deploy MinIO).

Ah gotcha… your Terraform deploys a single Spaces bucket (or AWS S3 bucket) for each cluster, but we’d want to move that into Tutor, with one for each instance.

Ooh, and we’ll need log rotation too, for monitoring and analytics/tracking logs.

OpenCraft gets the bills for our DigitalOcean account now. But still be careful :slight_smile:

@braden I started reviewing the implementation and testing things around. Awesome work! The scripts make deploying Open edX a painless experience (compared to the full AWS setup). :rocket:

Getting to a working Open edX instance took me around 2 hours, without any prior knowledge of the approach.
Contrary to @jill, I bumped into a few issues (some related to docs and some related to the setup I used). I went for the AWS deployment, since the DO one was tested by @jill.

Overall, the setup is pretty quick and easy, and it’s possible to reuse the cluster to deploy multiple Open edX instances with just a few lines of code.

I also share the opinion that we should move some of the services from the k8s cluster to managed services (or shared services that we host), namely MySQL, MongoDB and S3 buckets.

Issues during provisioning:

  • Some missing documentation: Ruby needs to be installed, GitLab Operations need to be enabled, and get_kubeconfig_path fails on Linux - but it works if you manually run the code inside the command (will open a PR soon)
  • SSL termination using Caddy didn’t work for me when using an HSTS domain - I ended up provisioning the Open edX instance with no SSL termination, then used my personal reverse proxy setup to link to the instance and provision the certificate.
    • This shows that Tutor already supports delegating the SSL provisioning to an external tool :slight_smile:
  • There’s something wrong with the storage (course import/export is not working, and instructor tasks are failing as well) - @braden @jill did you run into these issues in your setup?

Instance link: - I’ll only keep this up until the end of the sprint - k8s on AWS is expensive.

Infrastructure provisioning vs Instance provisioning

Currently, the provisioning works like this (@braden correct me if I got anything wrong):

  1. Terraform is used to provision the k8s cluster and its dependencies.
    Done once to start up the environment.
  2. Tutor provisions an Open edX instance inside the k8s cluster and sets up routing/load balancer/etc.
    Done every time a change or new instance is deployed.

In order to use shared services, we’ll also need a Terraform step to be run every time a new instance is provisioned.
For that, I think it’s best if we have two separate Terraform repositories: one for the deployment infrastructure and shared services (the k8s cluster and managed databases - the current tutorraform repo), and another for managing instance resources (provisioning the DBs + DB users, S3 buckets + credentials, and so on). This should just be one extra step in the CI pipeline, replacing the current Ocim resource management with a more reliable one (Terraform state), right?

Where does Ocim fit in this?

This approach will simplify Ocim and remove the resource provisioning and management responsibilities from it. I think that the best approach would be to have a system with three moving parts:

  1. Tutorraform: handling the actual provisioning, migration, redeployment and scaling of instances.
  2. Image builder: something to build Open edX images from the base Tutor images + client customizations and themes.
  3. Ocim: a management UI to let users customize their instance, handle billing, and add any XBlocks or extra dependencies. This would then communicate with (1) and (2) to provision the instance or deploy the changes - maybe through git commits to infrastructure/build repos?

Note about (2):
I don’t think this is in the scope of this ticket, but this setup only deals with deploying prebuilt Tutor images of Open edX instances. To move to the containerized approach, we’ll still need to implement an “image builder” to automate client customization deployments, even if it’s still inside Ocim.

@giovannicimolin Thanks for trying it out!

Ah, I thought that, like Python, it was installed by default on most systems; I guess that’s just macOS that does that?

Ah, yeah, I’m pretty sure it uses HTTP-based verification, so if you don’t allow insecure HTTP it won’t be able to get a cert from Let’s Encrypt. But we’re planning to replace the Caddy-per-instance setup with our own load balancer for the cluster, where we can configure additional options.

I didn’t test storage, and I think it’s basically unconfigured. There is a Tutor plugin to use MinIO for storage, but I didn’t test it out. We’ll likely create our own Tutor plugin that provisions a DigitalOcean Spaces / Amazon S3 bucket per instance.

Basically, yes. That per-instance provisioning could be done either in Terraform or in Tutor (with a custom Tutor plugin).

I was thinking of having it be part of the same repo, just a separate folder and separate Terraform module, and having GitLab CI auto-generate the tfvars from the list of instances, which Terraform would then use to provision all the databases etc. But it might be simpler or more efficient to do it as a Tutor plugin; I’m not sure. What I like about the Tutor plugin approach is that it still lets you use GitLab CI to deploy/update just one instance at a time.
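To sketch the tfvars-generation idea: a CI step could render the instance list into a `.tfvars.json` document for the per-instance module. This is a hypothetical illustration – the schema, keys, and function name are all invented:

```python
import json

def instances_to_tfvars(instances: list) -> str:
    """Render a list of instance definitions as a .tfvars.json document
    that a (hypothetical) per-instance Terraform module could consume."""
    tfvars = {
        "instances": {
            inst["name"]: {
                "lms_domain": inst["lms_domain"],
                "openedx_version": inst.get("version", "master"),
            }
            for inst in instances
        }
    }
    return json.dumps(tfvars, indent=2, sort_keys=True)

# Example: GitLab CI would write this out before running `terraform apply`.
print(instances_to_tfvars([
    {"name": "demo", "lms_domain": "demo.example.com"},
]))
```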

If we go with Terraform and we want it to be a separate module (you could actually combine it into the existing Tutorraform Terraform modules if you wanted, but that’s probably not ideal), then yeah, it would be an additional step in the CI pipeline, before Tutor.

Everything you said there makes sense to me :)


For most Linux systems it is, but maybe not for build-your-own systems like Arch, and it might be a pretty old version even so. It wouldn’t surprise me if some distros are still shipping Python 2.7, for instance.

+1 to everything here, and above. I also think there should be a separate repo for shared infra.
