Tech talk/demo: Deploying multiple Open edX instances onto a Kubernetes Cluster with Tutor

I have posted on the Open edX forum too: https://discuss.openedx.org/t/tech-talk-demo-deploying-multiple-open-edx-instances-onto-a-kubernetes-cluster-with-tutor/4641


I agree that the experience of managing Kubernetes clusters on AWS is not that smooth, even though we haven’t used autoscaling much for LX. I have zero experience with DigitalOcean, but based on your feedback I am already excited to try Kubernetes on DigitalOcean! :smiley:

And I also agree that having to rely on a community-maintained module is not that great. The DO documentation for their managed Kubernetes service looks pretty sick too. :star_struck:

@braden I plan on deploying the services on my personal accounts as well. No need to keep them up.

I mistakenly put the review ticket in the wrong sprint, so I’ll only review the deployment next sprint :frowning:

Awesome!

Is this the full list of services that need to be pulled out and made shareable in order to de-risk this approach as a cost-effective, containerized solution for Ocim on DigitalOcean?

  • MySQL
  • Mongo
  • load balancer
  • caddy
  • k8s cluster: can already be shared; just need to set K8S_NAMESPACE to something unique for each deployment.

Credentials are in Vault, but be careful please :grinning_face_with_smiling_eyes: I haven’t shifted billing to OpenCraft yet.

I think so, yes. Perhaps also an S3-like bucket for each instance (likely easier to back up and more reliable than the alternative of using Tutor’s minio plugin to deploy MinIO).

Ah, gotcha… your Terraform deploys a single Spaces bucket (or AWS S3 bucket) for each cluster, but we’d want to move that into Tutor, one for each instance.
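To make that concrete, here’s a minimal sketch of the per-instance bucket in Terraform terms. The digitalocean_spaces_bucket resource type is real, but the variable name, naming scheme, and region are illustrative, and whether this ends up in Terraform or in a Tutor plugin is still an open question:

```hcl
# Hypothetical per-instance storage bucket: one Spaces bucket per
# Open edX instance, named after the instance.
variable "instance_name" {
  type = string
}

resource "digitalocean_spaces_bucket" "instance_storage" {
  name   = "openedx-${var.instance_name}" # illustrative naming scheme
  region = "nyc3"                         # illustrative region
}
```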

Ooh, and we’ll need log rotation too, for monitoring and analytics/tracking logs.

OpenCraft gets the bills for our DigitalOcean account now. But still be careful :slight_smile:

@braden I started reviewing the implementation and testing things around. Awesome work! The scripts make deploying Open edX a painless experience (compared to the full AWS setup). :rocket:


It took me around 2 hours to get to a working Open edX instance, without any prior knowledge of the approach.
Unlike @jill, I bumped into a few issues (some related to the docs and some related to the setup I used). I went for the AWS deployment, since the DO one was already tested by @jill.

Overall, the setup is pretty quick and easy, and it’s possible to reuse the cluster to deploy multiple Open edX instances with just a few lines of code.

I also share the opinion that we should move some of the services from the k8s cluster to managed services (or shared services that we host), namely MySQL, MongoDB and S3 buckets.

Issues during provisioning:

  • Some missing documentation: Ruby needs to be installed, GitLab Operations need to be enabled, and get_kubeconfig_path fails on Linux - but it works if you manually run the code inside the command (will open a PR soon)
  • SSL termination using Caddy didn’t work for me when using an HSTS domain (cimolin.dev) - I ended up provisioning the Open edX instance with no SSL termination, then used my personal reverse proxy setup to link to the instance and provision the certificate.
    • This shows that Tutor already supports delegating the SSL provisioning to an external tool :slight_smile:
  • There’s something wrong with the storage (course import/export is not working, and instructor tasks are failing as well) - @braden @jill Did you run into these issues in your setups?

Instance link: https://test123.cimolin.dev - I’ll only keep this up until the end of the sprint - k8s on AWS is expensive.


Infrastructure provisioning vs Instance provisioning

Currently, the provisioning works like this (@braden correct me if I got anything wrong):

  1. Terraform is used to provision the k8s cluster and its dependencies.
    Done once to set up the environment.
  2. Tutor provisions an Open edX instance inside the k8s cluster and sets up routing/load balancer/etc.
    Done every time a change or new instance is deployed.

In order to use shared services, we’ll also need a Terraform step to be run every time a new instance is provisioned.
For that, I think it’s best if we have two separate Terraform repositories: one for the deployment infrastructure and shared services (k8s cluster, managed databases - the current tutorraform repo), and another one for managing instance resources (provisioning the DBs + DB users, S3 buckets + credentials, and so on). This would just be one extra step in the CI pipeline, replacing the current Ocim resource management with a more reliable one (Terraform state), right?
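To make the split concrete, here’s a rough sketch of what the instance-resources module might contain, assuming a shared managed MySQL cluster. The resource types are real DigitalOcean provider resources, but the module layout and variable names are assumptions, not the actual tutorraform code:

```hcl
# Hypothetical instance-resources module: one MySQL database and one
# DB user per Open edX instance, created on the shared managed cluster.
# The cluster itself stays in the shared-infrastructure repo.
variable "instance_name" {
  type = string
}

variable "mysql_cluster_id" {
  type        = string
  description = "ID of the shared managed MySQL cluster (assumed input)"
}

resource "digitalocean_database_db" "openedx" {
  cluster_id = var.mysql_cluster_id
  name       = var.instance_name
}

resource "digitalocean_database_user" "openedx" {
  cluster_id = var.mysql_cluster_id
  name       = var.instance_name
}

# The generated password could then feed the instance's Tutor config.
output "mysql_password" {
  value     = digitalocean_database_user.openedx.password
  sensitive = true
}
```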


Where does Ocim fit in this?

This approach will simplify Ocim and remove the resource provisioning and management responsibilities from it. I think that the best approach would be to have a system with three moving parts:

  1. Tutorraform: handling the actual provisioning, migration, redeployment and scaling of instances.
  2. Image builder: something to build Open edX images from the base Tutor images + client customizations and themes.
  3. Ocim: a management UI to let users customize their instance, handle billing, and add any XBlocks or extra dependencies. This would then communicate with (1) and (2) to provision the instance or deploy the changes - maybe through git commits or infrastructure/build repos?

Note about (2):
I don’t think this is in the scope of this ticket, but this setup only deals with deploying prebuilt Tutor images of Open edX instances. To move to the containerized approach, we’ll still need to implement an “image builder” to automate client customization deployments, even if it still lives inside Ocim.

@giovannicimolin Thanks for trying it out!

Ah, I thought that, like Python, it was installed by default on most systems; I guess it’s just macOS that does that?

Ah, yeah I’m pretty sure it uses HTTP-based verification, so if you don’t allow insecure HTTP it won’t be able to get a cert from Let’s Encrypt. But we’re planning to replace Caddy-per-instance with our own load balancer for the cluster, where we can configure additional options.

I didn’t test storage, and I think it’s basically unconfigured. There is a Tutor plugin to use minio for storage, but I didn’t test it out. We’ll likely create our own Tutor plugin that provisions a DigitalOcean Spaces / Amazon S3 bucket per instance.

Basically, yes. That per-instance provisioning could be done either in Terraform or in Tutor (with a custom Tutor plugin).

I was thinking of having it be part of the same repo, just a separate folder and a separate Terraform module, and having GitLab CI auto-generate the tfvars from the list of instances; Terraform would then use that to provision all the databases etc. But it might be simpler or more efficient to do it as a Tutor plugin; not sure. What I like about the Tutor plugin approach is that it still lets you use GitLab CI to deploy/update just one instance at a time.
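Something like this, roughly - assuming GitLab CI writes the instance list into a generated tfvars file (all names here are illustrative, not the actual repo layout):

```hcl
# terraform.tfvars - hypothetically generated by GitLab CI from the
# list of instances:
#
#   instances = ["client-a", "client-b", "test123"]

variable "instances" {
  type = list(string)
}

variable "mysql_cluster_id" {
  type = string
}

# Fan out: one MySQL database per instance on the shared cluster.
resource "digitalocean_database_db" "per_instance" {
  for_each   = toset(var.instances)
  cluster_id = var.mysql_cluster_id
  name       = each.key
}
```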

If we go with Terraform and we want it to be a separate module (you could actually combine it into the existing Tutorraform Terraform modules if you wanted, but that’s probably not ideal), then yeah, it would be an additional step in the CI pipeline, before Tutor.

Everything you said there makes sense to me :)


For most Linux systems it is, but maybe not for build-your-own systems like Arch, and even where it is installed, it might be a pretty old version. It wouldn’t surprise me if some distros are still shipping Python 2.7, for instance.

+1 to everything here, and above. I also think there should be a separate repo for shared infra.


@jill @shimulch @giovannicimolin and anyone else interested: Great news! DigitalOcean now has managed MongoDB, which would be one less thing we need to manage ourselves using this new infrastructure. From looking at their Terraform resources (digitalocean_database_cluster / digitalocean_database_db), it does seem like we can provision one large cluster with many smaller per-instance databases on it.
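For anyone curious, provisioning that would look roughly like this - the two resource types are the ones from the DigitalOcean provider docs, but the names, size, and region below are made up:

```hcl
# One large shared managed MongoDB cluster...
resource "digitalocean_database_cluster" "shared_mongo" {
  name       = "shared-mongo"   # illustrative name
  engine     = "mongodb"
  version    = "4"              # DO only offers MongoDB 4.4 at the moment
  size       = "db-s-2vcpu-4gb" # illustrative size
  region     = "nyc3"
  node_count = 1
}

# ...with one smaller database per Open edX instance on it.
resource "digitalocean_database_db" "instance_mongo" {
  cluster_id = digitalocean_database_cluster.shared_mongo.id
  name       = "openedx-instance-1" # one of these per instance
}
```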


Unfortunately, it seems they only support Mongo 4.4. That could be an issue if they automatically upgrade instances before we are prepared for it.

Yes, this looks like a great thing. But as Kshitij pointed out, if there is no option to turn off the automatic upgrade, that will be a bummer. I saw a glimpse of the settings in their demo video, where it looks like we can choose the upgrade window (say, Sunday 12 AM). But we would like to have it turned off and upgrade on demand.

Do we know what part of edX is incompatible with newer versions? Most of the modulestore code (which I’ve been working with lately) uses only very simple Mongo operations, like structures.find_one({'_id': key}), which should be pretty forward-compatible.

I think it’s just a matter of a) the underlying library supporting the newer versions (which required an update for Mongo 4.0, for example), and b) somebody testing it and reporting back that it works. Both of which we can do, given enough lead time.

Open edX will have to support 4.2 soonish, in any case. If we validate 4.4 and upstream the findings, I don’t think anybody will complain. :slight_smile:

I think the bigger issue is falling out of sync. Even if Open edX supports 4.4 now and we use it, what happens when 4.6 or 5.0 comes out and DigitalOcean automatically upgrades us, but the platform is still lagging behind?

We’ve had this issue before. As @adolfo says above, it’s also a matter of libraries. A few releases ago, course exports would break under 3.6 even though they worked otherwise. Recently some work was done to enable support for 4.0. I personally had to do some work to add support for the authentication method Atlas uses, because it wasn’t something edx-platform supported.

@adolfo I think you mean that MongoDB 4.0 will reach the end of its support window in early 2022 and that’s the reason for “having” to support 4.2 as the minimum Mongo version, right?
But it looks like it already supports MongoDB 4.2.

I’m asking because Campus is using MongoDB 3.2 and we’ll likely migrate them to 4.2, and I’m assuming that 4.2 will work well (in Koa).

Yup! But if it already supports 4.2, so much the better!

These are assumptions:

I was assuming it because Koa and Mongo 4.2 overlap in the support timeline.

But do we have more solid proof that Open edX (Koa) works well with MongoDB 4.2? Or Lilac with 4.4?
Or will we need PRs to add support for higher versions, like this one?
The devstack still uses 4.0 right now, and we do too.