Tech talk/demo: Deploying multiple Open edX instances onto a Kubernetes Cluster with Tutor

Hey everyone,

I’m excited to have a little demo video to share about my experiment to integrate Tutor + Terraform + GitLab CI. If you’re interested in this, check it out:

@jill @adolfo and @giovannicimolin have already planned to do a formal review of this work and share their ideas for next steps, as we think about things like converting Ocim to use Terraform and deploying Open edX using containers. If anyone has questions or ideas though, please share below! After we’ve done an internal review, I’d like to post on the upstream forum and also get Régis’s opinion.

BTW, given the ongoing challenges with OVH, I’d like us to seriously look at DigitalOcean as a viable alternative provider, if our infrastructure is containerized. Their managed Kubernetes service is really nice (compare the AWS vs. DigitalOcean Terraform and you’ll see), and they’re about 1/4 the cost of AWS for this use case.

Ticket for this post: MNG-1925

Links:

9 Likes

Is Tutor a good alternative to the edX devstack for local development?

Wow, that’s a nice tutorial! :tada:

1 Like

@braden, thank you for writing the code and doing this demo! This looks awesome! :tada::+1:

I haven’t tried out Tutor till date and I still have to learn Kubernetes. So this appears to be all magic to me :stuck_out_tongue:

1 Like

I think there are goals of using it for that eventually, and some people use it that way now, but I would recommend the usual devstack for now, to be consistent with the rest of the team and compatible with all the different plugins and versions of stuff that we use.

That’s really cool @braden! Thanks for doing it. :clap:

1 Like

I’m on the same boat at @guruprasad that I haven’t yet had the chance to explore tutor or Kubernetes. This looks pretty exciting, looking forward to working with it! I’ve already set it up on my personal DigitalOcean account to play with.

1 Like

@braden As newcomer that never tried “old” dev enviroment and I will try your new dev stack setup, and tell you results :slight_smile:

Upd: Ooops. I must be read all comments first. It is not for development yet. But everything is launched fine on mac.

@artem.hruzd Exactly - this is not really a dev environment, so I still recommend using https://github.com/edx/devstack for that.

Update: In two days (end of the month), I’m planning to shut down the demo servers / Kubernetes cluster that I have running on DigitalOcean (on my personal account). If anyone wants me to leave them up longer, as part of testing this out, please let me know. (You can also spin up your own, very easily, using your own DigitalOcean account. And if we like the DigitalOcean service, we may consider setting up an OpenCraft account and using it for hosting in the future.)

CC @jill @adolfo @giovannicimolin

@braden I haven’t had time to look at it yet, so I wouldn’t mind keeping it up a bit longer – though I’m not yet sure when I’ll be able to, and ultimately my review isn’t strictly necessary, so it’s also fine to take it down.

@braden thanks for all the work here. :slight_smile: I see one of the leverages of this setup is we will be free from vendor locking. I am very keen on how do we monitor health and how can we build fault tolerance.

The architecting of the whole package is also fascinating, how can we model different IDAs to interact with each other and probably increase the resiliency of the system.

I may be sounding childish here so please pardon me for that. I will probably try to run the whole setup and play around with it to get a better idea.

@braden I actually ended up fitting a very small task to look at this quickly this sprint. :) Kudos, you seem to have done a lot of the exploratory work to answer the remaining questions regarding using Tutor with Kubernetes, thanks for that!

I haven’t dug deep enough to be able to have a properly informed technical opinion, but it seem like an approach that is consistent with the general direction the project is taking – and it heavily relies on existing projects and infrastructure to avoid reinventing the wheel, while providing a standardized way to deploy Open edX in production and potentially at scale.

A few questions that remained after my superficial look – they would probably be answered by a deeper review and a better understanding of the different pieces, so some of these will be naive questions, but I thought they might be useful to answer them explicitly:

  • One core issue with Kubernetes is the amount of ongoing maintenance such a cluster is known to require, when managed directly – I’m not clear on where that maintenance burden lies, currently. Is it on us? On AWS/Digital Ocean? On the maintainer of the Terraform scripts (and would that be us, or an upstream? https://gitlab.com/opencraft/dev/tutorraform/-/blob/main/provider-digitalocean/k8s-cluster/cluster.tf seem quite small) What’s the best way to delegate this as much as possible?
  • What would be Ocim’s role, in this context? I like that it would make its scope smaller, but it would be useful to define it precisely, so it’s clear going forward
  • Would it be worth posting this video/test on the official forums, as well as a link to the current thread? It could be good to get the opinion of others in the community – including edX who is also evaluating how to use the different pieces available in the community for their infrastructure, and Regis for Tutor.

In any case, kudos for this work, it looks like a great step forward.

Feel free to shut this down – I’m going to create my own to test your repo and docs.

Cool. BTW if you want to test DigitalOcean, perhaps you can create an official “OpenCraft” DigitalOcean account that we can all share for future tests? (And if we decide to start using it for hosting)

Sure, will do, and will put it in Vault once we’ve got the payment method switched off my credit card :slight_smile:

1 Like

@antoviaque That’s great, thanks for checking it out!

Yup, that’s what I was going for here! This whole experiment is just a bunch of “glue” scripts to make other existing tools work well together.

Yes, Kubernetes can be quite complex, so we definitely want a managed service.

In my (limited) experience, DigitalOcean’s managed Kubernetes service is the nicest offering (and conveniently, it’s also one of the most affordable). DigitalOcean takes care of all the complexity of managing Kubernetes itself, as well as auto-scaling the cluster, running the Kubernetes Dashboard, etc. (They also maintain the official DigitalOcean Terraform provider, which I used here.) I think it’s a pretty nice separation of concerns, where we are responsible for our application + databases, and they are responsible for the Kubernetes cluster itself and everything it relies on.

Amazon EKS is an alternative, but is way more expensive, and still relies on you to manage a lot yourself. We’ve been using it in production for several years now for LabXchange, so we have a good understanding of it. Perhaps @toxinu might want to weigh in on how much maintenance has been required lately? In the case of EKS, Amazon manages a much more minimal Kubernetes service, and you have to implement things like auto-scaling, the Kubernetes dashboard, and other features yourself - so it’s more complicated. It also relies on “community supported” / semi-official terraform modules like this.

So I think that using DigitalOcean would give us less to manage ourselves than we do today. If we used their managed MySQL service (which is also an option), we could have even one less thing.

Things we (and I’m including “Ocim” in this “we”) still have to manage are:

  • MySQL (or just use RDS/DigitalOcean MySQL)
  • MongoDB (currently Tutor deploys a separate MongoDB instance for each Open edX instance, and I don’t think it has automatic backups)
  • HTTPS certs (Tutor deploys a separate load balancer for each instance, which would get expensive; we can use a very similar approach but with one load balancer per cluster, which will be much cheaper)
  • DNS records

That’s part of what I’d like the reviewers of this project to comment on. So I don’t have a precise answer to that yet.

In some sense, this project acts like a “mini Ocim”, and provides some overlapping functionality (i.e. deploying and managing lots of Open edX instances) just using Gitlab.

Here’s how I’m thinking of the major pieces:

  1. Customer portal: To me the most obvious thing that it lacks which Ocim provides is a secure customer portal that allows customers to deploy their own instances, manage things like theming / billing, etc. So I think that’s a clear role for Ocim going forward.
  2. Orchestration: As for orchestrating upgrades and running the various tools like Terraform or Tutor, we can choose to use GitLab CI to do that as I’ve shown here (with Ocim triggering it as needed using the GitLab API) or implement that in Ocim (which wouldn’t be too hard, as it mostly involves running external commands which Ocim already does). I think we need a bit more discovery/experience with GitLab CI to answer this.
  3. Managing Open edX: Some Ocim code is related to specific nuances of Open edX, like the upgrade process. But Tutor also includes similar code to deal with those things, and has more community usage. So I think we should see if Tutor will cover our use cases and if we have an effective relationship with Tutor as a key upstream project; I think that’s a good goal and based on my experiments so far it seems like it’s possible. However, if not, we’ll need to continue having Ocim handle that.

If we can say that Ocim will mostly focus on (1) and we’ll integrate it with GitLab CI or GoCD for (2), and Tutor for (3), I would say that’s a big win in terms of simplification and re-using existing tools. But I’m not yet clear on how much if any of (2) we can/should move out.

Yes! I’m just waiting for our internal reviewers first:

1 Like

This is really exciting @braden! Both the containerized deployment mechanism, and DigitalOcean as a cloud provider, frankly.

I assume we can run multiple instances/deployments on a single DigitalOcean cluster?

So from what I can see from DigitalOcean, we could:

Alternatively, we can continue to run our haproxy servers and point at the droplets, like we do now.

Yep, need to share scalable instance(s) here, with proper backups. Managed hosting would be dreamy… And even better if we can actually shard our clients over multiple database hosts, so our backups aren’t so monolithic.

Ocim driving Gitlab or GoCD is interesting… We’d still need to link the Customer portal part in.

Would we then:

  • Keep a private configuration for each instance managed by Ocim (as branches of a single private repo or multiple private repos, whatever)
  • Ocim passes changes to config by pushing to git, triggering redeployments.
  • Would get rid of redeployment bottlenecks on Ocim server

I’d rather improve Tutor to manage upgrades better than continue to maintain our own separate upgrade pathways. Then, more people can use our work.

@braden, thanks a lot for this, and in particular, your insights on how Ocim could make use of this going forward. I’m essentially on the same page: let Ocim be an awesome frontend, and delegate everything else.

We can point Open edX/Tutor instances to existing Mongo servers, just like for MySQL. The question, also like for MySQL, is one of provisioning and backups. Who’s going to do it? Ocim? Terraform? Gitlab CI?

This would be fabulous. It would solve one more problem, too: that of revision-controlling config changes. +100

Another +1. Plus, it’s the kind of contribution Regis would be very open to, I think.

Can you keep us posted on this, @jill? I’d like to play around with this next sprint, too. Thanks!

Yes, that’s one of the main goals here and is already working.

Yes, though we can also use DigitalOcean DNS. Either way, Terraform can be used to manage it by code with version control.

It can be simpler than that. The DigitalOcean load balancer essentially just provides an external IP for the cluster and routes inbound traffic to a service that you specify. In the case of Tutor, it routes traffic to Caddy which then provides automatic HTTPS (Let’s Encrypt) and forwards the traffic on to whichever service is appropriate (LMS, Studio, ecommerce, etc.).

The “problem” is that Tutor currently deploys a separate DigitalOcean load balancer ($10/mo each) and Caddy instance for each Open edX instance. We only need one DigitalOcean load balancer and one Caddy instance, and Caddy can then route traffic to every LMS/Studio/etc. as needed. We only need to update the Caddy configuration (or perhaps it can auto-detect, see below), and the DigitalOcean load balancer’s config never changes.

I’m not very familiar with Caddy, but I know it can handle this use case. I personally have previously used traefik which is very similar, and has the nice feature that you don’t need a central config file - you just put tags/annotations on the various LMS/Studio/ecommerce/etc containers that run on your cluster, and it auto-detects them and routes traffic to them. But we’ll probably go with Caddy here (for consistency with Tutor and because the open source version of Traefik on Kubernetes doesn’t support High Availability if using Lets Encrypt).

As far as I know, DigitalOcean load balancers don’t support LetsEncrypt when used with Kubernetes, so we have to use Caddy for that. But that’s fine.

Yep, it’s all just a single private repo (see how it works when you test this out). To do a change (add a new instance, change the version and re-deploy all instances, etc.) you just open a merge request.

The Tutorraform scripts that I wrote here can be extended to deploy a managed MySQL cluster for each Kubernetes cluster, and to deploy a couple VMs that we can then deploy Mongo onto. So Terraform + Gitlab CI + ansible.

1 Like