Running our infrastructure on Kubernetes...what do you think?

keithgg · August 31, 2022, 2:12pm

Xavier made a pertinent point on SE-5468: not everyone is aware of the changes involved in moving our infrastructure to Kubernetes.

In lieu of upgrading some servers to Ubuntu 20.04, it was decided that moving to Kubernetes makes sense at this juncture since that would make future upgrades and maintenance a lot easier.

For Open edX app servers we’ll use Grove for deployments, whereas for our own infrastructure we’ll use the infrastructure repo.

crafty-bot and ListaFlow are already running on Kubernetes in production.

Discourse is being migrated next, with Mattermost to follow. In addition, we have the tickets below for the remaining services.

SE-5596 Migration: Vault - Prod only
SE-5597 Migration: Sprintcraft - Staging and Prod
SE-5598 Migration: Matomo - Prod only
SE-5601 Migration: Hyperkitty - Prod only
SE-5602 Migration: WordPress of opencraft.com - Staging and Prod

With respect to the components that are needed to keep the cluster going:

DigitalOcean is our provider of choice.
Monitoring is still handled by Prometheus/Alertmanager and Sentry.
Managed services will be used where possible, e.g. for databases, Redis, etc.
Velero will handle cluster backups, which will be stored on S3/DO Spaces.

This forum post serves as a focal point for us to discuss anything related to this move and to alleviate any concerns from the team.

Agrendalath · September 1, 2022, 9:49am

That’s great news. Kudos for moving this forward!

One question - are we planning to start self-hosting Sentry or to upgrade our plan (e.g. to have unlimited users)? It’s a great tool, but we’re not gaining a lot without actually using it - e.g. we have 5 open ListaFlow issues (this deadlock looks particularly interesting).

This is likely related to the fact that I created our Sentry account with my email (piotr@opencraft.com) in 2019 when we were using it only for SprintCraft. I’ve been receiving and triaging error alerts from SprintCraft since then, but I don’t have any context on crafty-bot and ListaFlow, which started using this account in the meantime. Therefore, I believe we should formalize its usage a bit and pass the responsibilities of checking errors and creating follow-ups to project owners (or FFs, in case we start using this for some crucial services). As a first step, I’ve added¹ an ops+sentry@opencraft.com email as a secondary address to our account. Then, I changed email routing to forward crafty-bot and ListaFlow emails there. @gabor, @Fox, if you would like to get notifications about project-specific tracebacks directly, please add your emails to the Sentry account and change the routing as I’ve described above.

I am not sure who exactly is monitoring the ops@ list now, so I didn’t change the primary email on the Sentry account. If you would like to route non-project emails there by default, feel free to change it.

¹ Noting it, because this part was not obvious to me and took some log reading to understand: our mailman instance is using the Administrivia rule, so I had to accept the email verification message here, as it was being held for moderator approval.

keithgg · September 1, 2022, 1:11pm

@Agrendalath I’m not sure. Personally, I think self-hosting is best: it’s not that much trouble to keep it running and updated, and the cost per event is fairly pricey IMO.

However, for now I think we’re fine with the account you set up. Thanks for starting to formalize its use. I’ll ping @gabriel to get us onto the paid plan if possible.

maxim · September 1, 2022, 2:49pm

On the few occasions I used Sentry to debug starting a new sprint in SprintCraft, it has been amazing and quite useful compared to other monitoring tools. Obviously we should consider the costs of hosting it ourselves vs paying for an upgrade, but I just want to express my +1 for using Sentry for more of our tools.

antoviaque · September 2, 2022, 11:45am

@keithgg Thank you for posting about this, and having that discussion!

Btw, could we make this post public, or is there something that should remain confidential?

Agrendalath · September 2, 2022, 5:37pm

Does it affect us, though? We have 5K errors and 10K transactions in the free plan, which is sufficient for our use case. The $29/month Team plan includes 50K errors and 100K transactions.
If we set our on-demand cap to 0, to guard against something breaking and sending tons of requests (they still have spike protection, though), then it should cost less than maintaining our own instance.

If we are planning to use it for our Open edX instances, though, then indeed - self-hosting would be a much better option.
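To make the comparison above concrete, here is a minimal Python sketch that checks which plan covers a given volume. The quotas and the $29 price are the ones quoted in this post; treating them as monthly quotas is my assumption, the sample volumes are hypothetical rather than measured, and the `Plan` class and `cheapest_fitting_plan` helper are names made up for illustration (they are not part of any Sentry API).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_cost_usd: int
    errors: int        # included errors (assumed to be per month)
    transactions: int  # included transactions (assumed to be per month)


# Figures as quoted in this thread (September 2022); they may have changed since.
PLANS = (
    Plan("Free", 0, 5_000, 10_000),
    Plan("Team", 29, 50_000, 100_000),
)


def cheapest_fitting_plan(
    errors: int, transactions: int, plans: Sequence[Plan] = PLANS
) -> Optional[Plan]:
    """Return the cheapest plan whose quotas cover the given volume, or None."""
    fitting = [
        p for p in plans
        if errors <= p.errors and transactions <= p.transactions
    ]
    return min(fitting, key=lambda p: p.monthly_cost_usd, default=None)


if __name__ == "__main__":
    # Hypothetical volumes, for illustration only.
    for errors, transactions in [(1_200, 8_000), (7_500, 40_000), (80_000, 300_000)]:
        plan = cheapest_fitting_plan(errors, transactions)
        label = plan.name if plan else "exceeds both plans"
        print(f"{errors} errors / {transactions} transactions: {label}")
```

With the on-demand cap at 0, a volume above the included quota would not add to the bill, so the plan price is the whole cost on that side of the comparison; the self-hosting side would still need an estimate of hosting and maintenance time.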

keithgg · September 5, 2022, 6:57am

Agreed. It makes sense to try it out first and see if we hit the limits, rather than self-host from the jump.

For sure! Some of the links are private, but there’s nothing here that precludes a public discussion.

tecoholic · September 6, 2022, 6:03am

@keithgg Great work migrating things to K8s. Despite all the complexity that people usually complain about, I like the abstraction it brings in managing services from a single control plane. Thank you for creating this post and keeping the team in the loop. :)

kshitij · September 7, 2022, 12:12pm

There have been a lot of changes to the way our infrastructure is defined and deployed recently, with the addition of Terraform and the move to Kubernetes being two of the largest ones.

Since I don’t really have much of a background in these technologies, I feel like I need to invest some time into getting familiar with them to fully understand the current state of our infrastructure and deployment process.

I think it would be useful if we could get some kind of forum post, demo, tech talk, etc. I don’t know how much learning budget there is for this, but I think having a good overview of how our own setup works could save a lot of time overall.

I don’t think this should be a full-on intro to Kubernetes or Terraform, but rather how they apply to us, and in Grove.

gabor · September 7, 2022, 3:18pm

@kshitij The Nutmeg upgrade epic already has some tasks (SE-5638, SE-5639) for creating learning material (and allowing folks to catch up), though a quick intro to the OC infrastructure wouldn’t hurt. We (Serenity) already extended the onboarding course, but a quick video-based recap wouldn’t hurt either. What are your thoughts about this, @serenity?

keithgg · September 8, 2022, 2:14pm

@gabor @kshitij that’s a good shout. +1 to having a short intro video to the infrastructure.

maxim · September 9, 2022, 3:48pm

I think the infrastructure can be improved by making it easier to set up for people who are new to it. The README is not always up to date, which is understandable, because many changes are being made frequently.

I would suggest that people who are familiar with the setup try to set it up from a clean state, following only the instructions from the repo, and update the README to fill in the gaps.

For people who face any issues, if you don’t have time to make a proper PR with a fix, try to create an issue, which can be later addressed by the Serenity team, or someone else.

Also, automating as many setup steps as possible, at least where it wouldn’t take an unreasonable amount of time, would make it easier to work with.

Finally, thank you for your efforts! I think it’s a large improvement over how it was done before.

gabor · September 12, 2022, 7:33am

@maxim That’s great feedback. I’m going to open a ticket for looking into this for the next sprint or the one after that.

farhaan · September 16, 2022, 4:57am

Thank you for all the effort to bring such a wonderful change. I love the way we use Terraform; although we don’t have enough budget from our clients to set it up, I have used it and really loved it.

And now we have k8s, which is a pleasant addition. I have already asked a lot of questions on MM and will keep troubling you when needed.