DigitalOcean has become increasingly unreliable over the past couple of years, resulting in multiple incidents impacting OpenCraft’s clients: several weekend notifications, unreliable internal networking that caused connection losses to the managed databases, and, most recently, an insufficient number of Kubernetes worker nodes available in the instance family we use.
As a consequence, we created a discovery to see which major providers are available, what the approximate cost would be to run a similar hosting infrastructure with them, and what their strengths and weaknesses are.
The evaluation shows that hyperscale providers—Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure)—offer clear improvements in operational maturity, availability, and ecosystem depth compared to DigitalOcean. AWS stands out for its proven reliability, scalability, and extensive tooling. GCP provides the strongest combination of network performance, Kubernetes maturity, and cost efficiency, with a more automated and predictable operational model. Azure offers competitive capabilities, particularly for enterprise use cases, but shows less consistency in reliability and operational experience.
From a cost perspective, DigitalOcean remains simple but inefficient at scale, with no meaningful discount mechanisms and a baseline monthly cost of ~$1030 for the evaluated workload. GCP emerges as the most cost-effective option at approximately ~$807/month, followed by Azure (~$1078) and AWS (~$1124–$1150), all while delivering significantly stronger reliability and scalability guarantees than DigitalOcean. Thanks to AWS’s flexible Savings Plans, committed use can reduce its costs further.
Based on these findings, a migration to a hyperscale provider—most notably GCP or AWS—presents a clear opportunity to reduce incident frequency, improve system resilience, and achieve a more sustainable cost-to-performance ratio.
After reviewing the discovery, @braden and I came to the conclusion that while provider diversity is great in theory, operationally AWS would be a better fit for OpenCraft right now. Unless a client explicitly requests a non-AWS option and is willing to cover at least part of the associated costs, it may not be worthwhile for us to invest in implementing a GCP alternative.
As food for thought, it may be worth exploring how many of the “smaller” clients would be up for moving to a shared cluster, reducing their and OpenCraft’s costs as well.
With that being said, this is a topic that affects almost everyone at OpenCraft; therefore, I’m posting it on the forum as well for further discussion before anything else. Given the topic (client names, costs, and business decisions), this discussion is meant to be OpenCraft-internal.
Reading the discovery can be technically useful, but the most relevant part is already in this post. To give it a proper timeline, please make sure to come to a conclusion by the end of the month, 30th of April—this should give enough time for everyone.
@gabor Don’t those providers have higher hosting costs? Also, AWS is a company with very hostile practices, and we need to steer Open edX away from proprietary platforms, not towards them. Similar arguments can be made about Google & Azure.
I have not yet read the discovery, but I don’t see the option mentioned, so I would like to make sure we properly consider a more bare-metal provider like Hetzner. It doesn’t have all the bells and whistles of the big cloud providers, but it is reliable and several orders of magnitude cheaper for resources. There would for sure be a gap to cover in terms of adapting to it - I’m not sure how big that gap is, but on principle I’d rather spend our time, energy and budget helping to make Open edX a bit more open, rather than cementing it further into AWS and feeding inflated prices to the big ones.
I’ll just add a few notes here from what we’ve heard in BizDev conversations:
The strong default expectation from most leads is AWS. The only exceptions we see are some Enterprise clients requesting Azure and a small number of on-premise deployments. We haven’t seen inbound demand for DigitalOcean, GCP or other alternatives so far. This means that positioning a non-AWS option introduces an additional layer of friction, particularly around trust, familiarity and risk.
Data residency and control over region selection come up in basically every conversation. This is one of the main reasons most leads don’t consider shared hosting. If we were able to address that constraint, it would definitely increase the attractiveness and adoption of our shared infrastructure. Most leads want to choose shared hosting (mostly because of the price) but can’t, as data centre selection is a strict requirement for them; often it is their only requirement that necessitates independent hosting over shared.
@antoviaque On the point about moving toward more open infrastructure, I fully agree with the direction, and there’s obvious value in reducing our dependence on large proprietary providers. At the same time, from a commercial perspective, reliability and uptime tend to be absolute dealbreakers. Even small/recurring incidents can irreversibly impact client trust and renewals. Introducing any platform where reliability is less proven is quite risky.
With that said, I’m not familiar enough with Hetzner to make a direct comparison. From a BizDev standpoint, it would likely just require additional effort to build confidence with prospects compared to AWS. Of course, if the uptime/stability backs it up, it’s not so much of an issue.
@jordan The main issue-- data residency-- wouldn’t be fixed by moving to AWS. If we move the shared cluster to AWS, we still have to select a region for that cluster to reside in, and if someone can’t use that region, they can’t.
I share @antoviaque 's desire to make sure there’s an alternative-- preferably one that maximizes our opportunity to use non-proprietary tools. I’m not familiar with Hetzner. Looking things up, it seems we’d basically need to build and manage our own Kubernetes cluster with it. I’m not opposed to this-- I run my own Kube cluster out of Raspberry Pis at home, and have some idea what that would mean.
However that DOES mean a lot of additional work. Auto-scaling would no longer be a given-- we’d either need to provision more nodes manually or set up an integration with their API that can do so (or pay for a service that does it for us). We’d no longer have managed database services-- we’d be rolling our own, or else running them on top of the general cluster hardware with something like CloudNativePG.
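For what it’s worth, the upstream Kubernetes cluster-autoscaler does ship a Hetzner Cloud provider, so the “integration to their API” part already exists-- though we’d still own its care and feeding. A minimal sketch of what running it could look like, assuming that provider (the image tag, node pool sizing, instance type and region are placeholders, and the flag format should be verified against the current upstream README; RBAC and the extra env vars the provider needs, like cloud-init for new nodes, are omitted):

```yaml
# Sketch: upstream cluster-autoscaler with its Hetzner Cloud provider.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # placeholder tag
          command:
            - ./cluster-autoscaler
            - --cloud-provider=hetzner
            # Node pool as min:max:instance_type:region:name; verify the
            # exact format against the upstream docs before relying on it.
            - --nodes=1:10:CPX41:FSN1:workers
          env:
            - name: HCLOUD_TOKEN            # Hetzner Cloud API token
              valueFrom:
                secretKeyRef:
                  name: hcloud              # hypothetical secret
                  key: token
```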
Hetzner at least provides object storage, so we wouldn’t need to roll our own S3 buckets, though if we wanted to go the extra mile, we could provision a set of servers that act as a Ceph cluster and run that too, or else set up something like Longhorn or Garage. This would have the advantage of making future migrations easier.
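On the Longhorn option specifically: once installed, it just shows up as another StorageClass, so PVCs stay completely standard from the application’s point of view. A sketch of what that class could look like, with parameter values as assumptions to tune:

```yaml
# Sketch: a Longhorn-backed StorageClass providing replicated block
# storage across nodes on a bare-metal cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"        # keep each volume on 3 different nodes
  staleReplicaTimeout: "2880"  # per Longhorn docs; verify before use
```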
Honestly, our overall spend on infra, especially on labor, would likely go up, as we’d be taking on responsibility for everything from server maintenance to backups. Serenity might not be large enough for the challenge at its current size, and we might need to fold another team member into it. The migration would take a lot of additional work.
I think if we do this, especially if we move to Hetzner, we should consider provisioning as much of the stack as possible on-cluster, rather than using the cloud service’s S3 or managed DB (if they have one on offer). The reason is that we might have to make this move again if things don’t work out, and if everything is in Kubernetes, it’s much easier to pick up and move. The downside is maintenance overhead, and possible performance penalties.
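To make the CloudNativePG idea from above concrete: a self-managed Postgres becomes just another Kubernetes object, with the operator handling replication, failover and WAL archiving, and the whole definition moving with the cluster. A rough sketch, assuming the CloudNativePG operator is installed (names, sizes and the backup endpoint are hypothetical, and field names should be checked against the operator version in use):

```yaml
# Sketch: a three-instance Postgres cluster managed by CloudNativePG --
# one primary plus two streaming replicas with automated failover, and
# continuous WAL archiving to S3-compatible object storage.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: shared-postgres                 # hypothetical name
spec:
  instances: 3
  storage:
    size: 50Gi                          # placeholder sizing
  backup:
    barmanObjectStore:
      destinationPath: s3://db-backups/shared-postgres  # hypothetical bucket
      endpointURL: https://objectstorage.example.com    # hypothetical endpoint
      s3Credentials:
        accessKeyId:
          name: backup-creds            # hypothetical secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
```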
Hetzner has a reputation as a solid, reliable, affordable provider - but basic. If we have to build and maintain all the extra infra for a Kubernetes cluster on top of their VMs or bare metal, it would definitely be a ton of work, using up a ton of our team’s time, and it may not be any more reliable than DigitalOcean has been, given all the additional points of failure and scalability issues that we’d be responsible for.
I doubt any of our clients want this, unless we could show that the price is dramatically cheaper while being just as reliable. And that would be a questionable claim at this point.
On the other hand, many clients appreciate the AWS option: we let them control their own account, it’s industry standard, it’s reliable/scalable/full-featured, and there are lots of options for data location etc. In fact, many of our clients require AWS, so this discussion is really only about those clients that have no strong preference in terms of infrastructure provider.
At the end of the day, I think we have to pick our battles. In an ideal world, we’d:
Use Open edX instead of proprietary LMSs
Use Mattermost instead of Slack
Use ??? instead of Zoom / Google Meet
Use Nextcloud or Proton Drive instead of Google Drive
Use Proton Mail or other open source email+calendar product
Use ??? instead of Calendly
Use ??? instead of Jira (this has been a big pain to figure out)
Use ??? instead of Docker Desktop
Use ??? instead of Figma
Use GitLab instead of GitHub
Use Invoice Ninja instead of FreshBooks
Use ??? instead of Sublime Merge / Fork
Use ??? instead of Toggl
Use ??? instead of Claude Code
Use ??? instead of OpsGenie
Use ??? instead of New Relic
Use ??? instead of Dead Man’s Snitch
Use a fully-open-source Kubernetes + infra layer on top of bare metal from Hetzner or any other basic provider. (Using OpenStack?)
Use only Framework hardware?
As far as I can tell, at the moment almost nobody we work with (Axim, clients, eduNEXT, etc.) uses anything on the left as an organization. So not only does everything on the left tend to require more work to set up, more onboarding, etc, but we often still need to use the proprietary alternative anyways - either that or have more friction in collaboration.
And unfortunately, this is a zero-sum game: the more hours we spend on internal IT support and devops, the less unbilled budget we have available for contributions toward Open edX and other open source projects.
So: I think we should focus our open source adoption efforts in the places where we feel it will make the most impact. (Not sure what criteria though; projects we can contribute to the most? projects we can evangelize? projects that are true open source?)
If we want to use a non-hyperscaler option because of the hyperscalers’ business practices, and we decide it is a priority worth investing essentially our “contribution” hours toward, then I would suggest at least that we look at Akamai’s Linode Kubernetes Engine (LKE), Red Hat OpenShift, Scaleway, Vultr, or one of the others that provide a managed k8s service.
But at the end of the day, whether we’re using LKE instead of EKS or setting up our own cluster on Hetzner, I don’t think this sort of work really shifts the needle much for any open source project, nor meaningfully makes a difference in anyone’s business practices. I’d rather spend our time, energy and budget helping to contribute toward elemo, resume sponsoring open source projects directly, or just increase our CC hours toward Open edX and make Open edX better.
When we did this with OpenStack in the past, I don’t think anyone else followed suit or used our work, despite us promoting it and contributing it all upstream etc? (Same even with DigitalOcean in fact, though it wasn’t so much work for us.) So I don’t recommend investing in this without building some community buy-in first. I think we could get buy-in on better supporting GCP as an AWS alternative, for example. I doubt we’d get any takers on Hetzner support.
A lot of these are not something we use to collaborate outside of OpenCraft, though, so I feel we should look for alternatives to use within OpenCraft. Some of them do have good open source alternatives, like Cal.com for Calendly, PenPot for Figma, Solidtime for Toggl, etc.
If nothing else, we can move to reduce our reliance on these. For instance, even if we use Google Drive, why do we use Google Docs? For all my personal files I now create .docx and .xlsx files. These can be edited online, but you retain the documents in a somewhat more open format than however Google stores them.
My point in saying this is that there are still a lot of things we can do to move in a more open and open source direction if we have the time. And as @braden said, it’s a zero-sum game. We could invest a lot of time and effort into k8s on Hetzner, potentially many hundreds of hours plus dozens more in maintenance, when that time could instead go toward moving us away from some of the other proprietary or closed platforms we use.
I voted for GCP initially but changed to “no opinion” because I feel I’d be equally okay with AWS, GCP or the smaller providers mentioned here. I do feel we need to delegate the cluster auto-scaling etc., because we’re honestly just never going to have as much expertise in managing it as a provider that deals in only that.
Something that I feel I’m missing from the discovery is what specific reliability concerns we need to address. It seems that reliability is the biggest point, which I find a little ironic considering we have everything in k8s, databases in clusters, etc. which are all supposed to make things very reliable.
I wonder if it’s worth looking at this from another angle too: what architecture changes can we make to improve reliability? For example, if managed databases getting auto-upgrades are causing downtime, can we look at scheduling manual upgrades before the auto-upgrades happen, or using self-managed databases hosted in k8s?
Agreed that this would increase maintenance overhead, but if this is a way to move toward being provider-agnostic and the maintenance increase isn’t that much, this could be an excellent step.
An aside on this: the new Launchpad stack is actually locked in to GitHub, due to the workflows being built around GitHub Actions. So we’ve taken a step away from GitLab there.
Agreed that it’s a tricky position to be pushing toward using open source software - it’s a great ideal, but for many things it’s unfortunately not feasible right now.
As much as I love this ideal, if it makes business sense to stay with certain proprietary options for now, then let’s do it. I definitely don’t believe we’re a large enough company to be looking at maintaining our own OpenStack for example; that’s a huge amount of work. We might be fine to manage our own K8s, but will the cost of us maintaining it be less than the cost of a managed k8s?
+1 agreed. I’d much rather be doing this than doing maintenance on cloud projects that we could have managed for us. Automate the boring stuff. ;)
To answer the original question on provider: AWS seems an obvious choice for now: we already have support for it, there’s the expectation from leads that @jordan mentions, it’s reliable, we already have things hosted on it (so we’re not building out backends for yet another cloud provider). Dropping DigitalOcean and going all in on AWS reduces our maintenance efforts too, as we only need to be familiar with one provider, and only need to maintain scripts for one provider.
And with our savings, we can put some work towards making our tooling provider-agnostic so we can migrate with little friction in the future if required.
I have read the discovery and voted in the poll. Just noting one thing that stuck out to me.
I am really troubled by this. While I understand the non-viability of a multi-region cluster in technical terms (“raft consensus is affected by latency, causing etcd corruption, causing k8s instability”), I am wondering whether it is a solvable problem.
Would something like Rancher be a solution? Could we be running small K3s clusters in different regions, catering to clients from those regions? And upgrade to full-blown managed clusters when the demand is high for a specific region?
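For what it’s worth, the way Rancher approaches this is via Fleet: one management plane pushing the same GitOps bundle to many small downstream clusters, one per region. A rough sketch under that assumption (the repo URL and cluster labels are hypothetical, and field names should be double-checked against the Fleet docs):

```yaml
# Sketch: a Fleet GitRepo deploying the same manifests to every
# downstream (e.g. regional K3s) cluster labelled with a region.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: openedx-regional
  namespace: fleet-default
spec:
  repo: https://github.com/example/openedx-fleet   # hypothetical repo
  paths:
    - manifests/
  targets:
    - name: eu-clusters
      clusterSelector:
        matchLabels:
          region: eu
    - name: us-clusters
      clusterSelector:
        matchLabels:
          region: us
```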
I’m not really sure where to start this reply. I’ll try my best to answer most logically, even if it feels a bit off.
Not necessarily. DigitalOcean costs about ~$1030 per month for hosting. In contrast, almost the same setup would be ~$807 per month on GCP, which is ~$223 less than DigitalOcean. For AWS it would be ~$120 more per month. However, given the platform’s reliability and maturity, that $120 extra will probably easily turn into savings through less maintenance needed on our end.
As @braden mentioned, we can pick our battles: we pay either the operational costs or the cloud costs.
As the discovery mentions, this article has a good reality check on the lower-tier providers as of now. As it is lengthy, to save time for everyone who is interested, I asked AI to summarize this article:
Hetzner’s main advantage—low pricing—comes with trade-offs that become evident during critical situations. The platform offers minimal support, meaning developers are largely on their own during outages or incidents. Combined with oversubscribed CPUs, lack of automatic failover, no managed services, and limited global presence, Hetzner is best suited for non-critical or highly self-managed workloads rather than production systems requiring reliability.
Vultr, positioned as a middle-ground provider, struggles with inconsistency and hidden complexity. Performance varies across regions, bandwidth overages can lead to unpredictable costs, and reliance on local (non-replicated) SSD storage increases the risk of data loss. Like Hetzner, it also lacks managed services and flexible scaling, forcing developers to overpay for bundled resources and handle infrastructure management themselves.
Overall, the article concludes that developers in 2026 expect cloud providers to include high availability, replicated storage, independent resource scaling, DDoS protection, and real human support as standard features. As applications become more critical, the “cheap and simple VPS” model is no longer sufficient, prompting a shift toward more robust, managed, and production-ready infrastructure solutions.
I linked this article in the discovery, as it clearly reflects what I have seen and experienced over the past years. DigitalOcean has been suffering from these issues since their IPO a few years ago.
It won’t be solved by any provider or solution as-is. Kubernetes relies on etcd, which does not tolerate high latency; setting up a single Kubernetes cluster across multiple regions is not viable or reliable.
There are solutions for this—whether it is Rancher, GKE Fleets, or connected EKS clusters—that could solve the given problem, but not in the way we would like.
+1 on what Braden wrote. Lack of managed services will result in exponentially increasing operational expenses. I’ve been operating three production-grade Kubernetes clusters used by hundreds of microservices and 100M+ monthly users. Regardless of the scale (the issues come from complexity, not scale), it is not fun at all. Operating such clusters (again, regardless of scale) would require a dedicated team (read: at least 3 DevOps people) actively working ahead to ensure nothing breaks at the next cluster upgrade, node scheduling keeps working, vertical scaling doesn’t fall apart, etc.
I agree with this, especially as I remember questions about GCP and Azure support in Grove and Harmony as well.
Actually, this is the best evaluation in the thread regarding the work that would need to be done with a self-managed Kubernetes cluster.
Regarding the specific reliability concerns in the discovery, I’ll go back and collect a few of the issues that were happening with DigitalOcean. However, it speaks for itself that the discovery was created because DigitalOcean was unable to allocate nodes for the Kubernetes cluster in the instance family (m-2vcpu-16gb) we need. I checked the Kubernetes nodes during the weekend and just now as well: we have “dead” nodes that report “upgrading,” but they do nothing and occasionally reset their creation date without making any progress.
I would say it is rather sad and disappointing that a cloud provider promotes itself as production-ready but cannot live up to that standard. However, your assessment here is not quite on the right track:
Kubernetes itself is reliable; in fact, we don’t really have issues with the cluster itself. The issues come from the underlying infrastructure that operates the clusters, e.g., DigitalOcean-flavored Kubernetes control planes, improper node allocations, faulty internal networking, etc.
Database clusters are reliable; we have no record of a cluster falling apart or producing issues on its own. That said, the upgrades are causing issues, and manually clicking the upgrade button will not solve the problem. We have been there and tried that. The issue is how DigitalOcean handles the database cluster upgrade, not what triggers it, and that is a huge difference. We have high-availability settings turned on for the cluster, failover cluster included; regardless, the upgrades cause a blip of a few minutes almost every time, which is why we schedule them for the weekend.
Technically speaking, this is (almost) the worst possible move. Kubernetes is not for stateful workloads. Kubernetes is built for ephemeral workloads: pods are transient and restart often, and databases need a stable, persistent state. This results in a higher likelihood of failovers/restarts.
You must handle backups, replication, scaling, and performance tuning. This results in an enormous effort to set up, configure, test, and maintain, and would cost OpenCraft a lot. There is a reason we decided (at an organizational level) to go with managed services.
Yes, and it is not a bad thing. Putting ideological concerns aside, a hard requirement for the new stack was to build on existing, community-provided solutions as much as technically possible, to reduce the maintenance effort OpenCraft has to put in. This also makes community adoption easier. We build on Picasso (by eduNEXT), which uses GitHub Actions. This means we cannot reuse it in GitLab; even if we could, we would rather not end up with a Frankenstein stack.
If we were self-managing the clusters, Rancher would be a practical solution and probably the only viable one. However, it circles back to the main issue: self-managing would be a gigantic overhead that we should not take on.
@gabor I don’t think this is true anymore. It was certainly true in the early days of Kubernetes, but since then, PersistentVolumeClaims, StatefulSets, and better StorageClasses have landed. There are some things you have to do differently-- you do have to plan your nodes more around their disks and how those will be used, for instance. But it’s perfectly possible to do this today, and there are charts/manifests/CRDs for Postgres, MySQL, MongoDB, S3-compatible storage, etc.
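To illustrate what I mean: the combination of a StatefulSet and volumeClaimTemplates is what gives each database pod a stable identity and its own PersistentVolumeClaim that survives restarts and rescheduling. A minimal sketch (image, secret and storage class are placeholders, and actual replication between the instances still needs an operator or config that isn’t shown here):

```yaml
# Sketch: StatefulSet + volumeClaimTemplates. Each replica gets a stable
# name (mysql-0, mysql-1, ...) and its own PVC (data-mysql-0, ...).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql           # requires a matching headless Service
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-secret     # hypothetical secret
                  key: root-password
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd       # hypothetical, provider-specific
        resources:
          requests:
            storage: 50Gi
```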
I think I should have made it clearer in my post that I’m very much saying this would increase maintenance costs, and that doing this would indeed require increasing Serenity’s size. But I’ve felt Serenity is probably too small right now anyway-- we’re pinging you at all hours of the night, and @mtyaka doesn’t have the kind of hours to cover your off-times as much.
We wouldn’t have to do it all at once, either. We can move to another provider and use standard managed services, then try moving to some of these internally-managed services on our own cluster, and see how they perform before using them on the shared cluster. If we find they just don’t cut the mustard, we go back.
I think that AWS is (unfortunately) the best choice for hosting Open edX instances at this moment. It is reliable, some clients demand it, and no client objects to it. The hosting costs are slightly higher than comparable providers’, but worth it if it means we won’t have to deal with the issues we’ve seen with DO. As others have noted, going bare metal could reduce hosting costs significantly, but would require an order of magnitude more hours poured into supporting the infrastructure.
If we want to explore other providers and/or going bare metal (which can be fun), I would suggest we start with our internal infrastructure rather than Open edX instances.
For AWS, we will be running instances on it in any case, as some of our clients require it. But for the instances where we choose, either for our internal infra or for Open edX instances, I think it’s important to pick a hoster that isn’t Amazon/Google/Microsoft, as it forces us to also improve Open edX for those who can’t or don’t want to use them to host Open edX. It is an important forcing function for us, and for the Open edX project.
And in some territories or areas, none of the hyperscalers would be available. This is for example the case with Ethiopia, which needs (and is required by law) to host their Open edX instances within the country. The local cloud service providers might not yet be as reliable as AWS, but they still need to run Open edX themselves. And being able to run your software yourself is freedom zero of the free software definition.
Could we help ensure Open edX runs and scales well for users outside of the hyperscalers? Running our own Kubernetes on bare metal might still be too much for our current size, but could we find a managed Kubernetes hosting provider outside of Amazon/Google/Microsoft that we like, and move there? And in the process, improve the capability to switch between and support alternatives, so that if the new one doesn’t work out, we can more easily move somewhere else? @braden has mentioned a few, can we look into them (or others?):
I also like the first two design principles from the discovery; advancing those further would help increase the potential options when picking a hosting provider:
1. Reliability by Design
Infrastructure must be designed to tolerate provider-level failures without causing service disruption in Tier 1 systems. This includes:
Eliminating single points of failure
Using managed services with proven SLAs where appropriate
Designing for graceful degradation instead of total failure
2. Provider-Agnostic Abstractions
Where feasible, architecture should minimize tight coupling to provider-specific implementations:
Prefer Kubernetes-native and open standards-based solutions
Avoid proprietary services or open-core services, unless justified
Maintain portability to reduce vendor lock-in risk
Could we coordinate an effort on reinforcing those across projects, like Ethiopia’s or other providers from the community, to make sure we contribute to a shared effort, and that we don’t just do it for ourselves in our corner?
Also, for the data: if hosting it in Kubernetes is not an option, we could run and manage external databases ourselves, outside of Kubernetes, when the Kubernetes provider’s offering doesn’t suit us? It would be work, but less than trying to manage a Kubernetes cluster, and we already have some experience with it.
PS: I don’t see anything that would require keeping the current thread private - some mentions of clients, but nothing that would be confidential? Could we open it, along with the discovery document? Remember the open first principle - the default is to keep things open, and in case of doubt, to go for opening them.
I support not even trying to run a Kubernetes cluster on our own. That is constant work for a DevOps team and can be very time-consuming.
Regarding providers, if not the big three, we can talk about Linode (which was bought by Akamai), Vultr, Scaleway, OVH, UpCloud, and DigitalOcean (where we are right now). Of course, there are also IBM Cloud, Red Hat OpenShift, Alibaba Cloud, and Oracle Cloud, but (except Red Hat) these are not much better in terms of policies/market strategies than AWS/GCP/Azure.
I would say that we should choose based on the biggest pain points we have right now with DigitalOcean: be reliable, have few-to-no ad-hoc breaking changes in the toolchain (like Terraform providers), and have a helpful support team.
I think extending the discovery with all of these wouldn’t make sense, as that would be a lot of unnecessary time spent. Especially since we can cross OVH out due to reliability issues, and Vultr is essentially a slightly cheaper DigitalOcean with less consistency and support. Scaleway could be a great contender, but is still very immature in the market compared to the others. I would say Linode may be the best choice of these, given their historical track record in the market. But I’m not sure if we are OK with Akamai or not.
We could include Red Hat OpenShift as well, but that would mean self-managing the cluster, which is something we would like to avoid.
Based on these: which of the providers should be included in the discovery? This is an honest question; if it were up to me, I would include only Linode.
The Harmony project (which provides the core of the hosting infrastructure) already supports these practices as much as possible. Our limitations come from providers’ missing capabilities and from incompatibilities between them. These principles were listed in the discovery to ensure we don’t overlook them when we evaluate a provider. If a provider cannot deliver any of the points above from “Reliability by Design”, they should not even be considered.
The second point is more about how we approach things, but it should be supported by the provider as well. For example, if a provider offers no plain hosted MySQL, only a flavored alternative, that should be a huge red flag, as we might come to depend on something that is not available in other clouds. However, this does not mean we should roll our own databases; there are many reasons for that, listed below.
I’m quoting this part as it is the most relevant. There is a difference between “we could” and “we should”. The technology exists, for sure, but I would say not for us. Hosting MySQL (or MongoDB!) would be a huge pain (again) that we were happy to get rid of. I still remember how much trouble the MongoDB cluster caused us, and how it felt when the MySQL cluster fell apart for “no reason” because clustering was not working as expected. I would not put either of these onto a Kubernetes cluster whose nodes are continuously being replaced, which can affect the database clustering. S3 could be something we put there, but what would be the benefit of self-managing it?
Anyway, I feel we approached the issue around the databases from the wrong angle. We don’t have database reliability issues per se. We have issues with how DigitalOcean fails to properly perform database updates/upgrades, which is a huge difference. We don’t have this issue with AWS, nor in some other DigitalOcean regions (or we never noticed it there). This is an area where DO fails.
Honestly, I would say to keep the managed databases approach, as we have no issues with the databases themselves, only with DigitalOcean in this case. Switching to a non-managed approach would mean that:
we have to make sure clustering works
we have to take care of multiple daily backups
we need to take care of upgrades and maintenance
we have to watch out for OS security updates, etc.
Removed the client name from the discussion starter and made it public.
Making sure the discovery document is open requires a public repository, though I cannot think of any good candidates. Maybe the repo related to Launchpad as that is the target hosting solution?
I think he just means marking the sharing settings of the discovery document as open for anyone to read. We don’t need to drop the discovery into a repo.
The primary benefit of self-managing it (and the other services) is that it makes migration easier and means we’re not relying on provider-specific solutions-- so it can run on a much wider range of cloud providers. The tradeoff is more maintenance, including the related pains you mention. The question is whether that’s a tradeoff worth making, and we might have historical data to answer that.
As you mentioned, we were at one point managing our own database clusters. There might be old tickets we can scavenge to see how much time we spent rescuing/patching/upgrading those databases, and how that compares to what we’re spending working with them today. It wouldn’t be a perfect comparison-- the stack is different, and I think MySQL’s reliability has gone up since (though maybe that’s just me taking the hard work behind hosted DB solutions for granted). It may give us some idea, though.
I think @braden 's point is also prescient, however-- ‘Who will follow us?’ I don’t think anyone will until the cloud providers get too troublesome to work with. Some clients, especially national programs, may want to run it all on their own hardware. But I think it’s reasonable to ask those clients to pay for any special work required to make their cluster more independent and open source the solution then rather than doing it out of pocket for our small number of clients.
A better argument might be ‘let’s get good at doing this so we can target more sovereign deployments’, but even then we already have some great contacts and clients who might need this or already want to do it. Can we talk with them about developing more solutions on this front rather than spearheading it with our shared cluster, an offering that is supposed to provide cost savings to clients?
@gabor Oh, these are criteria for the hosting provider? They should be criteria for Open edX and the services we run on top of the hosting provider too - and what I meant was to work on improving this in Open edX and on our side of the infra, to compensate better for hosting provider failures:
Reliability by Design
Infrastructure must be designed to tolerate provider-level failures without causing service disruption in Tier 1 systems. This includes:
Eliminating single points of failure
Using managed services with proven SLAs where appropriate
Designing for graceful degradation instead of total failure
This is what allows us to be less dependent on perfect reliability from the hosting provider, and to have the infrastructure and Open edX built in a way that compensates for hosting failures. The Chaos Monkey approach, to help support a broader range of hosting providers. Can we work on improving that, so that we don’t need an AWS to run our infrastructure or Open edX?
Also, remember that we will need to get Open edX to run at (large) scale on relatively unproven hosting providers in the case of Ethiopia - so this is going to be important to work on. @gabor @tikr @kaustav @paulo @braden can we make sure to work together upstream as one project on this (OpenCraft’s infra, Ethiopia, and whoever else from the community wants to join) to improve the ability to host Open edX in places other than AWS?
@gabor Some can be removed, yes: OVH - we have been there and were not satisfied with the support and reliability, so we can remove it. I don’t know Vultr, but if they have a bad reputation for support, it’s usually warranted, so we can remove it too.
Scaleway isn’t that immature; it has been on the market since 1999, and their managed Kubernetes launched in 2019. Worth including in the review.
For Akamai and IBM Cloud, I’m not sure I’m a fan of them, or that they would be any nicer if they could corner the market, but at least they are currently challengers. We don’t need a perfect hoster, whether technically or philosophically – we need one that’s different from the main hosters (and AWS in particular), and to support it in a way that makes it easier to pick any provider, and to switch between providers when we want or need to. So Linode or IBM Cloud also sound good to include, yes.
Not necessarily - it could be that they focused on something else, like running their Kubernetes cluster well. Since we are trying to expand the range of hosters we can use, we have to broaden some criteria - if we just look for the unicorn, we don’t end up with much choice. If a hoster runs a good managed Kubernetes service but no managed database, then we should consider using the provider and managing the database ourselves. It’s work we know how to do and that shouldn’t scare us; it’s worth considering the additional options it gives us.
Also, is the list complete? A quick search brought up others, like:
Ah, if we’re adding more, let me throw https://fly.io/ into the mix: a relatively new cloud provider, but rather innovative, and they have a managed Kubernetes offering.
One more to add to the list: https://stackit.com/ (aka the Lidl cloud). I find this one interesting because I believe they’re one of the few with enough capital to create a European alternative comparable to the big US providers. I do not have any experience with it though.
I have been using IBM Cloud for many years and they have been very reliable. Their technical support is usually pretty good as well. But I have only been using their classic bare metal servers and Object Storage, not their VPC or managed services, so I cannot vouch for those.