Replacing NewRelic with Uptime Kuma (for Synthetic checks)

gabor · May 2, 2025, 10:02am

As we discussed many times over the years, NewRelic is troublesome from time to time. Examples include the access changes and the continuously flaky monitoring alerts that turn out to be false positives.

Although we use it for performance monitoring in some cases, which would hopefully be replaced by OpenTelemetry on edX’s side, we could replace the synthetic monitors.

Why?

Flaky monitoring, false-positive alerts in the night, hard-to-use UI, and the list could grow.

Replace with What?

Serenity did a discovery on possible replacements; however, ~~OpenStatus~~ Uptime Kuma seems to be the best replacement.

~~OpenStatus~~ Uptime Kuma is an open-source monitoring tool with an easy-to-use UI. We would self-host it.

What would change?

First, and obviously, the place where we check health statuses would change. Besides that, we need to create a new Tutor plugin to auto-create the monitors for an instance. Other than that, not much. Uptime Kuma would report directly to OpsGenie, so we can keep using the tooling we are used to.

When would the change happen?

I’m in the process of setting up the monitoring instance in a DigitalOcean region where we have no instances. However, the Tutor plugin will not be created until the next sprint, which is when we plan to work on it.

Log time to: SE-6167

demid · May 2, 2025, 1:38pm

@gabor, while I agree that NewRelic’s UI is terrible and there are other problems with it (including, probably, price?), it has proven to be very stable for some of our clients. In fact, I’ve never seen any false-positive alerts for HMS (@Agrendalath, please correct me if I’m wrong). I’ve replicated the HMS’s setup to monitor Aquent’s instance and it seems to be stable too. I’m not entirely sure, but I think the key difference between how we monitor instances in our shared cluster and HMS is that the latter is pinged from three different locations instead of one and has a different alert condition type.
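To illustrate why checking from several locations can cut down on false positives, here is a minimal sketch. It assumes an alert condition that only fires when at least a given number of locations fail in the same check window. The function name, location names, and threshold are hypothetical and are not taken from our actual New Relic configuration.

```python
from typing import Mapping


def should_alert(results: Mapping[str, bool], min_failing_locations: int = 2) -> bool:
    """Decide whether to alert for one check window.

    `results` maps a (hypothetical) location name to True when the ping
    from that location succeeded and False when it failed.
    """
    failing = [location for location, ok in results.items() if not ok]
    # A single failing location is treated as a local network blip,
    # not as an outage of the monitored instance.
    return len(failing) >= min_failing_locations


if __name__ == "__main__":
    window = {"location-a": False, "location-b": True, "location-c": True}
    print(should_alert(window))
```

With a single location, every failed or skipped ping is indistinguishable from an outage; with three locations and a threshold of two, one bad ping is ignored.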

gabor · May 2, 2025, 1:57pm

So you are saying that the flaky behavior is basically not experienced because of the multi-regional monitoring? I was under the impression that all Grove-based instances use the tutor-contrib-newrelic plugin, and so are registered the same way everywhere. What’s the difference then? Does the client use synthetic monitoring or something else?

maxim · May 2, 2025, 2:03pm

I could be wrong, but I remember that when we were testing the tutor-contrib-newrelic plugin, I spotted a difference (probably the same one that @demid is referring to) between the monitoring that was set up by hand and the monitoring created by the plugin. The reply I remember was that there is no way to create the same config via the NewRelic API.

demid · May 2, 2025, 2:23pm

Yep. But there are also other differences in the config. You can compare it yourself: 1, 2.

No, the Aquent’s instance is Grove-based and isn’t using it. It’s using the same config as HMS.

Agrendalath · May 2, 2025, 4:12pm

Correct - this monitoring is reliable for HMS instances and has been useful many times, so I’d be cautious about replacing it.

@gabor, other than the APM you already linked, I’m also using Errors (the error inbox) to check exceptions without logging into every instance from the ASG separately and checking individual logs for each service we have in the supervisor (we currently run 23 services). I used to use our ELK stack for this, but we removed it some time ago.

Since we last considered self-hosting Sentry, it has become much more challenging. To replace New Relic fully, we should use a service that provides APM and error tracking, like GlitchTip. Alternatively (e.g., if OpenTelemetry is already supported), we could use Bugsink for error tracking (though this one does not expose a REST API).

That said, Uptime Kuma looks nice. I like that we can specify a proxy for the monitors, allowing us to set up multi-regional uptime checks. What slightly worries me is the lack of the API - it is an SPA app that uses websockets for all interactions. We would need to manage all monitors manually.
If this is problematic, and we would like to use GlitchTip, we could look into its built-in uptime monitoring. It has fewer features, but the backend runs on Django, so we could implement and contribute things like configurable retries, if needed.

It should be possible. I have these monitors terraformed, so they are configured via the New Relic API.

antoviaque · May 2, 2025, 5:57pm

If it ain’t broke... It might be worth digging into the source of the flakiness then, btw, if it’s not New Relic.

Also - we definitely don’t want to host our alerting & monitoring tools; they are part of the last line that ensures we can quickly detect issues. We want to minimize the possibility of the monitoring tool going down during an outage, and aside from picking a different datacenter, one of the best ways to ensure this is to have it managed by another organization entirely.

gabor · May 2, 2025, 6:46pm

tl;dr: There were some misunderstandings. In this case, I believe we should give NewRelic another shot.

I recall something, too, but cannot remember the exact details. I remember that NewRelic’s GraphQL API that they try pushing is not capable of everything. And maybe the old API was deprecated too, but I’m not sure right now.

@Agrendalath Great to know about this. Unfortunately, I found no information about other usage of NewRelic (like the one you mentioned). Should we extend the documentation about NewRelic with this info?

Yeah, if we had gone with self-hosting, that would be better.

We were discussing replacing it for other reasons (like its changed pricing model and something else too, I cannot recall). However, after all the above, it may not be possible to easily move away from NewRelic at this moment.

antoviaque · May 3, 2025, 7:47am

@gabor I can completely relate to wanting to get rid of New Relic for their pricing techniques, where they definitely have unfriendly policies. I would also love to get rid of them one day, and give that budget to a nice open source developer and provider instead. But that change is work and will be competing with other projects, and should come with at least the same guarantees as New Relic - in particular, being run by a third-party company, and being tested to show at least similar reliability.

Btw what about the things that are currently flaky, could we have a task to look more deeply into the cause of this flakiness, if it’s not New Relic?

maxim · May 5, 2025, 8:53am

I’m 95% sure it’s NewRelic being flaky. We did take a deep look several times, and each time it appears that NewRelic skips a ping, which causes an alert. See this thread for an example - there you can see examples of both NewRelic being flaky and errors on the server side. TLDR is that when there are errors or a cluster is not reachable, there are data points on the NewRelic side that say that, but when there are no data points in the expected intervals, there are no explanations other than NewRelic being flaky with this type of monitoring.
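To make the "no data points in the expected intervals" case concrete, here is a minimal sketch of how skipped pings could be told apart from recorded failures. It assumes we can export the timestamps of a monitor's check results; the function name, the one-minute interval, and the tolerance factor are hypothetical and not part of any NewRelic API.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def find_gaps(
    timestamps: List[datetime],
    expected_interval: timedelta,
    tolerance: float = 1.5,
) -> List[Tuple[datetime, datetime]]:
    """Return (previous, next) pairs where no data point arrived in time.

    A gap means the monitor recorded nothing at all for that period,
    which is different from a data point that reports a failed check.
    """
    ordered = sorted(timestamps)
    max_distance = expected_interval * tolerance
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > max_distance:
            gaps.append((previous, current))
    return gaps


if __name__ == "__main__":
    start = datetime(2025, 5, 5, 8, 0)
    # Data points at minutes 0, 1, 2, 5 and 6: minutes 3 and 4 are missing.
    points = [start + timedelta(minutes=m) for m in (0, 1, 2, 5, 6)]
    for previous, current in find_gaps(points, timedelta(minutes=1)):
        print(f"no data between {previous} and {current}")
```

If an alert lines up with such a gap rather than with a failed data point, that points at the monitoring side rather than at the server.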

The reason it looks like this only happens with the instances hosted on grove-production-digitalocean is that all the other clusters (at least to my knowledge) did not switch to the tutor-contrib-newrelic plugin and instead use monitoring that was configured manually before, which is more reliable. The old monitoring configs and the new ones differ because, as Gabor said, we could not find a way to create the same monitoring configs via their GraphQL API. As Piotr mentioned, it should be possible to do it via Terraform, but AFAIK there is no way to make a Tutor plugin that adds custom Terraform configuration.

kshitij · May 5, 2025, 9:26am

I’d like to second this. I personally use a self-hosted Uptime Kuma for my own stuff and then use a hosted service to monitor Uptime Kuma. We’d definitely need to host it in an entirely different datacenter and even then, I think it is much too simple for our needs.

I’ve also used Sentry and loved it. Not sure what went wrong there, but I don’t think Uptime Kuma alone will cut it.

antoviaque · May 6, 2025, 2:59pm

@maxim Whatever the source of the flakiness is, it would be good to get to the bottom of that issue. Could you and @gabor create a task and agree on a scope for digging into this more and figuring it out?

gabor · May 6, 2025, 5:58pm

on this! However, I would leave it to @mtyaka if he would like to take it. Otherwise, I can do that.

antoviaque · May 12, 2025, 9:20am

@gabor No worries on my side if @mtyaka takes it, or you. Just let me know which task it is, so I can follow the updates?

gabor · May 15, 2025, 5:36am

@mtyaka, gentle ping on this.

mtyaka · May 19, 2025, 7:21am

@gabor @maxim I created SE-6462 and scheduled it for next sprint.