Replacing NewRelic with Uptime Kuma (for Synthetic checks)

As we have discussed many times over the years, NewRelic is troublesome from time to time: for example, the access changes, or the continuously flaky monitoring alerts that turn out to be false positives.

Although we use it for performance monitoring in some cases (which will hopefully be replaced by OpenTelemetry on edX’s side), we could still replace the synthetic monitors.

Why?

Flaky monitoring, false-positive alerts at night, a hard-to-use UI, and the list could go on.

Replace with What?

Serenity did a discovery on possible replacements, and Uptime Kuma seems to be the best option.

Uptime Kuma is an open-source monitoring tool with an easy-to-use UI. We would self-host it.

What would change?

First, and most obviously, the place where we check the health statuses. Besides that, we need to create a new Tutor plugin to auto-create the monitors for an instance. Other than that, not much. The software would report directly to OpsGenie, so we could keep using the tooling we are used to.
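To make the plugin idea a bit more concrete, here is a minimal sketch of how monitor definitions could be derived from an instance's Tutor config. It assumes the `LMS_HOST`, `CMS_HOST`, and `ENABLE_HTTPS` config keys and the `/heartbeat` endpoint; the `build_monitors` function and the monitor field names (`name`, `url`, `interval_seconds`) are hypothetical and not taken from Uptime Kuma or from any existing plugin. How the definitions would then be pushed to the monitoring instance is left open.

```python
def build_monitors(config: dict[str, object]) -> list[dict[str, object]]:
    """Build one heartbeat monitor definition per configured host.

    The returned dict shape is hypothetical; it is not an Uptime Kuma schema.
    """
    scheme = "https" if config.get("ENABLE_HTTPS") else "http"
    monitors: list[dict[str, object]] = []
    for key in ("LMS_HOST", "CMS_HOST"):
        host = config.get(key)
        if not host:
            # Skip services that are not configured for this instance.
            continue
        monitors.append(
            {
                "name": f"{host} heartbeat",
                "url": f"{scheme}://{host}/heartbeat",
                "interval_seconds": 60,  # assumed default, not a recommendation
            }
        )
    return monitors


# Example with made-up hostnames.
example = build_monitors(
    {"LMS_HOST": "lms.example.com", "CMS_HOST": "studio.example.com", "ENABLE_HTTPS": True}
)
```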

When would the change happen?

I’m in the process of setting up the monitoring instance in a DigitalOcean region where we have no instances. However, the Tutor plugin will not be created before the next sprint, which is when we plan to work on it.

Log time to: SE-6167

@gabor, while I agree that NewRelic’s UI is terrible and there are other problems with it (including, probably, price?), it has proven to be very stable for some of our clients. In fact, I’ve never seen any false-positive alerts for HMS (@Agrendalath, please correct me if I’m wrong). I’ve replicated HMS’s setup to monitor Aquent’s instance and it seems to be stable too. I’m not entirely sure, but I think the key difference between how we monitor instances in our shared cluster and how HMS is monitored is that the latter is pinged from three different locations instead of one and has a different alert condition type.
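To illustrate why pinging from several locations can cut down on false positives, here is a minimal sketch of a quorum-style condition. It is only an illustration of the idea, not New Relic's actual alert condition type; `CheckResult`, `should_alert`, and the location labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    location: str  # hypothetical label for where the ping came from
    ok: bool       # whether the ping from that location succeeded


def should_alert(results: list[CheckResult], min_failing_locations: int) -> bool:
    """Alert only when at least `min_failing_locations` distinct locations fail."""
    failing = {r.location for r in results if not r.ok}
    return len(failing) >= min_failing_locations


results = [
    CheckResult("location-a", ok=False),  # e.g. a transient network blip
    CheckResult("location-b", ok=True),
    CheckResult("location-c", ok=True),
]

# With a single location, one failed ping is enough to alert.
single_location_alert = should_alert(results[:1], min_failing_locations=1)
# With three locations and a quorum of two, the same blip is ignored.
multi_location_alert = should_alert(results, min_failing_locations=2)
```

With a single location, any transient network problem between that location and the instance looks like an outage; requiring agreement between locations filters those out.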

So you are saying that the flaky behavior basically isn’t experienced there because of the multi-regional monitoring? I was under the impression that all Grove-based instances use the tutor-contrib-newrelic plugin, so they are registered the same way everywhere. What’s the difference then? Does the client use synthetic monitoring or something else?

I could be wrong, but I remember that when we were testing the tutor-contrib-newrelic plugin, I spotted a difference (probably the same one that @demid is referring to) between the monitoring that was set up by hand and the one created by the plugin. The reply, as I remember it, was that there is no way to create the same config via the NewRelic API.

Yep. But there are also other differences in the config. You can compare it yourself: 1, 2.

No, Aquent’s instance is Grove-based and isn’t using it. It’s using the same config as HMS.

Correct - this monitoring is reliable for HMS instances and has been useful many times, so I’d be cautious about replacing it.

@gabor, other than the APM you already linked, I’m also using Errors (the error inbox) to check exceptions without logging into every instance from the ASG separately and checking the individual logs of each service we have in the supervisor (we currently run 23 services). I used to use our ELK stack for this, but we removed it some time ago.

Since we last considered self-hosting Sentry, it has become much more challenging. To replace New Relic fully, we should use a service that provides APM and error tracking, like GlitchTip. Alternatively (e.g., if OpenTelemetry is already supported), we could use Bugsink for error tracking (though this one does not expose a REST API).

That said, Uptime Kuma looks nice. I like that we can specify a proxy for the monitors, allowing us to set up multi-regional uptime checks. What slightly worries me is the lack of an API: it is an SPA that uses websockets for all interactions. We would need to manage all monitors manually.
If this is problematic, and we would like to use GlitchTip, we could look into its built-in uptime monitoring. It has fewer features, but the backend runs on Django, so we could implement and contribute things like configurable retries, if needed.
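By "configurable retries" I mean something along these lines. This is a minimal sketch of the idea only, not GlitchTip code; `is_down` and `probe` are hypothetical names, and a real implementation would wait between attempts.

```python
from typing import Callable


def is_down(probe: Callable[[], bool], retries: int) -> bool:
    """Report "down" only if the first probe and all `retries` follow-ups fail.

    `probe` is any callable that returns True when the check succeeds.
    """
    for _ in range(retries + 1):
        if probe():
            # A single success is enough to treat the target as up.
            return False
    return True


# A flaky target: the first check fails, the retry succeeds.
outcomes = iter([False, True])
flaky_result = is_down(lambda: next(outcomes), retries=1)

# A target that is really down fails every attempt.
down_result = is_down(lambda: False, retries=2)
```

With `retries=0` this degrades to alerting on the first failed check, which is roughly the behavior that produces noisy alerts on a flaky network path.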

It should be possible. I have these monitors terraformed, so they are configured via the New Relic API.

If it ain’t broke :slight_smile: It might be worth digging into the source of the flakiness btw then, if it’s not New Relic.

Also - we definitely don’t want to host our alerting & monitoring tools; they are part of the last line that ensures we can detect issues quickly. We want to minimize the possibility of the monitoring tool going down during an outage, and aside from picking a different datacenter, one of the best ways to ensure this is to have it managed by another organization entirely.

tl;dr: There were some misunderstandings. In this case, I believe we should give NewRelic another shot.

I recall something, too, but cannot remember the exact details. I remember that the GraphQL API that NewRelic is pushing cannot do everything. The old API may have been deprecated too, but I’m not sure right now.

@Agrendalath Great to know about this. Unfortunately, I found no information about other uses of NewRelic (like the one you mentioned). Should we extend the NewRelic documentation with this info?

Yeah, if we had gone with self-hosting, that would be better.

We were discussing replacing it for other reasons (like its changed pricing model and something else too, I cannot recall). However, after all the above, it may not be possible to easily move away from NewRelic at this moment. :sweat_smile: