Resource Planning Stats Plugin

Hi @bebop !

Ticket to Log Time

After the sustainability reports came in, @antoviaque mentioned we could use some of the earned budget to work on tools or contributions that we think would provide value. I know we already have some upcoming plans (things like the Jira migration), but I wanted to propose an idea for a smaller thing that I think could have a big impact for our clients and for sales.

In marketing the platform, we speak of ‘monthly active users’ when trying to size hosting, and we currently estimate that our baseline package handles about 1000 monthly active users. However, there are some problems with this estimation:

  1. We don’t have a standard technical definition of what counts as a monthly active user.
  2. We don’t have any recent measurements confirming our intuition about how many monthly active users a standard installation supports.
  3. We know that the actual number of monthly active users supported by an instance can vary considerably based on factors like installed plugins, use of third-party services, or possibly even the choice of hosting provider when the specs otherwise look the same.

I want to propose a plugin for measuring how many ‘monthly active users’ an instance has. This plugin would establish a standard method of measuring that (which won’t be perfect, but can be consistent). We could use the gathered data to make predictions during sales in general, and more specifically to tell clients what their hosting costs will be as they expand.

It also gives us a good way to show what cost savings a team may get on hosting if they choose to pay for it directly rather than pay a per-user price, which is what most of our competitors do.

There’s no one client clamoring for this, but I think it could benefit all of them and future sales. What do you guys think? Is this something worth spending some of the earned budget on? Or would you rather it go elsewhere? Do you have any thoughts on how we could approach the measurement?

7 Likes

Good idea :+1:

Another metric I’m interested in is: “how many users the specific setup can handle concurrently” (preferably in a few different scenarios). After edx-load-tests was discontinued, we’ve done this a few times with k6, but having this standardized would be great for tailoring the auto-scaling sensitivity for clients.

3 Likes

Sounds great!

I think a plugin that tracks Monthly Active Users using a few definitions would be most helpful.

e.g.

  • Number of users who accessed the site while logged in during the month (basic MAUs)
  • Number of users who submitted at least one problem / triggered a score/grade event during the month (learning MAUs)
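Purely as an illustration of how those two definitions could be computed consistently, here is a minimal sketch over a toy event log (the event names and log shape are assumptions for the example, not the actual plugin design):

```python
from datetime import date

# Toy event log: (user_id, event_name, date). Event names are illustrative.
EVENTS = [
    (1, "page_view", date(2024, 5, 2)),
    (1, "problem_check", date(2024, 5, 3)),
    (2, "page_view", date(2024, 5, 10)),
    (3, "problem_check", date(2024, 4, 28)),  # previous month: excluded
]

def mau(events, year, month, names=None):
    """Count distinct users with a matching event in the given month."""
    return len({
        user
        for user, name, day in events
        if day.year == year and day.month == month
        and (names is None or name in names)
    })

basic_mau = mau(EVENTS, 2024, 5)                        # any activity
learning_mau = mau(EVENTS, 2024, 5, {"problem_check"})  # scored a problem
```

Parameterizing one function by event names is what would keep the different MAU flavors mutually consistent.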

However, wouldn’t it be better if we could retroactively compute this and compute it on finer scales (weekly etc.) using Aspects, rather than creating a new plugin that tracks these events? In other words, can we just configure this as an Aspects report, consuming existing data?

3 Likes

I like the idea. I think we can improve the way we monitor and track different stats. We do have Grafana and Prometheus, but we don’t use them much. And when it comes to tracking active users: as a client owner, I get questions about it a lot.

I would also like to volunteer to be one of the devs working on it; this seems quite interesting and refreshing.

1 Like

+1 to this; it would be important to develop it in a way that could be integrated upstream. Aspects is where user/engagement monitoring happens in the project.

This is very interesting. As @Agrendalath already suggested, if we can also arrive at a way to load test an instance, it will give us an edge over others. This will also paint a very honest picture of our clients and their needs.

Yeah, I’m game for that. That would make it much more upstreamable. One thing that makes me pause is that I don’t think all of our clients use Aspects yet, so not all of them would have access to that. However, I think it’s something we could pretty easily convince most of our clients to adopt-- it’s incredibly useful, and as we get better at rolling it out and the tooling improves, it gets easier to do.

I wonder if it should be part of our default install.

+1 on using Aspects.

Aspects already tracks “user activity” by looking at navigation events (I don’t remember where I found it, so I won’t be able to reference it right now, but I had to look it up for something). Here is the list (I think it’s full, but would need to double-check):

NAVIGATION_EVENTS = [
    "edx.ui.lms.sequence.next_selected",
    "edx.ui.lms.sequence.previous_selected",
    "edx.ui.lms.sequence.tab_selected",
    "edx.ui.lms.link_clicked",
    "edx.ui.lms.sequence.outline.selected",
    "edx.ui.lms.outline.selected",
]

Via that “proxy”, Aspects already has graphs that show user activity; you could just inject a custom data view that calculates it slightly differently and get an approximation of active users per month. I would say it’s good-enough™ for course authors, because it estimates learners’ “engagement”, but it’s not a good proxy for measuring performance.
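As a sketch of that “inject a custom data view” idea (the record shape here is an assumption for illustration; Aspects actually stores xAPI data in ClickHouse), approximating active users per month from those navigation events could look like:

```python
from collections import defaultdict

# Same event names as the NAVIGATION_EVENTS list above.
NAVIGATION_EVENTS = {
    "edx.ui.lms.sequence.next_selected",
    "edx.ui.lms.sequence.previous_selected",
    "edx.ui.lms.sequence.tab_selected",
    "edx.ui.lms.link_clicked",
    "edx.ui.lms.sequence.outline.selected",
    "edx.ui.lms.outline.selected",
}

def monthly_active_users(records):
    """records: iterable of (actor_id, event_name, "YYYY-MM-DD") tuples.
    Returns {"YYYY-MM": distinct actor count}, using navigation events
    as the activity proxy."""
    actors_by_month = defaultdict(set)
    for actor, name, day in records:
        if name in NAVIGATION_EVENTS:
            actors_by_month[day[:7]].add(actor)
    return {month: len(actors) for month, actors in actors_by_month.items()}
```

For example, `monthly_active_users([("a", "edx.ui.lms.link_clicked", "2024-05-01")])` returns `{"2024-05": 1}`.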

Aspects concentrates on “learner data” and “data analytics”. It doesn’t know about “performance”: the number of CPUs and amount of RAM on a node, or the number of open connections/requests. Yes, I think you can configure the data pipeline to send all the stats to ClickHouse and then display them via Superset, but the way I understand it, it was not intended for that.

If we want to track the load on the cluster and measure the number of requests, the performance of the nodes, CPU and RAM load, etc., then using Prometheus, Grafana, and/or maybe something else (i.e. the industry-standard tools for monitoring and analytics) would make more sense, in my humble opinion. I would ask @gabor to chime in here; perhaps he has more knowledge/context on the ecosystem.

So it’s a -1 on Aspects from me.

1 Like

@maxim You bring up a valid point about trying to overload Aspects to be used as an infra monitor. I wonder if we should think about it in terms of CPU and RAM at all. As things run on a cluster as pods, the unit of measurement then becomes the number of pods required to support 1000 learners, rather than the amount of CPU/RAM. In other words, node metrics from Prometheus and Grafana can help with vertical scaling of individual nodes, but it looks like we are interested in horizontal scaling of pods at this point.

In my head, when I think about monthly active users, the question is: “at what point does it become necessary for an instance to move from one LMS pod to two?”. For that, an Aspects-based measurement looks sufficient, as I imagine these are going to be measured over a longer period of time, like weeks/months.
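To make that question concrete, the naive version of the rule is just linear capacity math (the 1000-learners-per-pod figure is only the thread’s working assumption, not a measured number):

```python
import math

def lms_pods_needed(monthly_active_users, mau_per_pod=1000):
    """Naive linear estimate: assumes load scales linearly with MAU,
    which is exactly the assumption the proposed measurements would test."""
    return max(1, math.ceil(monthly_active_users / mau_per_pod))
```

Under that assumption, an instance crosses from one LMS pod to two at 1001 monthly active users; the real value of the measurement work would be replacing the assumed `mau_per_pod` with observed data.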

The kind of measurement you mention is probably what’s needed for Piotr’s suggestion here:

Both measures together can give a pretty nice picture.

Hi all! I got pointed here and can speak to a couple of things that may inform the discussion. The first is that there is an existing model / chart in the Operator Dashboard for active users (defined as “distinct users who have any xAPI activity”) that can allow you to choose a time grain including month. The Operator Dash hasn’t gotten the love that the others have, but it could! If folks are interested in this kind of instance-level reporting, make sure to let Chelsea know.

The other bit is more speculative, but I had some time over the holidays and started trying to figure out how we could start to collect and measure telemetry data for the whole stack. I didn’t get super far into it, but far enough to say that it’s a thing we could do using OpenTelemetry (which has a basic implementation in GitHub - openedx/edx-django-utils: edX utilities for Django Application development. and is generally well supported in other products), an OTel collector that talks to ClickHouse, and a UI for displaying the metrics. The incredibly rough, hard-coded to my environment, but semi-functional PoC is here: GitHub - bmtcril/tutor-contrib-clickstack: A highly speculative Tutor plugin proof-of-concept for collecting and displaying telemetry data

My experiment was based on wrapping ClickStack (a ClickHouse project using OpenTelemetry and HyperDX) in a Tutor plugin that also attempted to configure LMS/CMS to send OTel events. HyperDX proved the hardest part of that, and it has a dependency on Mongo that doesn’t thrill me. We could probably eliminate that in favor of just using the OTel Collector and an Aspects-style dbt & Superset implementation, if that was a thing people wanted to invest in.

In any event, I think this could be a legitimate use case for an Aspects plugin and would be happy to work with folks to move it forward if you like! I’ve spent a fair amount of time noodling on it and have a few opinions.

5 Likes

Hi Ty! If someone else hadn’t looped you in, I was liable to soon :slight_smile:

Sweet. That definitely sounds like what we’re looking for on the ‘monthly active users’ front.

It’s excellent that you’ve already invested some time researching this. It does sound like you’re on the right track for what we want to build here. Does this help with measuring the resource usage of all Open edX components, including things like Redis for message queuing and what have you, or just the main LMS backend? If not, could it be effectively extended to get a good sense of resource usage across an Open edX instance’s components?

1 Like

OTel is designed to be extensible, and there are a whole bunch of ways to get data from the different services; the ecosystem is pretty robust. There are several different ways of integrating, so we could collect logs, metrics, and/or spans, and present ways to dig through logs, view performance metrics, or show complicated spans (ex: this Django request spent this long in this function, this long waiting for Redis, this long waiting for MySQL). Spans are for finding long-tail requests, requests that grow under load, or figuring out where your slow requests are spending their time.
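To illustrate what span data captures (a toy, stdlib-only sketch of the concept; this is not the OpenTelemetry API), per-section timings inside a request might be recorded like:

```python
import time
from contextlib import contextmanager

spans = []  # collected (name, duration_in_seconds) pairs

@contextmanager
def span(name):
    """Record how long the wrapped block took, like a trace span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))

# Nested spans show where a request spends its time.
with span("django_request"):
    with span("redis_wait"):
        time.sleep(0.01)  # stand-in for waiting on Redis
    with span("mysql_wait"):
        time.sleep(0.02)  # stand-in for waiting on MySQL
```

Real OTel spans additionally carry trace/span IDs and attributes so they can be stitched together across services, which is what makes the long-tail analysis above possible.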

Of note, the current OTel implementation in edx-django-utils doesn’t yet handle spans, but it should be fairly easy to add and then all of the existing spans that 2U has added for Datadog and New Relic should Just Work :trade_mark:.

There are receivers for Redis, Mongo, MySQL, various k8s things, Kafka, Elasticsearch, etc. Currently absent from that list are Caddy and Meilisearch, but Caddy has OTel traces and Meilisearch has a metrics endpoint that could be used to populate metrics data.
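For a sense of what wiring a couple of those receivers into such a pipeline involves, an OTel Collector config might look roughly like this sketch (the receiver and exporter names come from the opentelemetry-collector-contrib distribution; the endpoints and credentials are placeholders, and the exact fields each component accepts should be checked against its docs):

```yaml
receivers:
  redis:
    endpoint: "redis:6379"
  mysql:
    endpoint: "mysql:3306"
    username: otel_monitor  # placeholder credential

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000

service:
  pipelines:
    metrics:
      receivers: [redis, mysql]
      exporters: [clickhouse]
```

This is a sketch, not a tested config; each receiver's options are documented per component in the collector-contrib repo.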

I don’t think Superset is a great tool for sorting through large volumes of logs or displaying span data, but I think it would work well for aggregating and displaying metrics across the various systems, and that would be the best place to start. To get into the functionality of the Datadogs of the world, we would probably be back to looking at something like ClickStack, which is purpose-built for that kind of thing.

1 Like

@tyhob is right that we haven’t given the same attention to the Operator Dashboard that we’ve given to the other out-of-the-box Aspects dashboards. Improving on this dashboard is something we could (and likely should) explore in the future. This is very likely not something we can focus on delivering for this coming release, but it could be something we start thinking about in terms of conducting discovery interviews and defining product requirements. Our immediate focus for Aspects is defining dashboard requirements for the data educators and admins will need once Learning Pathways are developed, but this may be very much worth exploring after the Aspects for Learning Pathways requirements are defined. Is this something your team is actively planning to devote effort to soon, for a client or for the platform at large? Feel free to follow up if you have any questions/comments! @Fox

1 Like

Hi @chelsearathbun ! We’re mostly discussing allocating some of our own internal budget to hit this in order to scratch our own itch-- we want better data on actual hardware requirements vs active users, both for marketing and future planning purposes. However, none of our clients is currently asking for this. If it had to be delayed a little bit in order to accommodate your bandwidth, I imagine that would be fine, but we’re happy to start soon as well-- we’d just need to run discovery to figure out the effort size so we can figure out what’s feasible.

If the proper place to hook in this data is the Operator Dashboard, we’d want to do that, so the community at large could benefit.

I think that we would want to make this a plugin and not put it in the Operator Dash, since it places a non-trivial burden on the system (running at least one additional service and collecting/processing a potentially large amount of data). A fair number of operators already use other solutions and wouldn’t want the overhead.

Speaking of which, I’m not sure if you’ve seen this but the shortest path for just testing things and getting performance numbers without having to productize anything might be the Scout plugin on a free account: openedx-tutor-plugins/plugins/tutor-contrib-scout-apm at main · openedx/openedx-tutor-plugins · GitHub

1 Like

Thank you for your response @Fox! Great to hear OpenCraft is able to devote some time and thought to this.

And thank you for flagging your concerns about burdening the system if we were to fold this work into the Operator Dashboard, @tyhob.

@Fox - is the goal simply to display a monthly active users metric, or are there additional related metrics OpenCraft is hoping to add as well?

Forgive the question if it’s an insane one, but I’m curious whether this plugin could be designed in such a way that, if enabled, the user could view the monthly active users metric in the Operator Dashboard alongside the other data about their instance. Or would the plugin have to live separately from the Operator Dashboard?

Sorry @chelsearathbun, there are two things getting conflated here. There’s the MAU, which already exists in the Instructor Dash and may or may not be exactly what OC needs. Then there is the metrics gathering for system performance (CPU, load, memory, etc.), which is what I was suggesting might make a good plugin.

Due to the way Superset is designed we would probably need to make a separate dashboard for any plugins.

Ah, got it - yes, I mashed the two together. This makes sense. It’s unfortunate that the plugin would have to live in a separate dashboard entirely, since both would presumably be for the same user in most cases, but I figured that might have to be the case.

@tyhob Thank you for sharing this! This looks very interesting and at least scratches my itch, I might experiment with it on a real instance.

1 Like