Should we bring back our "Maintained Instances" reference?

Lately, when we’ve been paged for incidents, I’ve found it hard to find information about each client’s setup: what specifically we’re hosting for them or are responsible for, how to connect, what services it uses, and sometimes even which client a given site belongs to. Part of the reason is that the information is spread across a few different places, like the CRM, JIRA, etc.

However, in the past we had a Maintained Instances document (private link for OpenCraft team) that listed all of this information in one place. This made fire fighting much easier. I’m thinking we should bring that back, as long as we can avoid making it redundant with other info we have in other places. What do people think?

CC @samuel

Ticket: MNG-4924

I really like this idea! I think in the past we discussed automating some of this, and I think that would be a good idea as well. I think each client instance should have a readme file with all the relevant info, and we can have some automation to pull all of these into a single place.

We can also consider what’s used for Open edX repos: a catalog-info.yml file, which in our case could document each instance, plus a system to pull all of this into a single searchable place.
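To make the "pull it all into a single searchable place" idea a bit more concrete, here is a minimal sketch in Python. It assumes each instance's metadata file has already been parsed into a plain dict; the field names (`client`, `urls`, `services`, `maintenance_epic`) and the example entries are hypothetical placeholders, not an existing schema.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceRecord:
    """One hosted instance. Every field name here is a hypothetical choice."""
    client: str
    urls: list[str]
    services: list[str] = field(default_factory=list)
    maintenance_epic: str = ""


def build_catalog(raw_entries):
    """Turn per-instance metadata dicts into records sorted by client name."""
    records = [InstanceRecord(**entry) for entry in raw_entries]
    return sorted(records, key=lambda r: r.client.lower())


def search(catalog, term):
    """Return records where the term appears in any field (case-insensitive)."""
    needle = term.lower()
    hits = []
    for record in catalog:
        haystack = [record.client, record.maintenance_epic,
                    *record.urls, *record.services]
        if any(needle in value.lower() for value in haystack):
            hits.append(record)
    return hits


# Placeholder data standing in for the parsed contents of each instance's file.
EXAMPLE_ENTRIES = [
    {"client": "Example University",
     "urls": ["lms.example.edu", "studio.example.edu"],
     "services": ["mysql", "redis"],
     "maintenance_epic": "EPIC-1"},
    {"client": "Acme Training",
     "urls": ["learn.acme.example"],
     "services": ["mongodb"],
     "maintenance_epic": "EPIC-2"},
]

catalog = build_catalog(EXAMPLE_ENTRIES)
```

Calling `search(catalog, "redis")` would then return only the record that lists that service. The per-instance files stay the source of truth; the catalog is just a derived view that can be regenerated at any time.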

I believe the idea behind BB-7618 was to move all such information to the maintenance epic of each client (which is usually the correct place to log time). The problem is discoverability (which has been a recurring issue since we migrated from documentation → Monday → CRM → Jira). A central place (linked in the alerts channel) with at least links to these epics, URLs to the instances (for quick searching when there is only the URL in the alert), and contact people could be a good start here.

Also, it would probably be a good idea to review the epics to ensure they include sufficient details about the client’s instances. Here are my maintenance epics: BB-352 + BB-3154.
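As a rough illustration of the "quick searching when there is only the URL in the alert" case, a central index could be as simple as a hostname-keyed mapping. This is only a sketch: the hostnames, epic keys, and contact names below are made-up placeholders, and `INSTANCE_INDEX`, `hostname_of`, and `lookup` are hypothetical names.

```python
from urllib.parse import urlsplit

# Hypothetical index: hostname -> (client, maintenance epic, contact person).
INSTANCE_INDEX = {
    "lms.example.edu": ("Example University", "EPIC-1", "owner-a"),
    "studio.example.edu": ("Example University", "EPIC-1", "owner-a"),
    "learn.acme.example": ("Acme Training", "EPIC-2", "owner-b"),
}


def hostname_of(url_or_host):
    """Extract a lowercase hostname from a full URL or a bare host[:port]."""
    text = url_or_host.strip()
    if "//" not in text:
        # urlsplit only recognises the host part after a "//" prefix.
        text = "//" + text
    return (urlsplit(text).hostname or "").lower()


def lookup(url_or_host, index=INSTANCE_INDEX):
    """Return (client, epic, contact) for an alert URL, or None if unknown."""
    return index.get(hostname_of(url_or_host))
```

For example, `lookup("https://lms.example.edu/courses")` resolves to the Example University entry, while an unlisted host returns `None`, which is itself a useful signal that the index is missing something.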

The CRM was supposed to be the starting point here: you’d go to the list of clients, click on their briefings, and then check the maintenance epic for each one you needed to look at.

However there are two problems:

  1. This requires too many clicks to see the relevant info when we’re just trying to find all the instances at once.
  2. The CRM ate itself some time ago and we’ve been using a set of spreadsheets with the data it used to contain in the meantime, so this became even less user-friendly (which, if you knew the CRM, is saying something).

I’m in favor of finding some way to gather together the instances information. Even if that’s as simple as ‘yoink all the descriptions from all the active maintenance epics and yeet them into a protected file somewhere’.

+1 to this! I’m also after something similar for our repos in general too, to find info about who maintains it, what ticket general maintenance falls under, etc.

My 2c is that the difficult part of this is keeping that info up to date and discoverable. We’d need the information as close to the source as possible (e.g. in the repo/readme, in the Jira tickets) so that maintainers see it and keep it up to date, but we’d then also need a way to find it (difficult when it’s spread over many Jira tickets that are hard to find; some central spreadsheet-like thing would make it easier to find the info). @Fox 's comment feels on point:

As does @kshitij 's:

Also, if we’re using automation to pull things together (not that I’d want to overengineer it…), the client instances already have structured data with domains, services deployed, etc. that we could use.


Something that I’ve wanted here is more information in the alert itself - for some alerts it’s even impossible to tell whether it’s the staging or prod environment.

This is a different conversation, but I also believe we have an opportunity for a wider epic to overhaul our monitoring/alerting systems, as we can definitely improve the alert information, the reliability of alerting (note SE-6610), the workflow of alerts (acking/snoozing behaviour, etc.), and so on.

I think these two things will be sufficient actually, and we don’t really need automation for it (for now), as it isn’t something that changes too often. Is anyone interested in taking this on, and making a simple new page in our private docs that links out to all the epics?

We already have something like that. The description of the Hosted Sites Maintenance lists all hosting clients and links out to the corresponding maintenance epics/tickets.

Checking it once per sprint and updating it as necessary is part of the responsibilities of the epic planning role.

From my perspective, making sure that the Hosting Details section in each maintenance epic/ticket stays up-to-date (and includes sufficient details in the first place, as @Agrendalath mentioned) seems to be the bigger issue.

We have a process now where each upgrade epic includes a ticket for client owners to update that section after completing the corresponding upgrade. So in theory, there are already two opportunities per year for reviewing and adjusting the info. If FFs still find that relevant info tends to be missing, maybe we need to review the current format of the Hosting Details section and come up with a new template that client owners can reference to make sure they’ve included all necessary details?

@tikr That epic description doesn’t have the URLs, which is a key point. It’s not always obvious which site maps to which client, or what other related services are also part of the maintenance for that client.