Closed Bug 1741894 Opened 3 years ago Closed 2 years ago

MissionControl v2 doesn't list crash rate summary and crash incidence summary for beta

Categories

(Data Platform and Tools :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Unassigned)

Details

  1. Open https://metrics.mozilla.com/public/sguha/mc2/missioncontrol_v2.html
  2. Click on the "Beta" tab.

The "Crash Rate Summary" and "Crash Incidence Summary" tables are empty but should be populated, as they are for "Release" and "Nightly". The page source at lines 9951 and 9952 shows that the data is empty.

Could anybody check on the server side what's broken, please?

I was able to replicate this issue when I first checked, but it looks like it is resolved now. It is related to timing issues between the two jobs mentioned in bug 1658816 comment 3.

Job 1 completed at 15:37 UTC, but Job 2 started its first run at 15:00 UTC, so it rendered the page with incomplete data. Since Job 2 is configured to run multiple times, it resolved itself at 16:02 UTC.
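To make the ordering problem concrete, here is a minimal sketch of the check implied above: a Job 2 run only sees complete data if it starts at or after the time Job 1 finished. The function names are hypothetical and are not taken from the actual jobs. The sketch also assumes that all times fall on the same UTC day and that 16:02 was the start of a later Job 2 run, which the comment does not state explicitly.

```python
from datetime import time

# Times of day in UTC, taken from the comment above.
JOB1_COMPLETED = time(15, 37)
# Assumption: 16:02 is treated here as the start of a later Job 2 run.
JOB2_RUNS = [time(15, 0), time(16, 2)]


def saw_complete_data(job1_completed, job2_started):
    """Return True if this Job 2 run started at or after Job 1 finished."""
    return job2_started >= job1_completed


def first_complete_run(job1_completed, job2_runs):
    """Return the earliest Job 2 run that saw complete data, or None."""
    for started in sorted(job2_runs):
        if saw_complete_data(job1_completed, started):
            return started
    return None


if __name__ == "__main__":
    for started in JOB2_RUNS:
        complete = saw_complete_data(JOB1_COMPLETED, started)
        status = "complete" if complete else "incomplete"
        print(f"Job 2 run at {started:%H:%M} UTC: {status} data")
```

With the times from this bug, the 15:00 run is flagged as having incomplete data, and the 16:02 run is the first one that could have rendered the Beta tables correctly.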

Any idea why this started now? It was also observed this Tuesday at ~17:50 UTC, but not in previous years.

Component: Datasets: General → General

We're going to be decommissioning this version of mission control this half: https://mozilla-hub.atlassian.net/browse/DSRE-671

We're going to mark this as resolved, and if the eventual consistency doesn't work for releng we can reopen.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

MissionControl v1 is getting decommissioned, not v2?

Flags: needinfo?(fbertsch)

AIUI we will be decommissioning both, replaced with something new. Unfortunately that ticket is not very informative. Rob, can you confirm that we're planning on deprecating both v1 and v2 mission control?

Flags: needinfo?(fbertsch) → needinfo?(rmiller)

We're very soon going to dig in to assess the overall Mission Control situation. I'll summarize what I know so far and some of our constraints here, though, as a starting point.

My understanding is that MCV1 is pretty straightforward, in terms of the functionality and the computation involved. Unfortunately, it's built on a data pipeline infrastructure that is no longer in use and that we can't support; Data SRE won't even touch it at this point. We suspect that the OpMon telemetry monitoring infrastructure we've built can be used to provide similar functionality that will ultimately result in a Looker dashboard that meets the need. It wouldn't be a straightforward, self-serve style OpMon project as is described in the docs, and may even require some changes to OpMon itself, but we think we can get something that will work.

MCV2 is a one-off tool that was built by a data scientist who is no longer here at Mozilla. It's hosted in that user's personal folder on a utility server that used to be a core part of the Data Science team's infra, but which is no longer in use. It is a much more sophisticated beast, using (if I understand correctly) machine learning to try to predict what certain metrics should look like based on past trajectories, and flagging deviation from that prediction. I believe the real meat of MCV2's implementation is in a computational notebook of some sort.

The sad truth is that Data Engineering can't support this, either. Digging into the math at MCV2's core would require a skilled data scientist (any mistake would make the results untrustworthy), and it's never easy to get DS help; they're stretched really thin. I believe the server that is hosting the app is slated to be shut down. If there's another home available somewhere, we'll happily make all of the MCV2 code available and assist however we can in getting it set up and working in another environment, but it would be a hand-off.

Again, I'm not sure about many of the things I've said here; we've only looked at things superficially and are planning to dig in more deeply in the coming weeks. I'll also do what I can to see if any DS support might be available. But this is my best understanding of the situation.

Flags: needinfo?(rmiller)