Closed Bug 1297146 Opened 8 years ago Closed 7 years ago

Write a dashboard like arewestableyet.com, but using telemetry data

Categories

(Toolkit :: Telemetry, defect, P2)

RESOLVED FIXED

People

(Reporter: chutten, Assigned: chutten)

Details

(Whiteboard: [fce-active-legacy])

arewestableyet.com uses socorro data. Socorro data is changing as e10s is being rolled out because submission rates on content and parent are wildly different.

Relman needs a stability dashboard that can do what arewestableyet.com does, but using telemetry data.
Some basic requirements for this dashboard would be:
1. To have thresholds for acceptable crash rates per channel and color code rates as red, yellow, green (much like arewestableyet.com)
2. Same as #1 but granular thresholds for browser, content and plug-in crashes.
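A minimal sketch of requirements 1 and 2, assuming crash rates are expressed as crashes per 1000 usage hours. The threshold values and the names `THRESHOLDS` and `classify_rate` are hypothetical placeholders for illustration, not numbers agreed on in this bug.

```python
# Hypothetical thresholds: (green_max, yellow_max) in crashes per 1000 usage
# hours, keyed by (channel, process type). Placeholder numbers only.
THRESHOLDS = {
    ("release", "browser"): (1.0, 2.0),
    ("release", "content"): (2.0, 4.0),
    ("release", "plugin"): (0.5, 1.0),
    ("beta", "browser"): (2.0, 4.0),
}


def classify_rate(channel, process, rate):
    """Return 'green', 'yellow', or 'red' for a given crash rate."""
    green_max, yellow_max = THRESHOLDS[(channel, process)]
    if rate <= green_max:
        return "green"
    if rate <= yellow_max:
        return "yellow"
    return "red"
```

Keying the table on both channel and process type lets one lookup cover the per-channel thresholds and the more granular per-process ones.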

I'll just throw in the need for other dashboards (probably linked from this main one) which show stability trends on a channel across multiple versions. For example, Aurora47 vs Aurora48 vs Aurora49 vs Aurora50.
I think we absolutely need the number of startup crashes (probably divided by ADI rather than by khours, since I think dividing by khours is almost meaningless here). Conversely, I think shutdown crashes are not so important for stability.
Startup crashes are not presently captured well by Telemetry-based datasets because of how Telemetry sends data: the previous session's data is only sent about 60s after the next successful launch, so if a recurring startup crash doesn't permit startup to complete, we have nothing. We have ideas on how to fix that, though. The plans are waiting on the outcome of the client-side stackwalking features.

Interesting point about startup crash rates. I think per kusagehour will do just fine since we collect kusagehour from the whole population, not just the crashing clients. I can't wait until we get startup crashes from Telemetry so we can put it to the test.
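As a small illustration of the "per kusagehour" normalization, here is a sketch that divides a crash count by thousands of usage hours. The function name and the sample numbers are hypothetical; the point is only that the denominator comes from the whole population's usage, not just from clients that crashed.

```python
def crashes_per_kusagehour(crash_count, usage_hours):
    """Crash rate per 1000 usage hours.

    usage_hours is the total across the whole population (an assumption
    about how the input is aggregated), not only crashing clients.
    """
    if usage_hours <= 0:
        raise ValueError("usage_hours must be positive")
    return crash_count / (usage_hours / 1000.0)
```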

This dashboard is more of an exploration of the data (and its limitations) and how we can display it in a useful manner. arewestableyet will have to do most of the decision-making heavy lifting for a while.
Adding my comments from the team meeting today - my requests maybe aren't so basic...could we have something like:

(1) A way to show crashes that move up to the top 5 rapidly/crash spikes
(2) Platform specific crash breakdown for Desktop (Windows/Mac/Linux)
(3) Android Version specific crash information for Mobile - this is useful to see when a new OS is released, like Android 7.0

I like the idea of seeing the data graphically or in some other way that makes it more impactful than it currently is on arewestableyet. Kairo did this (https://crash-analysis.mozilla.com/rkaiser/crash-report-tools/longtermgraph/), but even that I find a bit noisy. Still, this is the type of graphic that would be useful to show at the weekly Channel meetings.
Whiteboard: [fce] → [fce-active]
For speed and comparison purposes, I think it's important to start with analogous graphs. We don't have a good way to draw correlations between Socorro-based and telemetry-based ones right now.

However, it would be great to keep track of needs/ideas in this bug until such time as we can extend whatever arises from it.
So... here's what I've got so far. Who wants to take an early gander?

Dashboard: https://chutten.github.io/telemetry_crashes/ 
Code: https://github.com/chutten/telemetry_crashes

A few notes:
* There are no green/yellow/red thresholds for the numbers as of yet. It won't take much to add, once I figure out what they're supposed to be. Consider this a TODO.
* I'm not filtering by version, and I probably should be. Consider this a TODO.
* The query's currently limited to just the last 15 days to keep its runtime short and JSON small. I'm not sure what sort of interest there is in ultra-historical data.
* The graphs really illustrate the inflation problem we have with how much the kusagehours metric is lagging behind the crash reports.
* Though I thought I hooked this up to a query that refreshes on the regular, apparently it doesn't, so the dashboard isn't auto-updating as of yet. I am manually running it for the nonce.
This would be good for relman to look at now, since there's enough flesh on these bones to consider. NI Sylvestre (& team) to look at this and give some feedback.
Flags: needinfo?(sledru)
A suggestion, could we automatically detect and remove the points where data is not fully available yet? As it is now, there's always a huge spike at the end that hides differences in the rest of the graph.
It looks good in terms of what we need.
Two questions:
* Can we get a longer period?
* Can we get startup crashes?
Flags: needinfo?(sledru)
(In reply to Marco Castelluccio [:marco] from comment #8)
> A suggestion, could we automatically detect and remove the points where data
> is not fully available yet? As it is now, there's always a huge spike at the
> end that hides differences in the rest of the graph.

We never have "fully available data", sadly. There are always crash pings and crash reports and main pings and everything that users just haven't gotten around to submitting just yet.

But I can put in some sort of threshold on not graphing rates on days when the kusagehours are less than, say, 60% of expected. Which will likely do close to the same thing :)
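A rough sketch of that threshold, under loud assumptions: "expected" is not defined in this bug, so the code below takes it to be the median kusagehours over the displayed window. The names `graphable_rates` and `min_fraction` and the input shape are hypothetical.

```python
from statistics import median


def graphable_rates(days, min_fraction=0.6):
    """Return (date, rate) pairs; rate is None for days that should not be graphed.

    days: list of dicts with 'date', 'crashes', and 'kusagehours' keys.
    ASSUMPTION: 'expected' kusagehours is the median over the whole window.
    """
    expected = median(d["kusagehours"] for d in days)
    result = []
    for d in days:
        kuh = d["kusagehours"]
        if kuh <= 0 or kuh < min_fraction * expected:
            # Too little usage data has arrived yet: skip the rate for this day.
            result.append((d["date"], None))
        else:
            result.append((d["date"], d["crashes"] / kuh))
    return result
```

Returning `None` rather than dropping the day keeps the x-axis continuous, so a charting layer can leave a gap instead of drawing a spike.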

(In reply to Sylvestre Ledru [:sylvestre] from comment #9)
> It looks good in terms of what we need.
> Two questions:
> * Can we get a longer period?
> * Can we get startup crashes?

* Nothing is stopping me from expanding the period in the graphs. I just need to change the query. In what way would that be helpful? What questions could you answer with longer timeframes that aren't answerable with shorter ones? (I want to figure out how I'd display it. A lot of nuance gets lost if you compress a long timeframe into a narrow column of pixels.) How far into the past do you want to see?

* crash_aggregates currently doesn't separate out startup crashes (bug 1306013). Also, since it is fueled by pings, we can't get the unrecoverable startup crashes, because those prevent the user from starting Firefox for long enough for the pings to be sent. Getting those (and getting other browser crashes faster) is a primary focus of :ddurst's FCE team, and is the subject of a few gdocs and more than a few email threads that will be turned into bugs real soon now.
(In reply to Chris H-C :chutten from comment #10)
> (In reply to Marco Castelluccio [:marco] from comment #8)
> > A suggestion, could we automatically detect and remove the points where data
> > is not fully available yet? As it is now, there's always a huge spike at the
> > end that hides differences in the rest of the graph.
> 
> We never have "fully available data", sadly. There are always crash pings and
> crash reports and main pings and everything that users just haven't gotten
> around to submitting just yet.
> 
> But I can put in some sort of threshold on not graphing rates on days when
> the kusagehours are less than, say, 60% of expected. Which will likely do
> close to the same thing :)

Yeah, what I meant with "fully available data" was actually "reasonable data" :)

What I've noticed so far is that on any given day X, we have "reasonable data" for X-3, about 15% less than "reasonable data" for X-2, and "unreasonable data" for X-1.
Depends on: 1306013
I've updated the dashboard with what I think are all of the requested changes and TODOs. Please take a look and let me know of any feedback or questions you have: https://chutten.github.io/telemetry_crashes/
Flags: needinfo?(sledru)
Flags: needinfo?(mcastelluccio)
I think one of the biggest questions is the limits used on https://arewestableyet.com/ that also determine the coloring -- they seem to be hard-coded (and not recently updated). This new dashboard is attempting to convey the same sense of comparison from week-to-week (with the percentages in parentheses, at least).

Seems like we should agree on the point of comparison and also confirm that the numbers used in https://arewestableyet.com/ are correct and appropriate.
Hi Chutten, is there a read-me on this dashboard? For instance, what are the %age #s in each cell?

When I hover over a cell, it would be nice to know what is the expected crash rate range to go green/yellow? Also, is the dashboard showing daily snapshots? or is it showing trends up to the date selected in the "analyzed date" drop down? It's neat.
Flags: needinfo?(chutten)
Please ignore my comment about read-me because I don't scroll down much. Sorry! ;)
(In reply to Ritu Kothari (:ritu) from comment #14)
> When I hover over a cell, it would be nice to know what is the expected
> crash rate range to go green/yellow? Also, is the dashboard showing daily
> snapshots? or is it showing trends up to the date selected in the "analyzed
> date" drop down? It's neat.

Sure. Mouseover text is sadly not very accessible or discoverable, but as a non-critical augmentation I don't mind adding it.

The grid is showing data for the selected "Analyzed Date". I guess I should put the dropdown below "Firefox Desktop" so that it's a little more clear.
Flags: needinfo?(chutten)
I like it; it is now better than arewestableyet. Bravo & many thanks.
I think we now have to try it against reality :)
Flags: needinfo?(sledru)
By the way, about startup crashes, can bug 1295934 help?
I think it will, as soon as bug 1306013 brings it to the dataset endpoint (crash_aggregates) that the dashboard needs.
My comment has been addressed by the latest version of the dashboard, I don't have anything to add.
Flags: needinfo?(mcastelluccio)
(In reply to Chris H-C :chutten from comment #10)
> But I can put in some sort of threshold on not graphing rates on days when
> the kusagehours are less than, say, 60% of expected. Which will likely do
> close to the same thing :)

Perhaps we can estimate the number of khours when it is lower than normal? As it is now, we still have a spike (even if way lower than before) at the end of the graph.
For example, we could discard the data if it's lower than 60%, adjust it somehow if it's lower than 90%.
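One way to read this suggestion, sketched under strong assumptions: `expected_kuh` is supplied by some external estimate (for example a recent average), and the day's crash count is treated as already complete, which may not hold. The function name and the 60%/90% defaults simply mirror the figures above.

```python
def adjusted_rate(crashes, kuh, expected_kuh, discard_below=0.6, adjust_below=0.9):
    """Return crashes per kusagehour, or None when the day should be dropped.

    ASSUMPTION: expected_kuh comes from an external estimate; no model for it
    is defined here.
    """
    if expected_kuh <= 0:
        return None
    fraction = kuh / expected_kuh
    if fraction < discard_below:
        return None  # too little usage data: discard the day
    if fraction < adjust_below:
        # Substitute the estimated usage hours for the incomplete tally.
        return crashes / expected_kuh
    return crashes / kuh
```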

I'm afraid that people will start getting used to the last day of the graph being unreliable, which makes it the same as just discarding the day.
Flags: needinfo?(chutten)
(In reply to Marco Castelluccio [:marco] from comment #21)
> (In reply to Chris H-C :chutten from comment #10)
> > But I can put in some sort of threshold on not graphing rates on days when
> > the kusagehours are less than, say, 60% of expected. Which will likely do
> > close to the same thing :)
> 
> Perhaps we can estimate the number of khours when it is lower than normal?
> As it is now, we still have a spike (even if way lower than before) at the
> end of the graph.
> For example, we could discard the data if it's lower than 60%, adjust it
> somehow if it's lower than 90%.

We sadly don't have a model for determining what a day's kuh tally will be, or a day's crash tally. Both of those are necessary to predict in order to get a good number for the crash rates.

Let me twiddle the threshold a bit and also see if I'm accidentally using yesterday's data instead of today's. (There's some disagreement on whether the dataset updates at midnight UTC or noon UTC.)
Flags: needinfo?(chutten)
It turns out that I _was_ accidentally using yesterday's data instead of today's. Timezones are hard. I've also pushed the updated thresholds so we should be pointing at a useful day right off of the bat.
Depends on: 1315996
Depends on: 1315998
Depends on: 1315999
The percentage numbers are confusing in two ways.

1) Percent of what? The fact that it's the change from a week ago would be helpful to have in a caption.

2) They don't read as percent change. Maybe subtracting 100 and putting a '+' sign in front of the positive values would make that clearer. E.g. 96% => -4% and 110% => +10%
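The proposed conversion is small enough to sketch directly. The function name `format_change` is hypothetical, and rounding to a whole percent is an assumption about how the cell would be displayed.

```python
def format_change(percent_of_last_week):
    """Turn 'percent of last week's value' into a signed percent change."""
    change = round(percent_of_last_week - 100)
    if change > 0:
        return f"+{change}%"
    return f"{change}%"
```

For the two examples given, `format_change(96)` yields `-4%` and `format_change(110)` yields `+10%`.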
I do love the chart and graphs -- thank you! I offer comment 24 in case you think it might improve things, not as a criticism.
Ack, I see I've missed some early comments by :marcia:

(1) A way to show crashes that move up to the top 5 rapidly/crash spikes
Sadly this dataset has no idea about stacks or signatures, so Socorro will be the place to go to for this information. Once our crash ping improvements land, we might be able to provide this information more readily.

(2) Platform specific crash breakdown for Desktop (Windows/Mac/Linux)
I can't think of a way to display this. Displaying it on the graphs wouldn't work as each graph would be dominated entirely by Windows.

(3) Android Version specific crash information for Mobile - this is useful to see when a new OS is released, like Android 7.0
This dashboard is designed to handle Desktop only... though I suppose there's nothing stopping Android data from getting in. Like point #2 I'm having difficulty coming up with a way to display this sensibly...

And to :dveditz:
The percentages are explained in the README at the bottom of the page. Yours are not the first comments about the inscrutability and dubious usefulness of the percentages (outside, perhaps, of the kuh row), however, so maybe I should just remove them for a simpler display...
> (2) Platform specific crash breakdown for Desktop (Windows/Mac/Linux)
> I can't think of a way to display this. Displaying it on the graphs wouldn't
> work as each graph would be dominated entirely by Windows.

One idea: display it on the graphs, and add the ability to filter which platforms are shown/not shown. The y-axis might need to scale up and down when the platform filter changes.
(In reply to Nicholas Nethercote [:njn] from comment #27)
> > (2) Platform specific crash breakdown for Desktop (Windows/Mac/Linux)
> > I can't think of a way to display this. Displaying it on the graphs wouldn't
> > work as each graph would be dominated entirely by Windows.
> 
> One idea: display it on the graphs, and add the ability to filter which
> platforms are shown/not shown. The y-axis might need to scale up and down
> when the platform filter changes.

I don't think we need to do this on this literal dashboard. If we do (and even if we don't), take the easy route and just isolate each desktop platform, so one graph per platform.
The dashboard is now at its new (permanent, I hope) home at https://telemetry.mozilla.org/crashes/
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Depends on: 1349536
Depends on: 1359645
Depends on: 1381873
Depends on: 1382236
Whiteboard: [fce-active] → [fce-active-legacy]