Closed Bug 1838698 Opened 1 year ago Closed 1 year ago

Craft System Health Monitoring Dashboard for Reinstrumented Messaging System data

Status:

RESOLVED FIXED

People

(Reporter: chutten, Assigned: chutten)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

pr 133: bug 1838698 - Preliminary Messaging System health monitoring dashboard 1 year ago Chris H-C :chutten 46 bytes, text/x-github-pull-request
pr 135: bug 1838698 - Messaging System Health Dashboard refinements 1 year ago Chris H-C :chutten 46 bytes, text/x-github-pull-request
pr 138: bug 1838698 - Add alerts to Firefox Desktop Messaging System opmon dash 1 year ago Chris H-C :chutten 46 bytes, text/x-github-pull-request
pr 143: bug 1838698 - Add friendly names and descriptions to all metrics in the Messaging System config 1 year ago Chris H-C :chutten 46 bytes, text/x-github-pull-request

Chris H-C :chutten

Assignee

Description

•

1 year ago

As described in the plan, we need to provide system validation, not only now but into the future, to establish and monitor the health of the reinstrumented Messaging System.

This will take the shape of an OpMon dashboard and it'll look at things like:

Ping volumes by pingType
Client volumes
Counts of unexpected data (nested or unknown keys)
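
As a rough illustration of what those three measures compute (this is not the OpMon implementation), here is a minimal Python sketch. The record fields `ping_type`, `client_id`, and `payload`, the ping type names, and the `EXPECTED_KEYS` set are all hypothetical stand-ins for whatever the real ping schema contains.

```python
from collections import Counter

# Hypothetical set of top-level keys a well-formed ping payload may contain.
EXPECTED_KEYS = {"message_id", "event", "source"}


def summarize(pings):
    """Return ping volume by type, distinct client count, and unexpected-data counts."""
    volume_by_type = Counter(p["ping_type"] for p in pings)
    client_volume = len({p["client_id"] for p in pings})
    unknown_keys = 0
    nested_values = 0
    for p in pings:
        for key, value in p["payload"].items():
            if key not in EXPECTED_KEYS:
                unknown_keys += 1  # a key the schema does not know about
            if isinstance(value, (dict, list)):
                nested_values += 1  # a value that should have been flat
    return {
        "volume_by_type": dict(volume_by_type),
        "client_volume": client_volume,
        "unknown_keys": unknown_keys,
        "nested_values": nested_values,
    }


# Made-up sample records, for illustration only.
pings = [
    {"ping_type": "cfr", "client_id": "a", "payload": {"message_id": "m1", "event": "CLICK"}},
    {"ping_type": "cfr", "client_id": "b", "payload": {"message_id": "m2", "mystery": 1}},
    {"ping_type": "infobar", "client_id": "a", "payload": {"event": {"nested": True}}},
]
summary = summarize(pings)
```

In the real dashboard these would be daily aggregates computed in the warehouse; the sketch only shows the shape of each count.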

Chris H-C :chutten

Assignee

Comment 1

•

1 year ago

Attached file pr 133: bug 1838698 - Preliminary Messaging System health monitoring dashboard — Details

Chris H-C :chutten

Assignee

Comment 2

•

1 year ago

Dashboard's starting to appear: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_messaging_system

Gonna switch to compact_visualization = false and add in the opmon metrics for unexpected data.

Chris H-C :chutten

Assignee

Comment 3

•

1 year ago

Attached file pr 135: bug 1838698 - Messaging System Health Dashboard refinements — Details

Chris H-C :chutten

Assignee

Comment 4

•

1 year ago

Refinements have landed; we'll see how they look on Monday.

Things left to do (after the rebuild and redeploy, and data gathering over the weekend):

Ask for OMC's feedback
Set up alerts (to OMC's specification)

Then we ought to be able to call it done, I think.

Chris H-C :chutten

Assignee

Comment 5

•

1 year ago

Attached file pr 138: bug 1838698 - Add alerts to Firefox Desktop Messaging System opmon dash — Details

Chris H-C :chutten

Assignee

Comment 6

•

1 year ago

Dan: Please have a look at the dashboard.

I'm interested in all feedback, but some of it might turn into opmon feature requests rather than changes to the dashboard before delivery. Things I'm definitely able to refine include additions or changes to what data is monitored and which alerts are configured (though the alerts won't be ready for inspection until tomorrow), and things of that nature. If requests for changes are self-contained, we should consider filing them as follow-ups that the OMC team can take, as a way to get used to finding and adjusting the dashboard.

Flags: needinfo?(dmosedale)

Dan Mosedale (:dmosedale, :dmose)

Comment 7

•

1 year ago

Thanks so much for making this happen!

I totally get that some of this is “opmon feature requests”; thanks. :-)

As far as spinoffs, I think followups that can be taken by the team are a great idea.

Here's some feedback with the important stuff first and the nits at the end:

What are the expected use cases?
** One that I can see is this:
*** Get an email alert, which should generally happen pretty rarely
*** Look at the dashboard to see if we can understand more about the issue and decide on next steps
** Are there others?
If the use case above is the only one, then having the channel and OS segmented makes sense. If there are others, it seems like it would depend on what those others are…
I’m not sure how to apply the word “percentile” to the graphs in this particular dashboard, perhaps in part because the percentile claims to be settable, but if I set it to 5 and 95 (and 50), I see the exact same dashboard with all of them. Can you unpack how this is intended to be used?
I’m assuming that the only reason there are so many alerts is because we’re still building up historical data to use to set the correct context to choose when to alert.
** If that’s true, how long should we expect to wait for alerts to become pretty rare? I suspect we don’t want to turn on email alerting until they are…
** If I’m wrong, I suspect this frequency of alerts would be hard to keep up with, and maybe we want to fix something or structure things differently…
Active, sum, and point don’t help me (at least) understand the graphs, and to some degree make it more confusing. Here’s what I’m guessing they would ideally be; it would be nice to get them re-labeled if/when opmon supports that…
** All of the “sum” subtitles: ? maybe this should just not be there?
** Ping Volume “active”: “ping count”
** Client Volume “active”: “active clients”
** Ping Volume By Ping Type & Unexpected Data
*** “point”: “ping count”
*** “active - ” prefix and “_volume” suffix for all of the ping types: ? If these are always going to be the same, the graph would be more readable without them. If there are other possible values that could sometimes appear here, it’d be nice to know what they are.
For the client volume graph, it would be interesting to have a line showing the number of clients sending any kind of glean pings on the given day there, as a way to see how the client volume fits into the bigger picture (and we might want to alert if the percentage changes sufficiently, I dunno).
It would be nice if the X-axes included ticks for each day, even the (alternating) days that aren’t labeled

Thoughts?

Flags: needinfo?(dmosedale) → needinfo?(chutten)

Chris H-C :chutten

Assignee

Comment 8

•

1 year ago

(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #7)

What are the expected use cases?
** One that I can see is this:
*** Get an email alert, which should generally happen pretty rarely
*** Look at the dashboard to see if we can understand more about the issue and decide on next steps
** Are there others?

I can think of one: we've taken to looking at our monitoring dashboards as part of our weekly triage meeting. Of course, we're using non-opmon dashboards at the moment (our dashboards predate opmon by a fair chalk), so we don't have alerting and can't take that approach. Maybe when we eventually (hopefully) switch to using opmon for our monitoring and can take advantage of the alerting, we'll switch to "got an email alert" as our model.

I’m not sure how to apply the word “percentile” to the graphs in this particular dashboard, perhaps in part because the percentile claims to be settable, but if I set it to 5 and 95 (and 50), I see the exact same dashboard with all of them. Can you unpack how this is intended to be used?

That's for any histogram-shaped measures you might choose to add and is presently unused. (Plotting entire histograms over time would require something like a ridgeline plot, a heatmap, or some other datavis, and it wouldn't support alerting. By looking instead at summary statistics over time (like the 50th %ile), it can use the established line plots and be alerted upon in an intuitive way.)
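
To make the "summary statistic over time" idea concrete, here is a minimal sketch, assuming each day's histogram is a mapping from bucket value to count. The nearest-rank percentile rule, the function name, and the sample data are assumptions for illustration; OpMon may well compute percentiles differently.

```python
import math


def histogram_percentile(histogram, percentile):
    """Nearest-rank percentile of a {bucket_value: count} histogram."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    # 1-based rank of the sample that sits at the requested percentile.
    rank = max(1, math.ceil(percentile / 100 * total))
    seen = 0
    for bucket in sorted(histogram):
        seen += histogram[bucket]
        if seen >= rank:
            return bucket
    raise AssertionError("unreachable: rank never exceeds total")


# Made-up data: one histogram per day collapses to one point per day,
# which an ordinary line plot (and an alert) can work with.
daily_histograms = {
    "2023-06-26": {1: 5, 2: 3, 10: 2},
    "2023-06-27": {1: 1, 2: 3, 10: 6},
}
median_series = {
    day: histogram_percentile(hist, 50) for day, hist in daily_histograms.items()
}
```

With no histogram-shaped metrics configured there is nothing for the percentile setting to act on, which would explain why changing it leaves the dashboard unchanged.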

I’m assuming that the only reason there are so many alerts is because we’re still building up historical data to use to set the correct context to choose when to alert.
** If that’s true, how long should we expect to wait for alerts to become pretty rare? I suspect we don’t want to turn on email alerting until they are…
** If I’m wrong, I suspect this frequency of alerts would be hard to keep up with, and maybe we want to fix something or structure things differently…

I'm actually not sure. With more than $window_size days of data I was expecting it to have settled, but maybe we need $previous_window_size + $current_window_size days' worth of data (in which case it should settle after it gets its data from the 29th, i.e., the morning of the 30th). I've been meaning to pester Anna about it, but keep getting pulled off to other things, so... ni?Anna - Do you have any insights about why we're getting so many alerts?

Active, sum, and point don’t help me (at least) understand the graphs, and to some degree make it more confusing. Here’s what I’m guessing they would ideally be; it would be nice to get them re-labeled if/when opmon supports that…
** All of the “sum” subtitles: ? maybe this should just not be there?

The "Sum" tells you that this is the statistic that we're examining over time. Presently opmon supports sum, percentile, mean, and count - it'd be important to distinguish amongst the statistics of the same metric if we were examining more than one of them.

Right now we're looking at coarse things at the ping level, where sum and count are the same thing, so we're only looking at the former. (I guess I could've used count instead. Oh well. That's what I get for following the examples.)

** Ping Volume “active”: “ping count”
** Client Volume “active”: “active clients”
** Ping Volume By Ping Type & Unexpected Data
*** “point”: “ping count”
*** “active - ” prefix and “_volume” suffix for all of the ping types: ? If these are always going to be the same, the graph would be more readable without them. If there are other possible values that could sometimes appear here, it’d be nice to know what they are.

The _volume suffix is simply what the opmon metric is called. I went for "descriptive for readers of the toml file" over "most helpful in the legend". And, huh, it looks like maybe the friendly_name property would let me configure that... I'll give that a try.

For the client volume graph, it would be interesting to have a line showing the number of clients sending any kind of glean pings on the given day there, as a way to see how the client volume fits into the bigger picture (and we might want to alert if the percentage changes sufficiently, I dunno).

I don't think opmon has the capability for querying multiple tables (called data_source in the toml). (Another question for Anna).

However, since we're looking at absolute counts with alerts for historical variance, any change to "% of all users sending "messaging-system" pings" should only be due to either a) a change to the numerator (the population of clients sending "messaging-system" pings), which will net you alerts from this dashboard, or b) a change to the denominator (the overall user population), which other people will get an alert about. So we might be covered.
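
A small worked example of that argument, as a sketch only: the client counts, the 0.5 limit, and the monitor names are made up, and the check is a simplified stand-in for whatever the real alerts do.

```python
def relative_change(before, after):
    """Signed relative change from before to after."""
    return (after - before) / before


def monitors_that_notice(before, after, max_relative_change):
    """Name the absolute-count monitors that would see a given shift."""
    noticed = []
    if abs(relative_change(before["messaging"], after["messaging"])) > max_relative_change:
        noticed.append("messaging")  # this dashboard (numerator)
    if abs(relative_change(before["all"], after["all"])) > max_relative_change:
        noticed.append("overall")  # someone else's population monitoring (denominator)
    return noticed


# Hypothetical daily client counts.
baseline = {"messaging": 900_000, "all": 1_000_000}
numerator_shift = {"messaging": 300_000, "all": 1_000_000}    # share drops 90% -> 30%
denominator_shift = {"messaging": 900_000, "all": 2_000_000}  # share drops 90% -> 45%

noticed_numerator = monitors_that_notice(baseline, numerator_shift, 0.5)
noticed_denominator = monitors_that_notice(baseline, denominator_shift, 0.5)
```

Either way the share moves, one of the absolute counts has moved with it. The caveat is that a shift smaller than the configured limit in both counts would go unnoticed by both monitors.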

It would be nice if the X-axes included ticks for each day, even the (alternating) days that aren’t labeled

That appears to be a limitation of Looker. If you click on the three dots of any of those plots and select "Explore from here" you'll get the fully-powered query+visualization experience that Looker provides. In a quick skim of the available plot visualization options I couldn't find a way to force grid lines or ticks or labels for each distinct value. Maybe this is because Looker assumes you're more often looking at daily values over the course of months (which at standard widths would be very busy if they had ticks for each day), or maybe it's because I didn't look hard enough.

Either way, though, "Explore from here" is a useful thing to have in your toolbox for when you find something in Looker and go "I'd like to see that, but with {X} as {Y} instead".

Flags: needinfo?(chutten) → needinfo?(ascholtz)

Anna Scholtz [:ascholtz]

Comment 9

•

1 year ago

I’m not sure how to apply the word “percentile” to the graphs in this particular dashboard

In addition to what chutten mentioned, there is an open bug to only show this filter option when percentiles are actually being used for any of the metrics: https://github.com/mozilla/opmon/issues/84

Anna - Do you have any insights about why we're getting so many alerts?

Looking into some alerts, they don't look completely off. Most of these metrics don't have the smoothest development over time. For example, looking at some of the alerts coming from Windows release, there have been significant increases recently: https://mozilla.cloud.looker.com/x/vYkdAdiBNZWhb0Sz4iqjID
There might be some potential in increasing the max_relative_change value in the config to get fewer alerts: https://github.com/mozilla/metric-hub/blob/6b60ba02e1ee0b3ab93958359c5e13eb8900267e/opmon/firefox-messaging-system.toml#L175C1-L175C20 Alternatively, we could increase the window sizes being looked at, or use fixed thresholds instead of relative changes.
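
To show how those knobs interact, here is a minimal sketch. It assumes a relative-change alert compares the mean of the latest window with the mean of the window before it, which is a guess at the behavior and not OpMon's actual code; the function names and the sample series are hypothetical.

```python
from statistics import mean


def relative_change_alert(series, window_size, max_relative_change):
    """Compare the latest window's mean with the preceding window's mean.

    Returns None until 2 * window_size points exist, otherwise True or False.
    """
    if len(series) < 2 * window_size:
        return None  # not enough history for a previous and a current window
    previous = mean(series[-2 * window_size:-window_size])
    current = mean(series[-window_size:])
    if previous == 0:
        return current != 0
    return abs(current - previous) / previous > max_relative_change


def threshold_alert(value, minimum, maximum):
    """Fixed-threshold alternative: alert when a value leaves [minimum, maximum]."""
    return not (minimum <= value <= maximum)


# Made-up daily ping counts: a stable week followed by a week about 50% higher.
daily_pings = [100, 110, 90, 105, 95, 100, 100, 150, 160, 140, 155, 145, 150, 150]

strict = relative_change_alert(daily_pings, 7, 0.25)      # tight limit
lenient = relative_change_alert(daily_pings, 7, 0.75)     # raised limit
too_early = relative_change_alert(daily_pings[:10], 7, 0.25)
```

Under this assumption a raised max_relative_change silences the same step change that a tight one flags, and no comparison is possible until two full windows of data exist.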

Active, sum, and point don’t help me (at least) understand the graphs, and to some degree make it more confusing. Here’s what I’m guessing they would ideally be; it would be nice to get them re-labeled if/when opmon supports that…

There is another open bug to add some more context around what each statistic means to the dashboard: https://github.com/mozilla/opmon/issues/109

And, huh, it looks like maybe the friendly_name property would let me configure that... I'll give that a try.

Same for metric descriptions, there is an open bug there as well: https://github.com/mozilla/opmon/issues/61
Using friendly name where available could be implemented as part of this.
Getting more of the metadata to show up on the dashboard shouldn't be a huge amount of effort. I can see if Eduardo has some capacity to get this done soon, or I might have some time to implement it since this can potentially help a lot with usability here.

I don't think opmon has the capability for querying multiple tables (called data_source in the toml). (Another question for Anna).

You could define your own custom data_source that queries from multiple tables. The from_expression can be a SQL subquery instead of a reference to a table (e.g. from_expression = (SELECT * FROM whatever_complex_join_I_need))
Or you could define another metric_group that combines the two metrics.
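
The subquery idea can be tried locally. The sketch below uses Python's built-in sqlite3 as a stand-in for the real warehouse, with hypothetical tables `messaging_pings` and `all_pings` and made-up rows. It only illustrates that a parenthesized SELECT can sit where a table name would, which is the shape such a from_expression relies on; the real SQL dialect and table names will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE messaging_pings (submission_date TEXT, client_id TEXT);
    CREATE TABLE all_pings (submission_date TEXT, client_id TEXT);
    INSERT INTO messaging_pings VALUES ('2023-06-29', 'a'), ('2023-06-29', 'b');
    INSERT INTO all_pings VALUES
        ('2023-06-29', 'a'), ('2023-06-29', 'b'),
        ('2023-06-29', 'c'), ('2023-06-29', 'd');
    """
)

# Hypothetical stand-in for a from_expression: a subquery joining two tables.
from_expression = """(
    SELECT a.submission_date, a.client_id,
           (m.client_id IS NOT NULL) AS sent_messaging_ping
    FROM (SELECT DISTINCT submission_date, client_id FROM all_pings) AS a
    LEFT JOIN (SELECT DISTINCT submission_date, client_id FROM messaging_pings) AS m
      ON m.submission_date = a.submission_date AND m.client_id = a.client_id
)"""

# Metrics can then select from the joined result as if it were one table.
query = f"""
    SELECT submission_date,
           COUNT(*) AS all_clients,
           SUM(sent_messaging_ping) AS messaging_clients
    FROM {from_expression}
    GROUP BY submission_date
"""
rows = conn.execute(query).fetchall()
```

With both counts coming from one data source, a ratio of the two becomes something a single metric could express.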

It would be nice if the X-axes included ticks for each day, even the (alternating) days that aren’t labeled

Yeah, that seems to be a Looker limitation at the moment. The more data is shown, the fewer days will be labeled explicitly.

Flags: needinfo?(ascholtz)

Anna Scholtz [:ascholtz]

Comment 10

•

1 year ago

Also, feel free to open any issues or feature requests here: https://github.com/mozilla/opmon/issues
We are currently a bit short-staffed on the OpMon side, but if there is something that would significantly improve the experience with OpMon we can find a way to get it implemented quickly.

Chris H-C :chutten

Assignee

Comment 11

•

1 year ago

Attached file pr 143: bug 1838698 - Add friendly names and descriptions to all metrics in the Messaging System config — Details

chutten merged PR #143: "bug 1838698 - Add friendly names and descriptions to all metrics in the Messaging System config" in 15c7743.

Dan Mosedale (:dmosedale, :dmose)

Updated

•

1 year ago

Blocks: 1843406

Dan Mosedale (:dmosedale, :dmose)

Updated

•

1 year ago

Blocks: fxms-glean

Chris H-C :chutten

Assignee

Comment 12

•

1 year ago

I've reviewed the outstanding issues we identified in comments and it appears as though most have linked issues. The one notable exception is changing the monitoring to (or augmenting it with) affected proportions of the population.

To my eyes, my argument in comment #8 still stands: the only sorts of things we'd gain by looking at proportions instead of absolutes would be not alerting for things like the upcoming shift of a large proportion of our population from release to ESR. Given their historical rarity (this is only the second one I can remember happening in these past eight years), I don't know that it's necessary.

However, I'd happily file a follow-up for future consideration. :dmose, which would you like?

Flags: needinfo?(dmosedale)

Dan Mosedale (:dmosedale, :dmose)

Comment 13

•

1 year ago

I think you're right; it's not particularly important, no need to file.

Flags: needinfo?(dmosedale)

Chris H-C :chutten

Assignee

Comment 14

•

1 year ago

Alrighty, we'll call https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_messaging_system good enough to mark this resolved. Future work tasks to be tracked in individual bugs.

Status: ASSIGNED → RESOLVED

Closed: 1 year ago

Resolution: --- → FIXED

Categories

(Toolkit :: Telemetry, task, P1)
