Closed Bug 1838698 Opened 1 year ago Closed 11 months ago

Craft System Health Monitoring Dashboard for Reinstrumented Messaging System data

Categories

(Toolkit :: Telemetry, task, P1)

RESOLVED FIXED

People

(Reporter: chutten, Assigned: chutten)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

As described in the plan we need to provide some system validation not only now but into the future to establish and monitor the health of the reinstrumented Messaging System.

This will take the shape of an OpMon dashboard and it'll look at things like:

  • Ping volumes by pingType
  • Client volumes
  • Counts of unexpected data (nested or unknown keys)
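As a rough illustration of what those three measures mean, here is a minimal Python sketch over a handful of made-up pings. The key names, ping types, and the `KNOWN_KEYS` set are all invented for the example; the real dashboard computes these in OpMon from the telemetry tables, not in Python.

```python
from collections import Counter

# Hypothetical schema: the real set of expected keys lives in the ping definitions.
KNOWN_KEYS = {"ping_type", "client_id", "event", "message_id"}


def has_unknown_keys(ping):
    """True if the ping carries any top-level key we did not expect."""
    return any(key not in KNOWN_KEYS for key in ping)


def has_nested_values(ping):
    """True if any top-level value is itself a dict or list."""
    return any(isinstance(value, (dict, list)) for value in ping.values())


# Invented sample data.
pings = [
    {"ping_type": "cfr", "client_id": "a", "event": "IMPRESSION"},
    {"ping_type": "cfr", "client_id": "b", "surprise": 1},
    {"ping_type": "infobar", "client_id": "a", "event": {"nested": True}},
]

ping_volume_by_type = Counter(p["ping_type"] for p in pings)
client_volume = len({p["client_id"] for p in pings})
unknown_key_count = sum(has_unknown_keys(p) for p in pings)
nested_count = sum(has_nested_values(p) for p in pings)

print(ping_volume_by_type, client_volume, unknown_key_count, nested_count)
```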

Dashboard's starting to appear: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_messaging_system

Gonna switch to compact_visualization = false and add in the opmon metrics for unexpected data.

Refinements have landed, we'll see how they look on Monday.

Things left to do (after the rebuild and redeploy, and data gathering over the weekend):

  • Ask for OMC's feedback
  • Set up alerts (to OMC's specification)

Then we ought to be able to call it done, I think.

Dan: Please have a look at the dashboard.

I'm interested in all feedback, but some of it might turn into opmon feature requests rather than changes to the dashboard before delivery. The things I can definitely refine are additions or changes to what data is monitored and how alerts are configured (though the alerts won't be ready for inspection until tomorrow), and things of that nature. If requests for changes are self-contained, we should consider filing them as follow-ups that the OMC team can take, as a way to get used to finding and adjusting the dashboard.

Flags: needinfo?(dmosedale)

Thanks so much for making this happen!

I totally get that some of this is “opmon feature requests”; thanks. :-)

As far as spinoffs, I think followups that can be taken by the team are a great idea.

Here's some feedback with the important stuff first and the nits at the end:

  • What are the expected use cases?
    ** One that I can see is this:
    *** Get an email alert, which should generally happen pretty rarely
    *** Look at the dashboard to see if we can understand more about the issue and decide on next steps
    ** Are there others?

  • If the use case above is the only one, then having the channel and OS segmented makes sense. If there are others, it seems like it would depend on what those others are…

  • I’m not sure how to apply the word “percentile” to the graphs in this particular dashboard, perhaps in part because the percentile claims to be settable, but if I set it to 5 and 95 (and 50), I see the exact same dashboard with all of them. Can you unpack how this is intended to be used?

  • I’m assuming that the only reason there are so many alerts is because we’re still building up historical data to use to set the correct context to choose when to alert.
    ** If that’s true, how long should we expect to wait for alerts to become pretty rare? I suspect we don’t want to turn on email alerting until they are…
    ** If I’m wrong, I suspect this frequency of alerts would be hard to keep up with, and maybe we want to fix something or structure things differently…

  • Active, sum, and point don’t help me (at least) understand the graphs, and to some degree make it more confusing. Here’s what I’m guessing they would ideally be; it would be nice to get them re-labeled if/when opmon supports that…
    ** All of the “sum” subtitles: ? maybe this should just not be there?
    ** Ping Volume “active”: “ping count”
    ** Client Volume “active”: “active clients”
    ** Ping Volume By Ping Type & Unexpected Data
    *** “point”: “ping count”
    *** “active - ” prefix and “_volume” suffix for all of the ping types: ? If these are always going to be the same, the graph would be more readable without them. If there are other possible values that could sometimes appear here, it’d be nice to know what they are.

  • For the client volume graph, it would be interesting to have a line showing the number of clients sending any kind of glean pings on the given day there, as a way to see how the client volume fits into the bigger picture (and we might want to alert if the percentage changes sufficiently, I dunno).

  • It would be nice if the X-axes included ticks for each day, even the (alternating) days that aren’t labeled

Thoughts?

Flags: needinfo?(dmosedale) → needinfo?(chutten)

(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #7)

  • What are the expected use cases?
    ** One that I can see is this:
    *** Get an email alert, which should generally happen pretty rarely
    *** Look at the dashboard to see if we can understand more about the issue and decide on next steps
    ** Are there others?

I can think of one: we've taken to looking at our monitoring dashboards as part of our weekly triage meeting. Of course, we're using non-opmon dashboards at the moment (our dashboards predate opmon by a fair chalk), so we don't have alerting and can't take that approach. Maybe when we eventually (hopefully) switch to using opmon for our monitoring and can take advantage of the alerting, we'll switch to "got an email alert" as our model.

  • I’m not sure how to apply the word “percentile” to the graphs in this particular dashboard, perhaps in part because the percentile claims to be settable, but if I set it to 5 and 95 (and 50), I see the exact same dashboard with all of them. Can you unpack how this is intended to be used?

That's for any histogram-shaped measures you might choose to add and is presently unused. (Plotting entire histograms over time would require something like a ridgeline plot, a heatmap, or some other datavis, and it wouldn't support alerting. By looking instead at summary statistics over time (like the 50th %ile), it can use the established line plots and be alerted upon in an intuitive way.)
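A minimal sketch of that idea: reduce each day's histogram to a single percentile so it can be drawn as one point on a line plot. The bucket values and dates are invented, and the nearest-bucket percentile used here is an assumption for illustration, not necessarily how OpMon computes it.

```python
def histogram_percentile(hist, pct):
    """Return the smallest bucket whose cumulative count reaches pct% of the total.

    hist maps bucket value -> number of samples in that bucket.
    Returns None for an empty histogram.
    """
    total = sum(hist.values())
    if total == 0:
        return None
    threshold = total * pct / 100
    running = 0
    for bucket in sorted(hist):
        running += hist[bucket]
        if running >= threshold:
            return bucket
    return max(hist)


# Invented per-day histograms.
daily_histograms = {
    "2023-06-26": {1: 10, 2: 30, 5: 50, 10: 10},
    "2023-06-27": {1: 5, 2: 10, 5: 30, 10: 55},
}

# One number per day: something a line plot (and an alert) can work with.
p50_series = {
    day: histogram_percentile(hist, 50)
    for day, hist in daily_histograms.items()
}
print(p50_series)
```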

  • I’m assuming that the only reason there are so many alerts is because we’re still building up historical data to use to set the correct context to choose when to alert.
    ** If that’s true, how long should we expect to wait for alerts to become pretty rare? I suspect we don’t want to turn on email alerting until they are…
    ** If I’m wrong, I suspect this frequency of alerts would be hard to keep up with, and maybe we want to fix something or structure things differently…

I'm actually not sure. With more than $window_size days of data I was expecting it to have settled, but maybe we need $previous_window_size + $current_window_size days' worth of data (in which case it should settle after it gets its data from the 29th, i.e., the morning of the 30th). I've been meaning to pester Anna about it, but keep getting pulled off to other things, so... ni?Anna - Do you have any insights about why we're getting so many alerts?
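The "previous window plus current window" hypothesis can be sketched like this. This is an assumed model of a window-over-window comparison, not OpMon's actual alerting code; the function name, the sample data, and the use of window means are all illustrative.

```python
from statistics import mean


def window_alert(values, window_size, max_relative_change):
    """Compare the latest window against the one immediately before it.

    Returns None when there are fewer than 2 * window_size points
    (nothing to compare against yet), otherwise True when the relative
    change between the two window means exceeds max_relative_change.
    """
    if len(values) < 2 * window_size:
        return None
    previous = mean(values[-2 * window_size:-window_size])
    current = mean(values[-window_size:])
    if previous == 0:
        return current != 0
    return abs(current - previous) / previous > max_relative_change


# Invented daily ping counts: a quiet week followed by a much busier week.
daily_pings = [100, 102, 98, 101, 99, 103, 100,
               150, 155, 149, 152, 151, 148, 153]

# Ten days of data is more than one window but less than two.
print(window_alert(daily_pings[:10], window_size=7, max_relative_change=0.25))
# With both windows filled, the roughly 50% jump is flagged.
print(window_alert(daily_pings, window_size=7, max_relative_change=0.25))
```

Under this model, a comparison only becomes meaningful once both windows are full, which would be consistent with alerts settling later than expected.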

  • Active, sum, and point don’t help me (at least) understand the graphs, and to some degree make it more confusing. Here’s what I’m guessing they would ideally be; it would be nice to get them re-labeled if/when opmon supports that…
    ** All of the “sum” subtitles: ? maybe this should just not be there?

The "Sum" tells you that this is the statistic that we're examining over time. Presently opmon supports sum, percentile, mean, and count - it'd be important to distinguish amongst the statistics of the same metric that are being examined if we were looking at them.

Right now we're looking at coarse things at the ping level, where sum and count are the same thing, so we're only looking at the former. (I guess I could've used count instead. Oh well. That's what I get for following the examples.)
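A tiny sketch of why the statistic label matters in general but not here. It assumes a ping-level metric contributes a value of 1 per row, which is an illustrative simplification; the per-client numbers are invented.

```python
from statistics import mean


def summarize(values):
    """Compute a few of the statistics mentioned above for one metric."""
    return {"sum": sum(values), "count": len(values), "mean": mean(values)}


# Ping-level metric: every row is one ping, so each row contributes 1.
ping_rows = [1, 1, 1, 1]
# A hypothetical per-client metric where the values vary.
per_client_values = [3, 0, 5]

ping_summary = summarize(ping_rows)
client_summary = summarize(per_client_values)

print(ping_summary)    # sum and count coincide
print(client_summary)  # sum, count, and mean all tell different stories
```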

    ** Ping Volume “active”: “ping count”
    ** Client Volume “active”: “active clients”
    ** Ping Volume By Ping Type & Unexpected Data
    *** “point”: “ping count”
    *** “active - ” prefix and “_volume” suffix for all of the ping types: ? If these are always going to be the same, the graph would be more readable without them. If there are other possible values that could sometimes appear here, it’d be nice to know what they are.

The _volume suffix is simply what the opmon metric is called. I went for "descriptive for readers of the toml file" over "most helpful in the legend". And, huh, it looks like maybe the friendly_name property would let me configure that... I'll give that a try.

  • For the client volume graph, it would be interesting to have a line showing the number of clients sending any kind of glean pings on the given day there, as a way to see how the client volume fits into the bigger picture (and we might want to alert if the percentage changes sufficiently, I dunno).

I don't think opmon has the capability for querying multiple tables (called data_source in the toml). (Another question for Anna).

However, since we're looking at absolute counts with alerts for historical variance, any change to the percentage of all users sending "messaging-system" pings can only be due to a) a change to the numerator (the population of clients sending "messaging-system" pings), which will net you alerts from this dashboard, or b) a change to the denominator (the overall user population), which other people will get an alert about. So we might be covered.
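The numerator/denominator argument, worked through with invented client counts:

```python
def share(messaging_clients, all_clients):
    """Fraction of all clients that sent "messaging-system" pings."""
    return messaging_clients / all_clients


# Invented numbers, purely to illustrate the two cases.
baseline = share(800, 1000)

# Case a: the numerator drops. The absolute count this dashboard
# watches changes too (800 -> 600), so its alerts would fire.
numerator_drop = share(600, 1000)

# Case b: the denominator grows while the messaging population holds
# steady. This dashboard sees no change (800 -> 800), but whoever
# monitors the overall population would see 1000 -> 1200.
denominator_growth = share(800, 1200)

print(baseline, numerator_drop, round(denominator_growth, 3))
```

In both cases the percentage moves, and in both cases an absolute-count alert somewhere already covers the underlying change.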

  • It would be nice if the X-axes included ticks for each day, even the (alternating) days that aren’t labeled

That appears to be a limitation of Looker. If you click on the three dots of any of those plots and select "Explore from here" you'll get the fully-powered query+visualization experience that Looker provides. In a quick skim of the available plot visualization options I couldn't find a way to force grid lines or ticks or labels for each distinct value. Maybe this is because Looker assumes you're more often looking at daily values over the course of months (which at standard widths would be very busy if they had ticks for each day), or maybe it's because I didn't look hard enough.

Either way, though, "Explore from here" is a useful thing to have in your toolbox for when you find something in Looker and go "I'd like to see that, but with {X} as {Y} instead".

Flags: needinfo?(chutten) → needinfo?(ascholtz)

I’m not sure how to apply the word “percentile” to the graphs in this particular dashboard

In addition to what chutten mentioned, there is an open bug to only show this filter option when percentiles are actually being used for any of the metrics: https://github.com/mozilla/opmon/issues/84

Anna - Do you have any insights about why we're getting so many alerts?

Looking into some alerts, they don't look completely off. Most of these metrics don't have the smoothest development over time. For example, looking at some of the alerts coming from Windows release, there have been significant increases recently: https://mozilla.cloud.looker.com/x/vYkdAdiBNZWhb0Sz4iqjID
There might be some potential in increasing the max_relative_change value in the config to get fewer alerts: https://github.com/mozilla/metric-hub/blob/6b60ba02e1ee0b3ab93958359c5e13eb8900267e/opmon/firefox-messaging-system.toml#L175C1-L175C20 Or increase the window sizes to be looked at. Or use fixed thresholds instead of relative changes.
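To get a feel for how a looser relative limit or a fixed threshold changes alert volume on a bumpy metric, here is a small sketch. It is a simplification: it compares each day to the previous day rather than window to window, and it is not OpMon's alerting code. The series and bounds are invented.

```python
def relative_change_alerts(series, max_relative_change):
    """Indices where a value moved by more than the limit versus the prior value."""
    return [
        i for i in range(1, len(series))
        if series[i - 1]
        and abs(series[i] - series[i - 1]) / series[i - 1] > max_relative_change
    ]


def fixed_threshold_alerts(series, lower, upper):
    """Indices where a value falls outside fixed bounds."""
    return [i for i, value in enumerate(series) if not lower <= value <= upper]


# An invented, bumpy metric with one genuine collapse at the end.
noisy = [100, 140, 90, 150, 95, 145, 20]

print(relative_change_alerts(noisy, 0.3))       # tight limit: alerts almost daily
print(relative_change_alerts(noisy, 0.7))       # looser limit: only the collapse
print(fixed_threshold_alerts(noisy, 50, 200))   # fixed bounds: only the collapse
```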

Active, sum, and point don’t help me (at least) understand the graphs, and to some degree make it more confusing. Here’s what I’m guessing they would ideally be; it would be nice to get them re-labeled if/when opmon supports that…

There is another open bug to add some more context around what each statistic means to the dashboard: https://github.com/mozilla/opmon/issues/109

And, huh, it looks like maybe the friendly_name property would let me configure that... I'll give that a try.

Same for metric descriptions, there is an open bug there as well: https://github.com/mozilla/opmon/issues/61
Using friendly name where available could be implemented as part of this.
Getting more of the metadata to show up on the dashboard shouldn't be a huge amount of effort. I can see if Eduardo has some capacity to get this done soon, or I might have some time to implement it since this can potentially help a lot with usability here.

I don't think opmon has the capability for querying multiple tables (called data_source in the toml). (Another question for Anna).

You could define your own custom data_source that queries from multiple tables. The from_expression can be a SQL subquery instead of a reference to a table (e.g. from_expression = (SELECT * FROM whatever_complex_join_I_need))
Or you could define another metric_group that combines the two metrics.
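A sketch of the subquery approach, run against an in-memory SQLite database so it can be tried locally. The table names, columns, and rows are hypothetical, and the production tables would live in the data warehouse with their own SQL dialect; only the shape of the subquery is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables standing in for the real telemetry tables.
conn.executescript("""
    CREATE TABLE messaging_pings (submission_date TEXT, client_id TEXT);
    CREATE TABLE all_pings (submission_date TEXT, client_id TEXT);
    INSERT INTO messaging_pings VALUES
        ('2023-06-29', 'a'), ('2023-06-29', 'b');
    INSERT INTO all_pings VALUES
        ('2023-06-29', 'a'), ('2023-06-29', 'b'),
        ('2023-06-29', 'c'), ('2023-06-29', 'd');
""")

# The kind of parenthesized subquery that could serve as a from_expression:
# one row per day with both client counts side by side.
from_expression = """(
    SELECT a.submission_date,
           COUNT(DISTINCT a.client_id) AS all_clients,
           COUNT(DISTINCT m.client_id) AS messaging_clients
    FROM all_pings AS a
    LEFT JOIN messaging_pings AS m
      ON m.submission_date = a.submission_date
     AND m.client_id = a.client_id
    GROUP BY a.submission_date
)"""

rows = conn.execute(f"SELECT * FROM {from_expression} AS combined").fetchall()
print(rows)
```

With both counts available in one data source, a metric could then select either column or their ratio.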

It would be nice if the X-axes included ticks for each day, even the (alternating) days that aren’t labeled

Yeah, that seems to be a Looker limitation at the moment. The more data is shown, the fewer days will be labeled explicitly.

Flags: needinfo?(ascholtz)

Also, feel free to open any issues or feature requests here: https://github.com/mozilla/opmon/issues
We are currently a bit short-staffed on the OpMon side, but if there is something that would significantly improve the experience with OpMon we can find a way to get it implemented quickly.

I've reviewed the outstanding issues we identified in comments and it appears as though most have linked issues. The one notable exception is changing the monitoring to (or augmenting it with) affected proportions of the population.

To my eyes, my argument in comment #8 still stands: the only sorts of things we'd gain by looking at proportions instead of absolutes would be not alerting for things like the upcoming shift of a large proportion of our population from release to ESR. Given their historical rarity (this is only the second one I can remember happening in these past eight years), I don't know that it's necessary.

However, I'd happily file a follow-up for future consideration. :dmose, which would you like?

Flags: needinfo?(dmosedale)

I think you're right; it's not particularly important, no need to file.

Flags: needinfo?(dmosedale)

Alrighty, we'll call https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_messaging_system good enough to mark this resolved. Future work tasks to be tracked in individual bugs.

Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED