Closed Bug 1843406 Opened 10 months ago Closed 8 months ago

clean up Onboarding/FxMS alerting config for MVP

Categories

(Toolkit :: Telemetry, defect, P1)

VERIFIED FIXED

People

(Reporter: dmosedale, Assigned: chutten)

References

(Blocks 1 open bug)

Attachments

(2 files)

After discussion in this week's meeting, we agreed that we need to iterate on the alerting config in the FxMS/Onboarding Glean dashboard until it can reasonably send emails to the OMC team (right now there are far too many alerts for that to be reasonable).

A bunch of this work can probably be done on (at least) the nightly and beta channels now; release channel may need to accumulate some historical data to get a sense of what's going to be normal there...

Chris, what are your thoughts on how we work through this? It's clearly going to need some iteration...

Blocks: fxms-glean

The severity field is not set for this bug.
:chutten, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(chutten)

Looking at the alerts for July 27, we have (harvested by selecting them from the dashboard, copying them here, and manually sorting):

nightly	Linux	moments_ping_volume
nightly	Mac	moments_ping_volume
nightly	Windows	moments_ping_volume
nightly	Mac	other_ping_volume
nightly	Mac	spotlight_ping_volume
nightly	Linux	spotlight_ping_volume
nightly	Linux	infobar_ping_volume
aurora	Mac	moments_ping_volume
aurora	Windows	moments_ping_volume
aurora	Linux	moments_ping_volume
beta	Windows	moments_ping_volume
beta	Linux	moments_ping_volume
beta	Mac	moments_ping_volume
esr	Mac	moments_ping_volume
esr	Windows	moments_ping_volume
esr	Linux	moments_ping_volume
esr	Mac	null_ping_volume
esr	Mac	cfr_ping_volume
esr	Mac	spotlight_ping_volume
esr	Linux	infobar_ping_volume
esr	Windows	whats_new_panel_ping_volume
release	Windows	whats_new_panel_ping_volume
release	Linux	whats_new_panel_ping_volume
release	Mac	whats_new_panel_ping_volume
release	Windows	unknown_keys_volume
release	Linux	moments_ping_volume
	Windows	whats_new_panel_ping_volume
Other	Mac	moments_ping_volume
Other	Linux	moments_ping_volume
Other	Mac	other_ping_volume
Other	Mac	null_ping_volume
Other	Mac	spotlight_ping_volume
Other	Mac	whats_new_panel_ping_volume
Other	Windows	whats_new_panel_ping_volume

Let's go through them in order:

  • Nightly: the only alert across all three OSes is moments_ping_volume... which is totally legit. There was an order of magnitude drop on July 22.
  • Nightly: Mac other_ping_volume: There should not be any.
    • Proposal: File and fix a bug for this, like bug 1844360
  • Nightly: Linux/Mac spotlight_ping_volume - The titanic shift in volume for spotlight pings happened starting July 13 which is now out of the alert window. These two OSes are alerting because the absolute number of these pings is so low that smallish perturbations can trigger the alert.
    • Proposal: Not sure. This is a Low Volume Case. Maybe this would flatten out if we looked at it normalized by client volume? Maybe it wouldn't. Maybe we find some way to save alerting on Nightly for messages that are more stable in volume.
  • Nightly: Linux infobar_ping_volume - Wow, there's usually just 0 of these. So every time there's more than 0 of them, the alert will fire. An extreme case of the low volume case? Or worth looking into since it doesn't have Mac-like volumes and we should expect it to?
    • Proposal: File an investigation bug to see if Linux infobars are special. If they are, fold this into Low Volume Case. If they aren't, fix 'em.
  • Aurora: moments_ping_volume - Again, totally legit due to a cliff edge on July 22.
  • Beta: moments_ping_volume - A third time: totally legit due to a cliff edge on July 22.
  • ESR: moments_ping_volume - A fourth time: totally legit due to a cliff edge on July 22.
  • ESR: Mac null, cfr, and spotlight - starts July 16 and so almost certainly reflects the increase in overall population due to migration from release to ESR. We can ignore this.
  • ESR: Linux infobar_ping_volume - Huh, another instance of "Are Linux infobars special"
  • ESR: Windows whats_new_panel_ping_volume - Unlike the other messages, which spiked due to old Windows' migration from release to esr, the Whats New Panel spiked earlier and this alert is for the subsequent levelling down. Could be due to the "channel switch updates cause client_id reset" incident (now thought to be resolved). Wait and see on this one.
  • Release: Windows/Mac/Linux all spiking on whatsnew, huh. This is post-release-week levelling-off of messages. We see these happen more sharply on other channels where releases go out unfettered - on release channel, they're drawn out enough for the window to get properly used to them being here.
    • Proposal: Education fix. This is something to expect, and is totally cool. Worth it to have the regular cadence of alertness so that we catch weirdness between releases.
  • Release: Windows unknown_keys_volume - This is bug 1844360.
  • Release: Linux moments_ping_volume - It exhibits a large double-spike... which is weird. What's weirder is that all of the OSes are exhibiting this, and only Linux is small enough to alert on it on this day (the others, with two spikes, have their averages sufficiently high to allow for the 27th's low to fall inside the tolerance). I don't know what's causing this.
    • Proposal: File an investigation bug. Maybe some messaging's been going out this past week or two?
  • We're alerting on null normalized_channel
    • Proposal: Filter out null normalized_channel from monitoring and alerts. They're so small any data that slips through will alert. And there's nothing we'll intend to do about it.
  • We're alerting on Other normalized_channel
    • Proposal: The same as for null, only bigger.

In conclusion: Mostly reactions to real things happening, which is exactly what we want from alerts. A couple of things that might warrant additional investigation in follow-ups. A few things we should definitely update the dashboard to filter out. And then there's the Low Volume Case... how to deal with legitimately-low and legitimately-noisy volumes of messages? This is probably the one dashboard-inherent quirk we should consider adjusting.
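To make the Low Volume Case concrete, here is a minimal sketch of why a relative-tolerance alert is noisy on small counts. The rule itself (`relative_alert`, a 50% tolerance around the mean of recent days) is an assumption for illustration; the dashboard's actual alert logic is not described in this bug.

```python
from statistics import mean


def relative_alert(history, today, tolerance=0.5):
    """Hypothetical rule: alert when today's count deviates from the
    mean of recent days by more than `tolerance` (as a fraction)."""
    baseline = mean(history)
    if baseline == 0:
        # A baseline of zero means any non-zero count is "infinitely" off,
        # which matches the Linux infobar situation described above.
        return today != 0
    return abs(today - baseline) / baseline > tolerance


# The same absolute bump of 3 pings, at two very different volumes.
low_volume = relative_alert([2] * 7, 5)          # 2 -> 5 pings
high_volume = relative_alert([2000] * 7, 2003)   # 2000 -> 2003 pings
zero_baseline = relative_alert([0] * 7, 1)       # usually 0, then 1
```

Under this assumed rule the low-volume and zero-baseline series alert while the high-volume one does not, even though the perturbation is the same size. Normalizing by client volume would change the baseline, but whether it flattens the noise is the open question.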

But before we get to that, this is just one day's alerts. I shall perform this same analysis on Monday to see if we get spurious weekend alerts or in case some other avenues of discussion crop up when examining multiple days' alert loads.

Severity: -- → S4
Status: NEW → ASSIGNED
Flags: needinfo?(chutten)
Priority: -- → P1

Alerts for July 30:

nightly	Windows	moments_ping_volume
nightly	Windows	other_ping_volume
nightly	Mac	moments_ping_volume
nightly	Mac	spotlight_ping_volume
nightly	Mac	other_ping_volume
nightly	Linux	moments_ping_volume
aurora	Mac	moments_ping_volume
aurora	Windows	moments_ping_volume
aurora	Linux	moments_ping_volume
beta	Windows	moments_ping_volume
beta	Linux	moments_ping_volume
beta	Mac	moments_ping_volume
esr	Mac	moments_ping_volume
esr	Linux	moments_ping_volume
esr	Windows	whats_new_panel_ping_volume
esr	Linux	infobar_ping_volume
esr	Windows	moments_ping_volume
release	Windows	whats_new_panel_ping_volume
release	Windows	unknown_keys_volume
release	Mac	whats_new_panel_ping_volume
release	Linux	whats_new_panel_ping_volume
Other	Windows	whats_new_panel_ping_volume
Other	Mac	other_ping_volume
Other	Linux	moments_ping_volume
Other	Mac	moments_ping_volume
Other	Mac	spotlight_ping_volume
	Mac	cfr_ping_volume
	Windows	infobar_ping_volume

This is indeed essentially the same as the 27th's alerts, suggesting minimal seasonality at this time.
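The day-over-day comparison can be done mechanically rather than by eye. The sketch below parses tab-separated rows like the ones pasted above and uses set operations to split them into persisting, cleared, and new alerts. `parse_alerts` is a hypothetical helper, and only a handful of rows from each day are included to keep it short.

```python
def parse_alerts(text):
    """Turn pasted 'channel<TAB>os<TAB>alert' rows into a set of tuples.
    A blank channel (leading tab) is kept as an empty string."""
    rows = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        channel, os_name, alert = line.split("\t")
        rows.add((channel, os_name, alert))
    return rows


# A few rows copied from each day's list (not the full lists).
july_27 = parse_alerts(
    "nightly\tLinux\tinfobar_ping_volume\n"
    "nightly\tMac\tother_ping_volume\n"
    "release\tWindows\tunknown_keys_volume\n"
)
july_30 = parse_alerts(
    "nightly\tWindows\tother_ping_volume\n"
    "nightly\tMac\tother_ping_volume\n"
    "release\tWindows\tunknown_keys_volume\n"
)

persisting = july_27 & july_30   # alerting on both days
cleared = july_27 - july_30      # alerting on the 27th only
new = july_30 - july_27          # alerting on the 30th only
```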

This does reinforce that a stepwise shift in ping volumes by type will trigger alerts for many days. If this is a typical case, then to avoid alert fatigue it might be worth exploring these ping-volume-tied alerts as a regular (weekly?) task instead of as an email or Slack message. Or maybe we should choose a low-window high-threshold approach so the alerts "clear" quickly on stepwise changes? Food for thought.
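A rough simulation of the window/threshold trade-off, under assumptions: the alert compares each day to the mean of a trailing window, and the series is a synthetic order-of-magnitude drop (1000 to 100 pings per day), loosely modelled on the July 22 cliff. `alert_days`, the window lengths, and the tolerances are all illustrative and not taken from the dashboard config.

```python
def alert_days(series, window, tolerance):
    """Return the day indices where the value deviates from the mean of
    the preceding `window` days by more than `tolerance` (a fraction)."""
    alerts = []
    for day in range(1, len(series)):
        history = series[max(0, day - window):day]
        baseline = sum(history) / len(history)
        if baseline and abs(series[day] - baseline) / baseline > tolerance:
            alerts.append(day)
    return alerts


# Synthetic data: 10 days at 1000 pings, then a step down to 100.
series = [1000] * 10 + [100] * 20

long_window = alert_days(series, window=14, tolerance=0.5)
short_window = alert_days(series, window=3, tolerance=0.8)
```

On this synthetic series the 14-day window keeps alerting for 13 consecutive days after the step (days 10 through 22), while the 3-day window with the higher tolerance alerts on days 10 and 11 only. The cost of the short window is that it will also forget a real regression just as quickly.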

Stuff to do to clean up the dashboard & alerting:

a) filter out "null" and "other" channels
b) move Undesired Pings to own graph
c) make "release" and "Windows" the default settings
d) switch alerting to just simple threshold alerts (we decided that getting alerts based on historical trends to work without substantial annoying artifacts doesn't deserve to be in MVP, especially since just having something basic will be way better than what we have now -- nothing).
e) investigate the usefulness of build ID, parameter, branch columns and consider filtering them if they're unlikely to offer any useful functionality, since they create visual clutter.

Chris, do these look reasonable?

Flags: needinfo?(chutten)

(In reply to Dan Mosedale (:dmosedale, :dmose) from comment #4)

Chris, do these look reasonable?
a) filter out "null" and "other" channels

Can do.

b) move Undesired Pings to own graph

Should be straightforward.

c) make "release" and "Windows" the default settings

These values are controlled by the URL of the dashboard. Instead of https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_messaging_system use https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_messaging_system?Normalized+Channel=release&Normalized+Os=Windows
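For anyone generating these links for other channel/OS combinations, a small sketch using Python's standard `urllib.parse.urlencode`, which encodes the spaces in the filter names as `+`. The filter names come from the URL above; `dashboard_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE = (
    "https://mozilla.cloud.looker.com/dashboards/"
    "operational_monitoring::firefox_messaging_system"
)


def dashboard_url(channel="release", os_name="Windows"):
    """Build a dashboard link with the channel and OS filters preset."""
    query = urlencode({"Normalized Channel": channel, "Normalized Os": os_name})
    return f"{BASE}?{query}"


default_url = dashboard_url()
beta_mac_url = dashboard_url("beta", "Mac")
```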

d) switch alerting to just simple threshold alerts (we decided that getting alerts based on historical trends to work without substantial annoying artifacts doesn't deserve to be in MVP, especially since just having something basic will be way better than what we have now -- nothing).

Existing thresholdable alerts will remain (pending the investigation of bug 1844360) and should serve.

e) investigate the usefulness of build ID, parameter, branch columns and consider filtering them if they're unlikely to offer any useful functionality, since they create visual clutter.

This one I will look into, but I have no idea whether there'll be much I can do about it. (I've not changed OpMon itself before. It could be fun.)


I'll get these done, report back here, then mark as FIXED.

Flags: needinfo?(chutten)

These are the dashboard changes ( a) through d) ).

This removes whichever of submission_date or build_id is not the opmon config's xaxis. This is the only thing I think I can do for e) without a lot of education in lookml-generator and possibly also a lot of work.

Dan, I've performed the items from the list barring the removal of "parameter, branch columns" from item e). Ready to call the dashboard deliverable?

Flags: needinfo?(dmosedale)

Looks good; thanks so much for making this happen!

Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Flags: needinfo?(dmosedale)
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Summary: clean up Onboarding/FxMS alerting config so it can start sending mails to OMC team → clean up Onboarding/FxMS alerting config for MVP