Closed Bug 1458188 Opened 7 years ago Closed 3 years ago

Monitoring/alerting for bouncer aliases

Categories

(Release Engineering :: Release Automation: Bouncer, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sfraser, Assigned: sfraser)

The bouncer aliases are updated with each release, and it would be good to have an extra check on the end of a release graph to ensure the URL points to expected values. A periodic check would also be very useful.

We have some choices about configuration for these checks, and where to alert.

Alerting could go to:
* Email
* IRC
* Slack

Configuration could be:
a. No configuration, just notification if the value changes.
b. Configuration from Balrog rules for 'latest version' to ensure the values match.
c. Manually configured data source of expected values.

'a' is probably least effort, but requires a human to decide if the change is expected, and requiring context like that for operational health alerts is an anti-pattern.

'b' would work, assuming we can formulate Balrog queries that are equivalent to the bouncer ones. The new rules would also not have been signed off, and so not live, when the in-graph test runs.

'c' has a layer of extra work attached, but is likely the most reliable indicator of an unexpected change.

Since we'd like to run this as part of the release graphs as well as a periodic check, I'm not convinced nagios is the best check for this. We could put the same code in multiple places and have nagios as well as the in-tree variant, but that increases code support complexity.

Given the multiple locations, I think I will:
1. Use python, and pytest, to manage the tests
2. Separate out the expected values from the test logic, so that they can be provided by nagios/balrog checker/something else
3. For a first deployment, get this into a periodic test, somewhere, as its actual location is less important than it running.
Assignee: nobody → sfraser
Thanks for filing this bug, this is great information gathered. Just FYI, there's a larger effort, not only for bouncer aliases but for all tasks that are leafs in the release graph to be tracked and tested before and after to prevent issues. Tracking bug is 1445946. I'll chain it here for reference as it might be useful later on.
See Also: → 1445946
mbrandt has some periodic tests checking bouncer aliases and if they align with product-details data. Should we use them instead or maybe adopt them somehow?
(In reply to Rail Aliiev [:rail] ⌚️ET from comment #2)
> mbrandt has some periodic tests checking bouncer aliases and if they align
> with product-details data. Should we use them instead or maybe adopt them
> somehow?

Seems like a good idea, we avoid reinventing too much. We could add another notification path to them, perhaps.
(In reply to Rail Aliiev [:rail] ⌚️ET from comment #2)
> mbrandt has some periodic tests checking bouncer aliases and if they align
> with product-details data. Should we use them instead or maybe adopt them
> somehow?

IIRC after talking to catlee about this, he was totally fine duping some of that work/logic. We definitely want to run those more often (or before/after we do changes). For periodic, I think we can rely on mbrandt's stuff, but in the RelEng harness we should be doing these checks before/after for sure. The concern was that the last time we got hit by this (when beta aliases updated release aliases), even those tests that run periodically found the issue 1.5h after the fact. That was good, but it could have been better if we had chained a task after the bouncer aliases to sanity-check that. Others as well; this is not the only task that needs coverage.

++ to more notification for this, great idea.
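For the incident mentioned above, where beta aliases updated release aliases, a chained post-update task could include a check along these lines. This is only a sketch. It assumes the redirect target contains a `/releases/<version>/` path segment and that beta versions carry a `b<N>` suffix such as `61.0b3`. The function name and example URLs are made up.

```python
import re

# Assumption: beta builds appear as ".../releases/<major>.<minor>b<N>/..."
# in the resolved URL; release builds have no "b<N>" suffix.
_BETA_VERSION = re.compile(r"/releases/\d+(?:\.\d+)*b\d+/")


def release_alias_points_at_beta(target_url):
    """Return True if a resolved release-alias URL looks like a beta build."""
    return bool(_BETA_VERSION.search(target_url))
```

Run right after the alias update, a `True` result for a release alias would be the signal to alert instead of waiting for the next periodic run.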
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)
> (In reply to Rail Aliiev [:rail] ⌚️ET from comment #2)
> > mbrandt has some periodic tests checking bouncer aliases and if they align
> > with product-details data. Should we use them instead or maybe adopt them
> > somehow?
>
> Seems like a good idea, we avoid reinventing too much. We could add another
> notification path to them, perhaps.

I am happy to assist in any way that I can. Our bouncer tests currently run on a 15 cronjob.

+1 to adding more/better notification paths. It would also be interesting to chain the tests to run as a step vs cron. We're using Jenkins, so this should in theory be configurable.
Component: General Automation → General
(In reply to Matt Brandt [:mbrandt] from comment #5)
> (In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)
> > (In reply to Rail Aliiev [:rail] ⌚️ET from comment #2)
> > > mbrandt has some periodic tests checking bouncer aliases and if they align
> > > with product-details data. Should we use them instead or maybe adopt them
> > > somehow?
> >
> > Seems like a good idea, we avoid reinventing too much. We could add another
> > notification path to them, perhaps.
>
> I am happy to assist in anyway that I can. Our bouncer tests currently run
> on a 15 cronjob.
> +1 to adding more/better notification paths. It would also be interesting to
> chain the tests to run as a step vs cron. We're using Jenkins, so this
> should in theory be configurable.

How difficult would it be to add notification paths? Or to run the test on demand, as part of release promotion?
Flags: needinfo?(mbrandt)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #6)
> How difficult would it be to add notification paths? Or to run the test on
> demand, as part of release promotion?

Adding a notification path should be straightforward. Were you thinking of email, IRC, etc.? Running on demand would be a bit more work and is outside my area of experience, but in theory it is also possible.
Flags: needinfo?(mbrandt) → needinfo?(sfraser)
Which notification path do you use at the moment? Let's get something running on the same method, but to RelEng, too, at least as a first pass. If you point me at the code I can have a look at running it on demand for our purposes.
Flags: needinfo?(sfraser) → needinfo?(mbrandt)
We're currently using several paths: IRC, email, and Treeherder.
https://github.com/mozilla-services/go-bouncer/blob/master/tests/e2e/Jenkinsfile#L48-L81
Flags: needinfo?(mbrandt)
Flags: needinfo?(sfraser)
Apologies, course & travel, this got away from me. I'm not sure from reading the Jenkinsfile what actually does the work, there. Getting it to run in-tree would likely mean rewriting it. Could irc#releaseduty be added to notifications?
Flags: needinfo?(sfraser)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #10)
> Apologies, course & travel, this got away from me. I'm not sure from reading
> the Jenkinsfile what actually does the work, there. Getting it to run
> in-tree would likely mean rewriting it.

No worries, I've been offline and on PTO for a bit myself. Maybe we can explore this for next quarter.

> Could irc#releaseduty be added to notifications?

This looks fairly straightforward. I've opened this PR: https://github.com/mozilla-services/go-bouncer/pull/243. How does that sound, :sfraser?
Flags: needinfo?(sfraser)
A caveat that I forgot to mention: if this gets merged, failed builds would be reported to the channel in a format that includes a URL to the Jenkins build. To view the failure you'd need to configure a proxy to bastion, https://mana.mozilla.org/wiki/display/TestEngineering/qa-master.fxtest.jenkins.stage.mozaws.net.
(In reply to Matt Brandt [:mbrandt] from comment #11)
> > Could irc#releaseduty be added to notifications?
> This looks fairly straightforward, I've opened this pr
> https://github.com/mozilla-services/go-bouncer/pull/243. How does sound
> :sfraser?

Works for me. Thank you!
Flags: needinfo?(sfraser)
Component: General → Release Automation: Bouncer
QA Contact: catlee

Mass-removing myself from cc; search for 12b9dfe4-ece3-40dc-8d23-60e179f64ac1 or any reasonable part thereof, to mass-delete these notifications (and sorry!)

I think we have a task that checks bouncer aliases, no? Resolved?

QA Contact: mtabara

Maybe Bug 1469803 took care of this?

I think this is done; we run this frequently for each release tree, afaik.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Depends on: 1469803