nagios checks for stub installer bouncer entries

RESOLVED FIXED

Status

Release Engineering
General Automation
P2
major
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: bhearsum, Assigned: hwine)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [nagios])

Attachments

(3 attachments, 2 obsolete attachments)

(Reporter)

Description

4 years ago
The firefox-beta-latest bouncer entry was broken for about 12h recently, meaning that all new beta installs through stub installer would have failed during that time.

This is pretty unacceptable; we should have nagios checks to make sure all of these bouncer entries are returning an acceptable response. Specifically:
firefox-latest
firefox-nightly-latest
firefox-aurora-latest
firefox-beta-latest
firefox-beta-stub
firefox-aurora-stub
firefox-release-stub
Web QA's tests caught this, but due to the primary people being on vacation, this didn't get escalated properly: http://qa-selenium.mv.mozilla.com:8080/view/Bouncer/job/bouncer.prod/

Updated

4 years ago
Severity: normal → major
Priority: -- → P2
Whiteboard: [nagios]
(Assignee)

Updated

4 years ago
Assignee: nobody → hwine
(Assignee)

Comment 2

4 years ago
Confirming requirement in comment 0 - for each of the above downloadable, an acceptable response is:
 - one HTTP 200 response after following redirects

Any other checks required?
Flags: needinfo?(bhearsum)
I'm not sure if Ben was suggesting that we test correctness too, but 
 http://viewvc.svn.mozilla.org/vc/libs/product-details/firefoxDetails.class.php?view=markup
is the usual way we track released versions on mozilla.org websites.

That could be useful for aurora/beta/release, dunno if hacks like those required to support windows-only 20.0.1 would cause a problem. It doesn't have version information for firefox-nightly-latest though, and looking that up in hg could be problematic (eg where RelMan bump the version at merge time but we haven't built a nightly or updated bouncer yet, so we get a false alarm for up to 24 hours).
(Reporter)

Comment 4

4 years ago
(In reply to Nick Thomas [:nthomas] from comment #3)
> I'm not sure if Ben was suggesting that we test correctness too, but 
>  http://viewvc.svn.mozilla.org/vc/libs/product-details/firefoxDetails.class.
> php?view=markup
> is the usual way we track released versions on mozilla.org websites.

> That could be useful for aurora/beta/release, dunno if hacks like those
> required to support windows-only 20.0.1 would cause a problem. It doesn't
> have version information for firefox-nightly-latest though, and looking that
> up in hg could be problematic (eg where RelMan bump the version at merge
> time but we haven't built a nightly or updated bouncer yet, so we get a
> false alarm for up to 24 hours).

I was actually just thinking of looking for a 2xx or 3xx response. I wouldn't object to correctness testing too, if we can find a way to do it that will work with some of the crazier things like single platform releases.
Flags: needinfo?(bhearsum)
(Assignee)

Comment 5

4 years ago
From discussions, the problem we're trying to catch here is a corner case around stub installers, funnel cake, bouncer entries, and releases.

While QA's tests (comment #1) should also catch this, we're adding an additional safety net.

There are two common ways for a bouncer entry error to manifest:
 - point to a non-existent link
 - incorrectly hard code one locale

To cover both, we'll check both:
 - the en-US link following redirects to an eventual 2xx status code (pass)
 - a non en link following redirects to an eventual 2xx status code AND a different location (except for stub installers)
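A minimal sketch of how the "follow redirects to an eventual status code and location" step might look with the Python 3 standard library. The `https://` scheme, the `os=win` default, and the helper names `bouncer_url` and `final_location` are assumptions for illustration; only the host and the query parameters appear in this bug. The `opener` parameter is there so the function can be exercised without network access.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the scheme is not stated in this bug, only the host name.
BOUNCER = "https://download.mozilla.org/"


def bouncer_url(product, lang, os_name="win"):
    """Build a bouncer URL for one product/locale (hypothetical helper)."""
    query = urlencode([("product", product), ("os", os_name), ("lang", lang)])
    return BOUNCER + "?" + query


def final_location(url, opener=urlopen, timeout=30):
    """Return (status_code, final_url) after redirects have been followed.

    urlopen follows redirects itself and raises HTTPError for 4xx/5xx,
    so an error status is reported together with the requested URL.
    """
    try:
        response = opener(url, timeout=timeout)
    except HTTPError as err:
        return err.code, url
    try:
        return response.getcode(), response.geturl()
    finally:
        response.close()
```

A caller would compare the second element of the tuple across locales to spot a hard-coded locale.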
For posterity, since the URL in comment 1 is to a VPN-internal server, our test repo is public, and lives at: https://github.com/mozilla/bouncer-tests
(Assignee)

Comment 7

4 years ago
(In reply to Hal Wine [:hwine] from comment #5)
> To cover both, we'll check both:
>  - the en-US link following redirects to an eventual 2xx status code (pass)
>  - a non en link following redirects to an eventual 2xx status code AND a
> different location (except for stub installers)

Nightly & aurora also can't be tested for locale sanity. Summarizing:
    Products that can only be tested for 2xx status code on en-US:
        firefox-nightly-latest
        firefox-aurora-latest
        firefox-beta-stub
        firefox-aurora-stub
        firefox-release-stub
        
    Products that can additionally be checked for locale differences:
        firefox-latest
        firefox-beta-latest
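The two groups above could be encoded roughly as follows. This is a sketch of the decision rules only, not the attached plugin: the names `EN_US_ONLY`, `LOCALE_CHECKED`, and `evaluate` are hypothetical, and the input is assumed to be a mapping from locale to a `(status, final_url)` pair gathered elsewhere.

```python
# Products that can only be tested for a 2xx status code on en-US.
EN_US_ONLY = frozenset([
    "firefox-nightly-latest",
    "firefox-aurora-latest",
    "firefox-beta-stub",
    "firefox-aurora-stub",
    "firefox-release-stub",
])

# Products that can additionally be checked for locale differences.
LOCALE_CHECKED = frozenset(["firefox-latest", "firefox-beta-latest"])


def evaluate(product, results):
    """Return a list of problem strings; an empty list means pass.

    results maps a locale to (status, final_url) for that locale.
    """
    problems = []
    for locale, (status, _url) in sorted(results.items()):
        if not 200 <= status < 300:
            problems.append("%s %s: HTTP %d" % (product, locale, status))
    if product in LOCALE_CHECKED:
        urls = set(url for _status, url in results.values())
        # Each locale should end up at its own download location.
        if len(urls) < len(results):
            problems.append("%s: same location for several locales" % product)
    return problems
```

For products in `EN_US_ONLY`, the caller would pass only the en-US result, so just the status rule applies.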
(Assignee)

Comment 8

4 years ago
Created attachment 744320 [details]
NRPE nagios plugin to check bouncer entries

Catlee for feedback on approach:
 - plugin in python
 - requires nagiosplugin module installed in venv
 - to run on cruncher, and be polled via NRPE from nagios master

bhearsum for content:
 - all new code
 - py2.6 changes are duck punched in for a routine that is only in the py2.7
   standard library
 - will attach nagiosplugin changes separately
Attachment #744320 - Flags: review?(bhearsum)
Attachment #744320 - Flags: feedback?(catlee)
(Assignee)

Comment 9

4 years ago
Created attachment 744325 [details] [diff] [review]
patch to make nagiosplugin work in py26

The incompatibility: the positional index in '{}'.format('spam') became optional in py2.7/3.1, so it had to be re-added for py2.6.
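For illustration, the difference comes down to whether the replacement field carries an explicit index. The function names below are hypothetical and are not taken from the patch.

```python
def label_py26(name):
    # Explicit index: accepted by Python 2.6 as well as 2.7 and 3.x.
    return 'product {0}'.format(name)


def label_py27(name):
    # Auto-numbered field: fine on 2.7/3.1+, but Python 2.6 raises
    # ValueError ("zero length field name in format").
    return 'product {}'.format(name)
```

On any interpreter that accepts both forms, the two functions return the same string.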

Other minor tweaks to avoid using buildout
Attachment #744325 - Flags: review?(bhearsum)
(Assignee)

Comment 10

4 years ago
As a "stress test", the code is running under user nagios via cron on cruncher, with output to /tmp/check_bouncer.out. The script is in ~hwine/nagios/
(Assignee)

Updated

4 years ago
Status: NEW → ASSIGNED
(Reporter)

Comment 11

4 years ago
Comment on attachment 744325 [details] [diff] [review]
patch to make nagiosplugin work in py26

Review of attachment 744325 [details] [diff] [review]:
-----------------------------------------------------------------

stampy stamp stamp stamp
Attachment #744325 - Flags: review?(bhearsum) → review+
(Reporter)

Comment 12

4 years ago
Comment on attachment 744320 [details]
NRPE nagios plugin to check bouncer entries

Do we have to take such a heavy handed approach? Nagios already comes with a "check_http" that seems like it should be able to do this through multiple different checks rather than one big one. Eg:
[bhearsum@cruncher.srv.releng.scl3 plugins]$ ./check_http -H download.mozilla.org --method=GET --url="/?product=firefox-aurora-latest&os=win&lang=en-US" --onredirect=follow
HTTP OK: HTTP/1.1 200 OK - 21977202 bytes in 1.751 second response time |time=1.751461s;;;0.000000 size=21977202B;;;0
(Assignee)

Comment 13

4 years ago
Existing plugins could handle the "is the URL responsive" check. However, they don't keep state, so they can't do the "location is different per locale" check.

Since deleting the locale placeholder is an easy-to-make human error, I understood testing for that to be part of the requirements. If not, we could make use of the built-in.

The other slight advantage to the heavy approach is, once it's in place, we're independent of having nagios config changes made - we just update the plugin and re-deploy.
(Assignee)

Comment 14

4 years ago
I've had this running via cron for a while now. Reviewing the output of 1023 probes (~7 days) shows two reported issues:
   1021 BOUNCERENTRY OK - pass - 7 products checked
      1 BOUNCERENTRY UNKNOWN: Timeout: check execution aborted after 30s
      1 BOUNCERENTRY CRITICAL - FAIL - 7 products checked

The "UNKNOWN" entry should be benign - nagios logic should hide a single instance.

The "CRITICAL" entry might be problematic, as it apparently caught the transition between internal and CDN hosting:
    BOUNCERENTRY CRITICAL - FAIL - 7 products checked
    multiple locales for non-localized firefox-aurora-stub - found 2 different values https://download-installer.cdn.mozilla.net/pub/mozilla.org/firefox/nightly/latest-mozilla-aurora/firefox-22.0a2.en-US.win32.installer-stub.exe, https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-aurora/firefox-22.0a2.en-US.win32.installer-stub.exe (check_bouncer.py:66)
    BOUNCERENTRY OK - pass - 7 products checked

I'm just noting for now, and will continue to monitor. I believe an occasional soft error as above won't cause any real problem.
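The two locations in the CRITICAL report above differ only in host name. One possible way to keep such a hosting transition from tripping the check would be to compare URL paths rather than full URLs. This is a sketch of that idea under that assumption, not something the plugin is known to do, and `distinct_paths` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def distinct_paths(urls):
    """Return the set of URL paths, ignoring scheme and host."""
    return set(urlsplit(url).path for url in urls)


# The two locations reported in the CRITICAL probe.
reported = [
    "https://download-installer.cdn.mozilla.net/pub/mozilla.org/firefox/"
    "nightly/latest-mozilla-aurora/"
    "firefox-22.0a2.en-US.win32.installer-stub.exe",
    "https://ftp.mozilla.org/pub/mozilla.org/firefox/"
    "nightly/latest-mozilla-aurora/"
    "firefox-22.0a2.en-US.win32.installer-stub.exe",
]
paths = distinct_paths(reported)
```

With host names ignored, the two reported locations collapse to a single path. The trade-off is that a bouncer entry pointing at the wrong host would no longer be noticed by this comparison.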
(Assignee)

Comment 15

4 years ago
Comment on attachment 744325 [details] [diff] [review]
patch to make nagiosplugin work in py26

Checked into brain dump, as this will likely need tests if business logic gets any more complex.

https://hg.mozilla.org/build/braindump/rev/f01292d98e78
Attachment #744325 - Flags: checked-in+
(Assignee)

Updated

4 years ago
Depends on: 870596
Comment on attachment 744320 [details]
NRPE nagios plugin to check bouncer entries

What value do we get calling curl externally vs. using httplib/urllib or the 3rd party requests module to make the http calls directly from python?

Nits:

Typical python style for imports is standard lib imports first, then 3rd party imports you rely on, then your own modules. So the nagiosplugin imports at least should be grouped together last. argparse is new in 2.7, so we'll need that installed separately for machines with 2.6.

I'd use 'if not hasattr(subprocess, "check_output")' rather than "check_output" in dir(subprocess). Calling the function check_output rather than f will give better tracebacks too.
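A sketch of the suggested pattern, assuming the usual shape of a `check_output` backport. The attachment itself is not reproduced in this bug, so the function body below is an illustration rather than the reviewed code.

```python
import subprocess
import sys


def check_output(*popenargs, **kwargs):
    """Run a command and return its stdout; raise on a non-zero exit."""
    if "stdout" in kwargs:
        raise ValueError("stdout argument not allowed, it will be overridden")
    process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs, **kwargs)
    output, _unused_err = process.communicate()
    retcode = process.poll()
    if retcode:
        cmd = kwargs.get("args")
        if cmd is None:
            cmd = popenargs[0]
        raise subprocess.CalledProcessError(retcode, cmd)
    return output


# Only install the backport where the standard library lacks the routine.
if not hasattr(subprocess, "check_output"):
    subprocess.check_output = check_output

# Quick self-check: run the current interpreter and capture its output.
result = check_output([sys.executable, "-c", "print('ok')"])
```

Naming the function `check_output` means a traceback shows that name instead of `f`, which is the point made in the review.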
Attachment #744320 - Flags: feedback?(catlee) → feedback+
(Assignee)

Comment 17

4 years ago
(In reply to Chris AtLee [:catlee] from comment #16)
> Comment on attachment 744320 [details]
> NRPE nagios plugin to check bouncer entries
> 
> What value do we get calling curl externally vs. using httplib/urllib or the
> 3rd party requests module to make the http calls directly from python?

None - I was more familiar with that output than using urllib. Since that also gets rid of (some of the) 2.6 headaches, I'll look at making that change.

> 
> Nits:
> 
> Typical python style for imports is standard lib imports first, then 3rd
> party imports you rely on, then your own modules. So the nagiosplugin
> imports at least should be grouped together last. argparse is new in 2.7, so
> we'll need that installed separately for machines with 2.6.

Correct, and noted in the install notes, which have been added since at https://hg.mozilla.org/build/braindump/file/0e38c14970b6/nagios-related/check_bouncer.rst

> 
> I'd use 'if not hasattr(subprocess, "check_output")' rather than
> "check_output" in dir(subprocess). Calling the function check_output rather
> than f will give better tracebacks too.

Noted - I used copy/paste of reported good code, so didn't want to modify.
Another viewpoint is that using 'f' will give a heads-up to someone debugging that this is not the stdlib implementation.
(Assignee)

Comment 18

4 years ago
Comment on attachment 744325 [details] [diff] [review]
patch to make nagiosplugin work in py26

wrong patch committed - this one no longer needed as py26 changes landed upstream.
Attachment #744325 - Flags: checked-in+
(Assignee)

Comment 19

4 years ago
Comment on attachment 744325 [details] [diff] [review]
patch to make nagiosplugin work in py26

nagiosplugin upstream now has py2.6 patches just past 1.0.0 - pull from there instead of maintaining fork.
Attachment #744325 - Attachment is obsolete: true
(Assignee)

Comment 20

4 years ago
Created attachment 748033 [details]
new python script

New revision, incorporating feedback from :catlee on prior. No more subprocess!

Also renamed to remove '.py' to match nagios plugin conventions.
Attachment #744320 - Attachment is obsolete: true
Attachment #744320 - Flags: review?(bhearsum)
Attachment #748033 - Flags: review?(bhearsum)
(Assignee)

Comment 21

4 years ago
Created attachment 748035 [details]
docs for installation

Docs based on need to install on bm36
Attachment #748035 - Flags: review?(bhearsum)
This caught us forgetting to update firefox-aurora-stub when re-enabling updates to Aurora last week, combined with me cleaning up the 22.0a2 builds bouncer was pointing at. There was no alert in #buildduty, so please remember to enable notifications once we get through the review process here.
(Assignee)

Comment 23

4 years ago
Metric looks stable, so enabling notifications to get benefit while waiting for review. See:
 http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/
(Reporter)

Comment 24

4 years ago
Comment on attachment 748033 [details]
new python script

>#!/tools/venvs/nagiosplugin/bin/python

This shebang looks wrong.

>    def __init__(self, products=None, alt_locale=None):
>        self.products = products
>        if self.products is None:
>            self.products = default_products
>        self.locales = ['en-US']
>        if alt_locale is None:
>            self.locales.append('fr')  # major language, should be everywhere

Better to use 'zh-TW' for this - it's the locale that Sentry uses when scanning mirrors.

>class BouncerSummary(nagiosplugin.Summary):
>    def ok(self, results):
>        return "pass - %d products checked" % (len(default_products),)
>
>    def problem(self, results):
>        return "FAIL - %d products checked" % (len(default_products),)

It would be good to list some or all of the products/locales that failed. Maybe just en-US + one locale. Without that info it's very difficult to debug.
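One way the summary text could carry that detail is sketched below as a plain function, so it does not depend on the nagiosplugin API. The `(product, locale, reason)` tuple layout, the `limit` parameter, and the name `summarize` are assumptions for illustration.

```python
def summarize(failures, checked, limit=3):
    """Build a one-line status message for the nagios summary.

    failures is a list of (product, locale, reason) tuples;
    checked is the number of products that were probed.
    """
    if not failures:
        return "pass - %d products checked" % checked
    # Show the first few failures so the alert itself is debuggable.
    shown = ", ".join("%s/%s (%s)" % item for item in failures[:limit])
    extra = len(failures) - limit
    if extra > 0:
        shown += ", +%d more" % extra
    failed_products = len(set(product for product, _, _ in failures))
    return "FAIL - %d of %d products: %s" % (failed_products, checked, shown)
```

Capping the list keeps the alert line short while still naming at least one failing product and locale.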

>
>@nagiosplugin.guarded
>def main():
>    argp = argparse.ArgumentParser(description=__doc__)
>    argp.add_argument('-v', '--verbose', action='count', default=0,
>                      help='may be used up to 3 times')
>    argp.add_argument('-t', '--timeout', default=30,
>                      help='abort execution after TIMEOUT seconds')
>    args = argp.parse_args()

I'd really love to have the list of products passed in as arguments. Without that, we need to deploy a new version of the plugin to change what it's looking at. Not a deal breaker, just unfortunate.
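A sketch of what passing the products on the command line might look like, extending the parser quoted above. The `-p/--product` option, the `parse` helper, and `type=int` on the timeout are assumptions, not part of the reviewed script; the default list is the one from comment 0.

```python
import argparse

# Default products, as listed in comment 0 of this bug.
DEFAULT_PRODUCTS = [
    "firefox-latest",
    "firefox-nightly-latest",
    "firefox-aurora-latest",
    "firefox-beta-latest",
    "firefox-beta-stub",
    "firefox-aurora-stub",
    "firefox-release-stub",
]


def build_parser():
    argp = argparse.ArgumentParser(description="check bouncer entries")
    argp.add_argument('-v', '--verbose', action='count', default=0,
                      help='may be used up to 3 times')
    argp.add_argument('-t', '--timeout', type=int, default=30,
                      help='abort execution after TIMEOUT seconds')
    # Hypothetical option: repeat -p once per product to check.
    argp.add_argument('-p', '--product', dest='products', action='append',
                      help='bouncer product to check (repeatable)')
    return argp


def parse(argv=None):
    """Parse argv and fall back to the built-in product list."""
    args = build_parser().parse_args(argv)
    if not args.products:
        args.products = list(DEFAULT_PRODUCTS)
    return args
```

With this shape, changing what is monitored becomes a nagios/NRPE command-line change rather than a plugin redeploy, which is the trade-off discussed in comment 13.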

r=me with the first three things addressed. I'd really like to see the products passed in too, but I won't block on that.
Attachment #748033 - Flags: review?(bhearsum) → review-
(Reporter)

Comment 25

4 years ago
Comment on attachment 748035 [details]
docs for installation

r=me assuming the interface stays the same after the comments are addressed.
Attachment #748035 - Flags: review?(bhearsum) → review+
(Assignee)

Comment 26

4 years ago
Need to update to pull "which beta should be there" based on http://viewvc.svn.mozilla.org/vc/libs/product-details/firefoxDetails.class.php?view=markup
(Assignee)

Updated

4 years ago
Blocks: 885560
Product: mozilla.org → Release Engineering
Created attachment 8340827 [details] [diff] [review]
Improved coverage

This adds firefox-stub and firefox-latest-euballot, and passes the locales to be tested in explicitly (which allows some cleanup of an __init__). We don't have an en-US build for EUBallot, so we check en-GB instead.

This diff is against https://hg.mozilla.org/build/nagios-tools.
Attachment #8340827 - Flags: review?(hwine)
(Assignee)

Comment 28

3 years ago
Comment on attachment 8340827 [details] [diff] [review]
Improved coverage

Review of attachment 8340827 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm - nice refactor

::: nagios_tools/scripts/check_bouncer.py
@@ +12,5 @@
>                                re.MULTILINE + re.DOTALL)
>      re_last_location = re.compile(r'''..*^Location:\s+(\S+)''',
>                                    re.MULTILINE + re.DOTALL)
>  
> +    def __init__(self, product_name, locales=['en-US', 'fr']):

nit: non None default for mutable object
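The nit refers to a well-known Python pitfall: a default value is evaluated once, when the function is defined, so a list default is shared between calls. A small illustration with hypothetical class names:

```python
class SharedDefault(object):
    # The same list object is reused for every call that omits locales.
    def __init__(self, product_name, locales=['en-US', 'fr']):
        self.product_name = product_name
        self.locales = locales


class TupleDefault(object):
    # An immutable default cannot be changed by accident; copy it into
    # a fresh list if the instance needs to modify its own locales.
    def __init__(self, product_name, locales=('en-US', 'fr')):
        self.product_name = product_name
        self.locales = list(locales)


first = SharedDefault('firefox-latest')
first.locales.append('en-GB')
# The mutation above leaks into every later instance.
second = SharedDefault('firefox-beta-latest')
```

After this runs, `second.locales` also contains 'en-GB', while instances of `TupleDefault` each get their own list.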
Attachment #8340827 - Flags: review?(hwine) → review+
Comment on attachment 8340827 [details] [diff] [review]
Improved coverage

Switched to a tuple on the locales default:
 http://hg.mozilla.org/build/nagios-tools/rev/d978a701311b

How do I deploy this to production? All I can find is http://mxr.mozilla.org/build/source/puppet/modules/bouncer_check/manifests/init.pp
Attachment #8340827 - Flags: checked-in+
Oh, I need to bump the version in nagios-tools/setup.py, and then in that init.pp?
(In reply to Nick Thomas [:nthomas] from comment #30)
> Oh, I need to bump the version in nagios-tools/setup.py, and then in that
> init.pp ?

Yes. Plus deploy the tarball to puppetagain.
My patch got deployed in bug 950723, where the firefox-release-stub check was removed.
(Reporter)

Comment 33

3 years ago
I think this is done?
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED