Closed Bug 1214487 Opened 9 years ago Closed 8 years ago

Implement uptake monitoring on slaves/workers

Categories

(Release Engineering :: Release Automation: Other, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: mtabara)

References

Details

Attachments

(9 files, 1 obsolete file)

The current implementation runs as part of the master process; see http://hg.mozilla.org/build/buildbotcustom/file/d93d0b0b1c3e/scheduler.py#l218

Even though the surrounding scheduler code (boilerplate!) is a bit complicated, the actual business logic is here: http://hg.mozilla.org/build/tools/file/19c4bb07e1a8/lib/python/util/tuxedo.py#l66

Basically it just checks multiple URLs (API endpoints behind auth) and calculates some minimum values. The API is XML-based.

We could use TC if the secret service is ready.
Assignee: nobody → tabara.mihai
Server side implementation: https://github.com/mozilla/tuxedo/blob/master/apps/api/views.py#L57

It accepts 3 params (https://github.com/mozilla/tuxedo/blob/master/apps/api/views.py#L57):

product: Firefox-$version
os: a platform (see http://hg.mozilla.org/build/tools/file/tip/lib/python/release/platforms.py#l7)
fuzzy: if set to true, will match Firefox-42.0 when you search for Firefox-4

Example request: https://bounceradmin.mozilla.com/api/uptake/?product=Firefox-36.0.4-Partial-35.0.1&os=win

example response:

<mirror_uptake>
  <item>
    <available>2000100</available>
    <product>Firefox-36.0.4-Partial-35.0.1</product>
    <total>2000100</total>
    <os>win</os>
  </item>
</mirror_uptake>

If you skip "os", the response will contain all "os"es:

https://bounceradmin.mozilla.com/api/uptake/?product=Firefox-36.0.4-Partial-35.0.1

<?xml version="1.0"?>
<mirror_uptake>
  <item>
    <available>2000100</available>
    <product>Firefox-36.0.4-Partial-35.0.1</product>
    <total>2000100</total>
    <os>linux</os>
  </item>
  <item>
    <available>2000100</available>
    <product>Firefox-36.0.4-Partial-35.0.1</product>
    <total>2000100</total>
    <os>win</os>
  </item>
  <item>
    <available>2000100</available>
    <product>Firefox-36.0.4-Partial-35.0.1</product>
    <total>2000100</total>
    <os>linux64</os>
  </item>
  <item>
    <available>2000100</available>
    <product>Firefox-36.0.4-Partial-35.0.1</product>
    <total>2000100</total>
    <os>osx</os>
  </item>
</mirror_uptake>
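For reference, here's a minimal sketch of consuming this endpoint: fetch one product/os pair and take the minimum "available" across the returned items. It assumes the `requests` library and HTTP basic auth; the helper name and credentials are illustrative, and the production logic lives in util/tuxedo.py linked in comment 0.

import requests
import xml.etree.ElementTree as ET

UPTAKE_URL = "https://bounceradmin.mozilla.com/api/uptake/"

def get_min_uptake(product, os_, auth):
    # One request per product/os pair; the response is the
    # <mirror_uptake> XML shown above.
    resp = requests.get(UPTAKE_URL,
                        params={"product": product, "os": os_},
                        auth=auth, timeout=60)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    # Take the minimum across items so we don't declare uptake OK
    # while any entry is still lagging.
    return min(int(item.findtext("available"))
               for item in root.findall("item"))

# e.g.:
# get_min_uptake("Firefox-36.0.4-Partial-35.0.1", "win",
#                ("tuxedoUsername", "tuxedoPassword"))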


We use the first method (explicit "os") and make requests for every os we care about. This way we can add the contrib platforms (Solaris, for example) without blocking ourselves on their uptake.

We add these products in http://hg.mozilla.org/mozilla-central/file/tip/testing/mozharness/scripts/bouncer_submitter.py using configs from http://hg.mozilla.org/mozilla-central/file/tip/testing/mozharness/configs/releases

We can probably reuse that code, especially the configs. We could mark some platforms as not participating in uptake checks, something like "check_uptake": True/False. That should also simplify generating the list of products to be checked without relying on buildbot-configs; see the sketch below.
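A rough sketch of what such a config could look like. The "check_uptake" key is the flag proposed above and doesn't exist yet; the product entries are illustrative stand-ins for the real bouncer_submitter configs (which use the same "product-name"/%(version)s templating):

config = {
    "products": {
        "installer": {
            "product-name": "Firefox-%(version)s",
            "check_uptake": True,
        },
        "contrib": {
            # hypothetical entry for a contrib platform product
            "product-name": "Firefox-%(version)s-Contrib",
            "check_uptake": False,  # don't block the release on this one
        },
    },
    "partials": {
        "releases-dir": {
            "product-name": "Firefox-%(version)s-Partial-%(prev_version)s",
            "check_uptake": True,
        },
    },
}

# Only products opted in end up in the list to poll:
products_to_check = [info["product-name"]
                     for info in config["products"].values()
                     if info.get("check_uptake")]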
Note that we can't use this from Taskcluster atm, due to bounceradmin (tuxedo) not having a netflow from the public internet.

We do have tuxedoUsername/Password on the masters (not the slaves); those can be passed down to mozharness if we set `use_credentials_file=True` in the buildbot factory for it.

See, for example, https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/bouncer/submitter.py#20 for how mozharness grabs that data for other uses.
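The pattern there boils down to exec'ing the credentials file the master drops and pulling the two tuxedo variables out of it. A hedged sketch (Python 2, as mozharness is; not the verbatim source):

def query_credentials(self):
    # The buildbot master writes a Python file defining
    # tuxedoUsername/tuxedoPassword; we execfile() it rather than parse it.
    global_dict = {}
    local_dict = {}
    execfile(self.config["credentials_file"], global_dict, local_dict)
    return (local_dict["tuxedoUsername"], local_dict["tuxedoPassword"])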
Also, these configs could be useful to generate a list of products that we need to check: https://dxr.mozilla.org/mozilla-central/search?q=path%3Abouncer&redirect=false&case=true
Depends on: 1267629
Throwing this up for a first glance. I still need to add:
* buildbotcustom builder
* tweak the plug-in within releasetasks (make sure I pass all the info I need)

+ testing
Attachment #8746211 - Flags: feedback?(rail)
Comment on attachment 8746211 [details] [diff] [review]
Uptake monitoring logic within mozharness

Review of attachment 8746211 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mozharness/scripts/release/uptake_monitoring.py
@@ +21,5 @@
> +from mozharness.base.script import BaseScript
> +from mozharness.mozilla.buildbot import BuildbotMixin
> +
> +
> +def get_tuxedo_uptake_url(tuxedo_server_url, related_product, os=None):

I think you can simplify this and always require `os`. At least, the only usage here passes `os`.

@@ +132,5 @@
> +                                                   related_product, bouncer_platform))
> +        # handle the partials as well
> +        subs = {
> +            "version": version
> +        }

I would move `subs` close to where you set subs["prev_version"] and just create it from scratch every time. This way the dict stays immutable.

@@ +137,5 @@
> +        prev_versions = self.config.get("partial_versions")
> +        for product, info in self.config["partials"].iteritems():
> +            product_template = info["product-name"]
> +            for prev_version in prev_versions:
> +                subs["prev_version"] = prev_version

as I said above:
  subs = {
      "version": version,
      "prev_version": prev_version
  }

@@ +162,5 @@
> +        def check_uptake(uptake, min_uptake=10000):
> +            self.info("Current uptake value to check is {}".format(uptake))
> +            if uptake >= min_uptake:
> +                return True
> +            return False

return uptake >= min_uptake 

:P

@@ +168,5 @@
> +        def poll():
> +            uptake = self._get_release_uptake(auth)
> +            return check_uptake(uptake)
> +
> +        def run_loop(poll_interval=60, poll_timeout=2*60*60):

2h is a bit long. It'd be better to know about possible issues early rather than late. I'd set it to something like 20m. Usually it doesn't take more than 10 minutes for sentry to check.

@@ +170,5 @@
> +            return check_uptake(uptake)
> +
> +        def run_loop(poll_interval=60, poll_timeout=2*60*60):
> +            import datetime
> +            import time

these are built-in modules; you can move them to the top to reduce clutter here

@@ +178,5 @@
> +                if delta > poll_timeout:
> +                    self.error("Uptake monitoring sadly timed-out")
> +                    raise Exception("Time-out during uptake monitoring")
> +                r = poll()
> +                if r:

poll() sounds a bit off, because it not only polls, but also returns some value. I'd probably combine poll() and check_uptake() and call the result get_uptake or check_uptake. Or you can just use self._get_release_uptake(auth) here instead and get rid of poll() and check_uptake().

@@ +185,5 @@
> +                else:
> +                    self.info("Mirrors not yet updated, sleeping for a bit ...")
> +                    time.sleep(poll_interval)
> +
> +        run_loop()

You can even get rid of run_loop() by removing the function definition and dedenting its body one level. ;) You just need to define poll_interval and poll_timeout somewhere, maybe in super(UptakeMonitoring, self).__init__(..., config={..., poll_interval: xx, ...}), and access them as self.config[...]. Something like the sketch below.
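Putting the suggestions above together, a sketch of the collapsed version (the min_uptake and poll_* config keys are illustrative names, not existing ones):

import time

def monitor_uptake(self, auth):
    poll_interval = self.config["poll_interval"]   # e.g. 60
    poll_timeout = self.config["poll_timeout"]     # e.g. 20 * 60
    min_uptake = self.config.get("min_uptake", 10000)
    start = time.time()
    while True:
        if time.time() - start > poll_timeout:
            self.error("Uptake monitoring sadly timed-out")
            raise Exception("Time-out during uptake monitoring")
        # poll() and check_uptake() folded into the loop body:
        uptake = self._get_release_uptake(auth)
        self.info("Current uptake value to check is {}".format(uptake))
        if uptake >= min_uptake:
            return
        self.info("Mirrors not yet updated, sleeping for a bit ...")
        time.sleep(poll_interval)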
Attachment #8746211 - Flags: feedback?(rail) → feedback+
Depends on: 1268824
Keywords: leave-open
Comment on attachment 8747009 [details]
MozReview Request: Bug 1214487 - Add uptake monitoring builder. r=rail

https://reviewboard.mozilla.org/r/49673/#review46435
Attachment #8747009 - Flags: review?(rail) → review+
Comment on attachment 8747010 [details]
MozReview Request: Bug 1214487 - Add uptake monitoring builder config. r=rail

https://reviewboard.mozilla.org/r/49679/#review46437

Can you also add similar entries for jamun and date in `mozilla/project_branches.py`? Otherwise we won't be able to run it on those branches.
Attachment #8747010 - Flags: review?(rail)
Comment on attachment 8746993 [details]
MozReview Request: Bug 1214487 - Implement uptake monitoring on slaves/workers. r=rail. a=release DONTBUILD

https://reviewboard.mozilla.org/r/49653/#review46439

A nit:

::: testing/mozharness/scripts/release/uptake_monitoring.py:93
(Diff revision 1)
> +        import requests
> +
> +        url = get_tuxedo_uptake_url(tuxedo_server_url, related_product, os)
> +        self.info("Requesting {} from tuxedo".format(url))
> +        # TODO: wipe off below line
> +        url = 'https://people.mozilla.org/~mtabara/tuxedo/uptake.xml'

I don't think you want to land this, do you? ;)
Attachment #8746993 - Flags: review?(rail)
Comment on attachment 8746993 [details]
MozReview Request: Bug 1214487 - Implement uptake monitoring on slaves/workers. r=rail. a=release DONTBUILD

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/49653/diff/1-2/
Attachment #8746993 - Flags: review?(rail)
Attachment #8747075 - Flags: review?(rail) → review+
Comment on attachment 8747075 [details]
MozReview Request: Bug 1214487 - Add uptake monitoring builder config. r=rail

https://reviewboard.mozilla.org/r/49711/#review46469
Attachment #8747078 - Flags: review?(rail) → review+
Comment on attachment 8747078 [details]
MozReview Request: Bug 1214487 - Implement uptake monitoring on slaves/workers. r=rail. a=release DONTBUILD

https://reviewboard.mozilla.org/r/49719/#review46473
Comment on attachment 8747070 [details]
MozReview Request: Bug 1214487 - Enable uptake monitoring flag in release runner. r=rail

https://reviewboard.mozilla.org/r/49709/#review46475
Attachment #8747070 - Flags: review?(rail) → review+
Comment on attachment 8746993 [details]
MozReview Request: Bug 1214487 - Implement uptake monitoring on slaves/workers. r=rail. a=release DONTBUILD

https://reviewboard.mozilla.org/r/49653/#review46477

Squashing would be great before we push.
Attachment #8746993 - Flags: review?(rail) → review+
There are a lot of patches to land; I'll put my thoughts down for now:
Note to self:
1. Squash + land buildbot-configs changes + merge
2. Squash + land gecko mozharness changes to inbound + cherry pick to aurora/beta/esr45
3. Land buildbotcustom changes (they depend on buildbot-configs and gecko mozharness)
4. Land release runner changes to default branch
5. Push PR for review, make sure tests pass, and merge to mozilla/releasetasks (still in progress)

Test on date/jamun to see if the uptake monitoring is working properly.
For the record, PR for releasetasks changes is https://github.com/mozilla/releasetasks/pull/170
(In reply to Mihai Tabara [:mtabara] from comment #20)
> 2. Squash + land gecko mozharness changes to inbound + cherry pick to
> aurora/beta/esr45

and mozilla-release, otherwise uptake monitoring won't work for a dot release.
I squashed and pushed the gecko patch to jamun: https://treeherder.mozilla.org/#/jobs?repo=jamun&revision=9f6c2dca91e7

To test things on jamun we need:

* wait for the build to be ready or use mh revision in ship-it
* buildbotcustom and buildbot-configs changes landed/merged/reconfigured
* deploy the release runner and releasetasks changes on bm83, restart release runner
* start a release and see how it fails :D - it'll fail because sentry may not work as expected in staging. We can try to build a 46.0 release using an older build and a fresh mh revision set in ship-it.
Just a little status update with the leftovers:

0. Done so far:

* buildbotcustom/bb-configs changes are landed/merged/reconfigured
* bm83 is patched with https://github.com/MihaiTabara/releasetasks/tree/uptake_monitoring
* bm83 releaserunner is also patched with the change and restarted - all ready to start a new release


1. Start a release and see how it fails :D - it'll fail because sentry may not work as expected in staging. We can try to build a 46.0 release using an older build and a fresh mh revision set in ship-it.

2. Once we sort things out and have a successful run, we need to merge:
* merge releasetasks PR to mozilla/releasetasks
* push release runner changes to default

3. Squash + land gecko mozharness changes to inbound + cherry-pick to aurora/beta/esr45/release (currently cherry-picked only on jamun)
Attached patch 1214487.diff (obsolete) — Splinter Review
We had a mock-up release on jamun this morning and a few issues came up:
* a rookie mistake of mine
* the old logic was iterating over all products/partials from mozharness configs, including the candidates dir and partials candidates from mozilla-release configs.

The current patch addresses both of these issues and is built atop jamun.
Attachment #8748718 - Flags: review?(rail)
Comment on attachment 8748718 [details] [diff] [review]
1214487.diff

Can you add "check_uptake": False to the other products, so it's explicit and we make fewer mistakes when copy/pasting code?
Attachment #8748718 - Flags: review?(rail)
Comment on attachment 8748801 [details] [diff] [review]
Add check_uptake flag to products considered for uptake monitoring

Review of attachment 8748801 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM. Can you also write another patch for the tb and fennec configs, so we don't forget them in the future? The new flag is harmless to have there.
Attachment #8748801 - Flags: review?(rail) → review+
I added the check_uptake flag for TB/Fennec and esr since it may be next in the promotion-ify process. If that's not the case, just let me know and I'll remove it.
Attachment #8748833 - Flags: review?(rail)
Attachment #8748833 - Flags: review?(rail) → review+
All the remaining pieces have been pushed to production:
* releasetasks https://github.com/mozilla/releasetasks/pull/170 merged
* tools repo changes http://hg.mozilla.org/build/tools/rev/6bc076797116
* gecko changes cherry-picked to inbound, aurora, beta, release, esr45 as Rail indicated above.

We've had successful uptake monitoring tests in several releases today on bm83.
I updated releasetasks/tools and restarted releaserunner on both bm83/bm85.

I'm closing the bug for now, will reopen if any issues come along.
Thank you Rail for all the help and patience in solving this bug!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
I am investigating a linux issue with talos (bug 1271948) where, on Monday May 9th, our data became unreliable; backfilling showed this was an infrastructure issue vs. an in-tree change.

I see this landed a few days before the issue started (I can safely ignore weekends due to low volume, we already have other patterns on weekends).

:mihai, can you explain what this does on the physical machines?  Does this affect the linux machines?  Does this run all the time or only as part of a specific job?
Flags: needinfo?(mtabara)
These jobs don't run on the talos machines, so probably aren't responsible.
Flags: needinfo?(mtabara)
Blocks: 1283853
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open