Closed Bug 1382204 Opened 7 years ago Closed 7 years ago

Enable coalescing for scm level 2,3 test tasks on macOS and win10 gpu and linux

Categories

(Testing :: General, enhancement)

Version 3
enhancement
Not set
normal

Tracking

(firefox57 fixed)

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: mtabara, Assigned: pmoore)

References

Details

Attachments

(7 files, 5 obsolete files)

1.97 KB, patch
dustin
: review+
pmoore
: checked-in+
Details | Diff | Splinter Review
46 bytes, text/x-github-pull-request
pmoore
: checked-in+
Details | Review
55 bytes, text/x-github-pull-request
pmoore
: checked-in+
Details | Review
53 bytes, text/x-github-pull-request
pmoore
: checked-in+
Details | Review
57 bytes, text/x-github-pull-request
grenade
: review+
markco
: review+
pmoore
: checked-in+
Details | Review
3.50 KB, patch
dustin
: review+
pmoore
: checked-in+
Details | Diff | Splinter Review
22.08 KB, patch
dustin
: review+
Details | Diff | Splinter Review
I think we also need to add the coalescing keys to the tasks like this:

https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/build/linux.yml#64
Blocks: 1382360
No longer blocks: 1382360
mtabara> dustin: hey! quick question when you have a spare minute. Am I to assume that in completition to https://coalesce.mozilla-releng.net/v1/threshold, we also need to set "coalesce-name" flag for the builds we want coalesced? https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/build/linux.yml#64
21:57:56 <@dustin> mtabara: yes, I think so
21:58:42 <catlee> mtabara: we need the tests coalesced, not the builds
21:58:54 <@dustin> mtabara: it may make sense to refactor that so that it's always set, or calculated based on something else
22:02:12 <mtabara> catlee: hm. not sure ... that's doable. dustin - can we coalesce tests as well or only builds?
22:02:28 <@dustin> anything that runs in docker-worker
22:02:35 <@dustin> ..which does not include mac tests
22:02:38 <catlee> doh
22:02:45 <@dustin> I think -- unless we have superseding support in g-w now?
22:03:15 <@dustin> I think that hasn't happened since we thought coalescing was dead
22:03:44 <@dustin> but I could be wrong and it could be implemented :)
22:04:17 <mtabara> who should I poke next?
22:04:57 <@dustin> https://github.com/taskcluster/generic-worker/search?utf8=%E2%9C%93&q=supersed&type= and https://github.com/taskcluster/generic-worker/search?utf8=%E2%9C%93&q=coalesc&type=
22:05:02 <@dustin> suggest it's not implemented
22:05:25 <catlee> why did you think it was dead?
22:05:26 <@dustin> mtabara: pmoore is the go-to person for generic-worker
22:05:46 <@dustin> I only vaguely recall, as it was a while ago
22:05:55 <@dustin> but it was a conversation with dividehex and amy iirc
@dustin> I don't know how long it would take to implement in g-w
22:21:24 <@dustin> it was kinda complicated in docker-worker
22:21:26 <@dustin> we'll see
&pmoore|away> mtabara|zzz: catlee: coalescing indeed never made it into generic worker, is that something we should prioritise?
mtabara> pmoore|away: thanks for reaching back. how much effort does that imply? we basically need coalescing to queue collapse the (large) mac testers queue backlog we've been experiencing lately.
08:56:29 <mtabara> but as I understand from du.stin, coalescing only addressed docker-workers which doesn't include mac testers
&pmoore|away> mtabara|mtg: I'll check with jonas.fj if it makes sense for us to do this directly in the queue now that we have the new claimWork endpoint but otherwise we can implement on the remaining workers
09:30:40 <&pmoore|away> sounds like we'll need it :)

Waiting a resolution from pmoore and jonasfj.
It is true that coalescing was only rolled out to docker worker.

@Jonas, is this something we'd prefer to implement in queue.claimWork? At the time coalescing was added to docker-worker, I believe we were still restricted by using queue.claimTask, but I'm guessing we can implement this (once) in the queue now, and get the benefit across all workers?
Flags: needinfo?(jopsen)
@pmoore,
I was just talking it over with bstack and figured that there is effectively no good way to do coalescing in the queue yet.
Once we move to postgres we could look into introducing sufficient metadata into the queue to allow coalesces, without
dependency on some external service.

The current setup, with an external service is a bit of a hack... resulting from not know exactly how we wanted to solve the
problem, or how the definition of the problem-space might change over time. In particularly we didn't want to build it into
the queue before we have the right constructs.

Note: with claimWork it should be possible to move it into the queue at some point, but we would have to invent some metadata
for controlling it, and move the queue to use more flexible data storage like postgres.
Flags: needinfo?(jopsen)
IMO, this is probably a thing we should talk about more. Thanks for bringing it up, there is certainly a point to the fact
that it could be implemented now that we have claimWork. We just need major changes in the queue first.
No worries, then I'll set about adding it to generic-worker for now. Thanks!
Depends on: 1383024
Blocks: 1386625
Hey Wander,

I released this in generic-worker 10.2.0 (by running `./run.sh 10.2.0` in the generic-worker repo). Would you be able to roll it out to the OS X workers via puppet, while I am away on PTO?

Many thanks!
Pete
Flags: needinfo?(wcosta)
This version implements task superseding.
Flags: needinfo?(wcosta)
Attachment #8894108 - Flags: review?(dustin)
You'll need to check if there are new config settings that need to be added/changed with the new release. The easiest way to do this will be to compare output of `generic-worker --help` between both versions to see what has changed. Just upgrading binary might not be sufficient. Thanks guys!
Comment on attachment 8894108 [details] [diff] [review]
Upgrade generic-worker to version 10.2.0. r=dustin

Review of attachment 8894108 [details] [diff] [review]:
-----------------------------------------------------------------

Subject to Pete's warning, looks good.
Attachment #8894108 - Flags: review?(dustin) → review+
Attachment #8894108 - Attachment is obsolete: true
This version implements task superseding.
Comment on attachment 8894537 [details] [diff] [review]
Upgrade generic-worker to version 10.2.0. r=dustin

Added sentry project name.
Attachment #8894537 - Flags: review?(dustin)
Comment on attachment 8894537 [details] [diff] [review]
Upgrade generic-worker to version 10.2.0. r=dustin

Review of attachment 8894537 [details] [diff] [review]:
-----------------------------------------------------------------

I see "generic-worker" and "docker-worker" in sentry, so this looks right.
Attachment #8894537 - Flags: review?(dustin) → review+
I will hold landing until beta is out this week.
https://hg.mozilla.org/build/puppet/rev/50468416e6ab
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Ops, we still need to update in tree config...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
@wcosta, thanks for taking care of the rollout while I was on PTO! :)


(In reply to Wander Lairson Costa [:wcosta] from comment #21)
> Ops, we still need to update in tree config...

Hi Wander, which config needs updating? Thanks!
Flags: needinfo?(wcosta)
So it looks like currently, only *builds* are coalesced in production (e.g. not tests) but our macs don't run builds (the mac builds are cross compiled on linux). So none of the tasks taken by the mac pool would be coalesced, if we apply the same coalescing strategy to macs as we have on linux.

Is this intentional? Should tests also be coalesced on linux/windows/mac?

(In reply to Pete Moore [:pmoore][:pete] (on PTO, back 21 Aug) from comment #22)
> @wcosta, thanks for taking care of the rollout while I was on PTO! :)
> 
> 
> (In reply to Wander Lairson Costa [:wcosta] from comment #21)
> > Ops, we still need to update in tree config...
> 
> Hi Wander, which config needs updating? Thanks!

Wander reminded me via IRC that the setting of `supersederUrl` need to be enabled in the gecko decision task for the tasks which need superseding. In working on a patch for this, I discovered that only builds are superseded/coalesced, and that the mac pool only runs tests.
Flags: needinfo?(wcosta) → needinfo?(catlee)
Yes, there's not really any point in coalescing builds. We only should be coalescing where we're constrained by hardware capacity.
Flags: needinfo?(catlee)
To further strengthen that point, we explicitly made the choice with buildbot to *not* coalesce builds (we used to) because it can non-trivially lengthen tree closure durations due to the added time needed for bisection when there's test failures during busy periods. So yes, please leave the builds alone in whatever you do :)
Thanks guys for the feedback.

I think it might make most sense to enable coalescing at a "kind" level, by implementing as a relatively trivial transform, that sets a boolean to true (but only for scm level 3). Using this, the payload builder for generic-worker (and docker-worker and any other worker implementations that support coalescing) could simply refer to this boolean to see whether they should generate a supersederUrl (and supersede task routes) in the task payload.

In this case, we would probably want to add the transform to:

  <gecko root>/taskcluster/ci/test/kind.yml

This way, it would be enabled for tests, but not builds, or other more obscure jobs.

@dustin/@catlee, do you think this would be sufficient, or do you see a need to be able to enable coalescing at a more granular level than just kind + scm level? Note, if the logic is more complex, we could just put the business logic for that decision in the transform itself.
Flags: needinfo?(dustin)
Flags: needinfo?(catlee)
Also note, currently on linux, currently coalescing is enabled for builds, not tests - so we probably want to switch that (but should do that in a different bug).
In other words, I'm proposing that the new transform we create would just look at scm level to decide whether to enable, coalescing, which would set a boolean flag, which the payload builder(s) would use to generate the actual payload portions needed. Alternatively it could generate the actual routes/supersederUrl, and the payload just uses them verbatim. Then if later we also want to alter the criteria for enabling coalescing, we would just alter the logic in the (coalescing) transform.

My last post wasn't so clear, so just wanted to write it more explicitly. :)
I think it makes sense to enable for all tests.  I'd be tempted to implement that in taskcluster/taskgraph/transforms/tests.py rather than a new file, and to set taskdesc['coalesce-name'] based on the unique attributes of the task (test platform, test name, chunk number, I think?).  The docker-worker payload builder already takes care of translating this, and it wouldn't be hard to add similar functionality (maybe by using a helper function) to the generic-worker payload builder.

I think that the coalescer configuration is going to get *huge* though -- there are thousands of distinct test tasks.  Maybe the config handling [1] needs to be upgraded to support some kind of pattern-matching on coalesce keys.  Maybe that's by regex?  Then something like

  r'tests:[^:]*:macosx64/[:^:]*:.*': {size: 5, age: 900},  # all on limited hardware
  r'tests:[^:]*:[^:]*:talos-.*': {size: 10, age: 1200},    # also on limited hardware
  r'tests:[^:]*:.*': {size: 40, age: 3600},                # in EC2 - give enough time for provisioning to catch up

of course, those patterns overlap, so you'd need some kind of precedence.  I'm also a little uncomfortable putting such verbose stuff into a config file buried in an app repo. Maybe it could be stored in a DB and editable via a UI, sort of like balrog rules.  But that's a project for later.

[1] https://github.com/mozilla/tc-coalesce/blob/ca5921e91eba52e16baee2cb3b003c45d66868a9/taskclustercoalesce/web.py#L127
Flags: needinfo?(dustin)
Attached patch bug1382204_gecko_v1.patch (obsolete) — Splinter Review
Hey Dustin, Jake,

This patch enables coalescing on all tests, and disables on all builds, for scm level > 1 (i.e. everything except try, I believe).

It won't work without changes to the coalescing service, since currently coalescing keys are predefined[1].

However, I think if the coalescing service is updated to set a default threshold for non-matching keys, we'd still have the ability to special-case threshold on particular tasks.

I favoured a simple implementation over something that might come at a higher maintenance/support cost.

Pete
Assignee: mtabara → pmoore
Status: REOPENED → ASSIGNED
Attachment #8901213 - Flags: review?(dustin)
Attachment #8901213 - Flags: feedback?(jwatkins)
Comment on attachment 8901213 [details] [diff] [review]
bug1382204_gecko_v1.patch

Review of attachment 8901213 [details] [diff] [review]:
-----------------------------------------------------------------

The hashing thing is kind of sad, and I expect we will want to do pattern-based configuration someday.

Tests have a good bit more information about what they're running -- and this transform only applies to tests.  Also, @jonasfj can correct me but I think the only hard limit is the total routing key length (255).  We chose 22-character identifiers so they could comfortably contain ten identifiers in a routing key.

It looks like the longest test platform is android-4.3-arm7-api-15-gradle/opt (34) and the longest test name is talos-perf-reftest-singletons-stylo (35).

I think we could do one of two things:
 - just use the longer identifiers and call it OK; or
 - check for identifiers that are too long and error out, then fix the errors, maybe by adding some kind of shortening function or by specifying shorter names in the task descriptions

I prefer the first option.

If that's the case, then this transform probably belongs in taskcluster/taskgraph/transforms/tests.py.  As it's written here, it only gets task descriptions as input, rather than test descriptions or job descriptions, so it's a little difficult to reverse-engineer a useful identifier.

::: taskcluster/taskgraph/transforms/coalesce.py
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +"""
> +Transform the push-apk kind into an actual task description.
> +"""

need to update this :)

@@ +13,5 @@
> +# This transform enables coalescing, setting the name to be equal to the the
> +# first 20 chars of the sha256 of the task name (label). The obfusication is
> +# simply used to keep the name short, unique, and alphanumeric, since it is
> +# used in AMQP routing keys. Disadvantage: will make it difficult (read
> +# "impossible") to set per-task threshold limits in coalescing service.

this should be a docstring
Attachment #8901213 - Flags: review?(dustin)
(In reply to Dustin J. Mitchell [:dustin] from comment #29)
> I'm also a little uncomfortable putting such verbose stuff into a config file buried in an app repo.

Yes, me too. Currently the only buried configuration in the coalescing service are thresholds. We should pull this configuration into the task definitions themselves. Once superseding thresholds are defined inside the task definitions, this allows the coalescing service to be a general purpose service that respects the declared values, and simply looks for unique coalesce-names to work with.


(In reply to Dustin J. Mitchell [:dustin] from comment #31)
> The hashing thing is kind of sad, and I expect we will want to do pattern-based configuration someday.

This would only be useful if the coalesce service manages configuration per coalesce-name. If we make all configuration explicit in the task definition, the coalesce-name becomes just a unique reference, like a git commit SHA or a task ID. The name in itself doesn't embed any information about what the coalesced task relates to, and therefore name matching provides no value.

> Tests have a good bit more information about what they're running -- and this transform only applies to tests.

Although it has been decided to coalesce only tests for now, coalescing is something that should apply equally well to any task kind. Therefore I'm keen to introduce naming that works for arbitrary task kinds, so that coalescing additional tasks should be trivial to implement, and not require special per-kind treatment. This should also help avoid conflicting names across types, with a global namespace that isn't kind-specific, and removes the need for special casing ("if kind = 'test' { .... } else if kind = 'l10n' { .... } ...."). As above, if the name doesn't embed information about the task (like a taskId also does not) then it can serve just as a unique name, by which other queries can be made in order to retrieve domain-specific information (test name, description, platform, etc). This can be done either by querying the coalesce service (if the data hasn't yet expired, and the task has not yet been resolved) or for resolved tasks, looking at its artifacts to see the tasks it superseded.

These are the objectives I have:

1) coalescing configuration (currently just threshold settings "size" and "age") should be part of the task definition
2) coalesce-name should be short (e.g. around 20 chars) and unique across task kinds
3) the coalesce service should be dumb, and just care about partitioning tasks into coalescing lists based on a unique id, rather than having/maintaining knowledge or caring about what those tasks do
Arranged a meeting in my vidyo room (pmoore) at 17:30 CEST today (0830 PDT, 28 August) to discuss. If others following this bug are interested in this topic, feel free to join the discussion.
We (Dustin and I) agreed to proceed with non-meaningful (hash-based) coalescing keys, and also agreed to relocate per-coalescing key configuration (threshold for age/task count) in task definition rather than in coalescing service (within the url path of the supersederUrl).

This affects the components as follows:

tc-coalesce
===========
Threshold settings should be read from url path, rather than taking values from internal configuration. No internal per-coalescing key configuration required, and can be deleted.

Gecko patch (from comment 31)
=============================
Set default threshold values, which could be optionally overridden in transformations, and include the threshold values in the generated supersederUrl. For now, all tasks, regardless of project/platform etc will use default threshold values, but this can easily be adapted later.

Workers (generic worker and docker worker)
==========================================
No changes needed! The existing design was already sufficiently generic (kudos to :dustin/:dividehex on the original design).


I'll work on the changes to tc-coalesce and gecko in this bug, and attach patches for review over the next days.
Comment on attachment 8901213 [details] [diff] [review]
bug1382204_gecko_v1.patch

Review of attachment 8901213 [details] [diff] [review]:
-----------------------------------------------------------------

Losing the readable key is kind of big deal but not a show stopper.  I can understand the reasoning to limit key length and char type but I think there are other ways to do that without going completely down the obfuscation route.  The original intent (although it didn't make it into the code) was to be able to pull stats per key from the service for monitoring/alerting purposes.  So hashing to get the key name need to take place elsewhere and debugging would require some trial and error reverse engineering.  I like Dustin's suggestions.  Both seem like solid alternatives.

As you mentioned, there are no default thresholds so the service would need that added.  Right now, keys without threshold always return an empty list.  Another alternative would be to build in glob matching so we could apply thresholds to namespaces as opposed to explicit keys.  Explicit keys could still take priority over glob matching.
Attachment #8901213 - Flags: feedback?(jwatkins)
Flags: needinfo?(catlee)
Pete and I spoke over vidyo.  We're in complete agreement with the plans mentioned in comment 4.  We also discussed ways to keep the readable label identifier (or the hash relationship thereof) if we were to ever push metrics from the coalescing service to a metrics/monitoring service.
This patch move the [age,size] thresholds out of the internal configuration of the tc-coalesce service, and instead age and size have been embedded in the list endpoint.

/v1/list/<key>
  ==>
/v1/list/<age>/<size>/<key>

I considered bumping to "v2" but since the service isn't used in production, it seemed reasonable to leave it at v1. In addition, the url path is now longer, and so requests in the old format will not match the new url path, so bumping the version number isn't required for avoiding URL collisions.

Since the service isn't used in production, I also didn't keep /v1/list/<key> for backward compatibility, which meant I could entirely remove the internal THRESHOLDS configuration.
Attachment #8903213 - Flags: review?(jwatkins)
Attachment #8903213 - Flags: review?(dustin)
Now set to:

/v1/list/<int:age>/<int:size>/<key>

to make sure we get a 404 for non integers for age/size.
Attached patch bug1382204_gecko_v2.patch (obsolete) — Splinter Review
This updated patch embeds the age, size thresholds into the supersederUrl property. I've also removed 'builds.' from the generated coalesce keys.
Attachment #8901213 - Attachment is obsolete: true
Attachment #8903567 - Flags: review?(dustin)
Attachment #8903567 - Flags: feedback?(jwatkins)
Comment on attachment 8903213 [details] [review]
Github Pull Request for tc-coalesce

Jake has deployed the updated tc-coalesce service with my patch. Many thanks!

So the only remaining part now is the gecko patch.
Attachment #8903213 - Flags: review?(jwatkins)
Attachment #8903213 - Flags: review?(dustin)
Comment on attachment 8903567 [details] [diff] [review]
bug1382204_gecko_v2.patch

Review of attachment 8903567 [details] [diff] [review]:
-----------------------------------------------------------------

::: taskcluster/taskgraph/transforms/task.py
@@ +144,5 @@
>  
> +    # Coalescing provides the facility for tasks to be superseded by the same
> +    # task in a subsequent commit, if the current task backlog reaches an
> +    # explicit threshold. Both age and size thresholds need to be met in order
> +    # for coalescing to be triggered.

++
Attachment #8903567 - Flags: review?(dustin) → review+
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5428066436e7
enable coalescing on tests and disable on builds,r=dustin
I'm really sorry for the incessant reviews Dustin! But this is just a one-liner, at least. :p

Thanks!
Attachment #8905005 - Flags: review?(dustin)
Backed out for busting tests on OS X and failing flake8:

https://hg.mozilla.org/integration/mozilla-inbound/rev/6e81b409e0776a4674995a910ac3bf02084a9f02

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=5428066436e75195b347352486ea7ca9e826fb25&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable

Failure log OS X: https://treeherder.mozilla.org/logviewer.html#?job_id=128896757&repo=mozilla-inbound
[taskcluster 2017-09-06T12:41:17.065Z] Task ID: Ya39sUGTSgmR9BOa65ogcA
[taskcluster 2017-09-06T12:41:17.065Z] === Task Starting ===
[taskcluster 2017-09-06T12:41:18.199Z] goroutine 1 [running]:
[taskcluster 2017-09-06T12:41:18.199Z] runtime/debug.Stack(0x422194cc0, 0x14c3b40, 0x180fdb0)
[taskcluster 2017-09-06T12:41:18.199Z] 	/home/travis/go/src/runtime/debug/stack.go:24 +0x79
[taskcluster 2017-09-06T12:41:18.199Z] main.(*TaskRun).Run.func3(0x421d50010, 0x421cc6000, 0x421d50018)
[taskcluster 2017-09-06T12:41:18.199Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1013 +0x265
[taskcluster 2017-09-06T12:41:18.199Z] panic(0x14c3b40, 0x180fdb0)
[taskcluster 2017-09-06T12:41:18.199Z] 	/home/travis/go/src/runtime/panic.go:489 +0x2cf
[taskcluster 2017-09-06T12:41:18.199Z] main.(*SupersedeTask).Start(0x421d50228, 0x15b9f08)
[taskcluster 2017-09-06T12:41:18.199Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/supersede.go:80 +0x9fb
[taskcluster 2017-09-06T12:41:18.199Z] main.(*TaskRun).Run(0x421cc6000, 0x0)
[taskcluster 2017-09-06T12:41:18.199Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1068 +0xfe0
[taskcluster 2017-09-06T12:41:18.199Z] main.FindAndRunTask(0x12a05f200)
[taskcluster 2017-09-06T12:41:18.199Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:679 +0x654
[taskcluster 2017-09-06T12:41:18.199Z] main.RunWorker(0x0)
[taskcluster 2017-09-06T12:41:18.199Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:574 +0x59a
[taskcluster 2017-09-06T12:41:18.199Z] main.main()
[taskcluster 2017-09-06T12:41:18.199Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:337 +0x624
[taskcluster 2017-09-06T12:41:18.199Z] 
[taskcluster 2017-09-06T12:41:18.199Z] "index out of range"
[taskcluster 2017-09-06T12:41:18.199Z] runtime error: index out of range

Failure log flake8: https://treeherder.mozilla.org/logviewer.html#?job_id=128892565&repo=mozilla-inbound
TEST-UNEXPECTED-ERROR | /builds/worker/checkouts/gecko/taskcluster/taskgraph/transforms/coalesce.py:13:1 | expected 2 blank lines, found 0 (E302)
TEST-UNEXPECTED-ERROR | /builds/worker/checkouts/gecko/taskcluster/taskgraph/transforms/task.py:621:1 | expected 2 blank lines, found 1 (E302)
TEST-UNEXPECTED-ERROR | /builds/worker/checkouts/gecko/taskcluster/taskgraph/transforms/task.py:627:1 | expected 2 blank lines, found 1 (E302)
TEST-UNEXPECTED-ERROR | /builds/worker/checkouts/gecko/taskcluster/taskgraph/transforms/task.py:637:1 | expected 2 blank lines, found 1 (E302)
Flags: needinfo?(pmoore)
Attached patch bug1382204_gecko_v3.patch (obsolete) — Splinter Review
This patch is identical to last one, with some blank lines added to appease the linter. Not flagging for review since dustin already r+'d the last patch in comment 41 (and I've been sending him way to many review requests recently!).
Attachment #8903567 - Attachment is obsolete: true
Attachment #8903567 - Flags: feedback?(jwatkins)
Flags: needinfo?(pmoore)
This fixes the bug in generic-worker that didn't handle an empty supersedes list in the response of the supersedes service.

Fixed the issue, and added a test.
Attachment #8905035 - Flags: review?(dustin)
Comment on attachment 8905005 [details] [review]
Github Pull Request for taskcluster-admin

received r+ in github from dustin
Attachment #8905005 - Flags: review?(dustin)
Comment on attachment 8905035 [details] [review]
Github Pull Request for generic-worker

received r+ in github from dustin
Attachment #8905035 - Flags: review?(dustin)
Released generic-worker 10.2.2 with fix from comment 49:
https://github.com/taskcluster/generic-worker/releases/tag/v10.2.2

Rolling generic-worker 10.2.2 out to Windows beta worker types in OCC:
https://tools.taskcluster.net/groups/NEwPa-dbTmaZBpPBRVaxkA

Note, this won't roll out to any OS X workers. I'll need to manage that separately.
Attached patch bug1382204_gecko_v4.patch (obsolete) — Splinter Review
Hey Dustin,

This fixes a bug whereby no test tasks get scheduled on try with my previous version. Whoops!

The two changes (interdiff) are:
  * https://hg.mozilla.org/try/rev/ab12391493e553dc6b67d17ea809e6b9c33e9c6f
  * https://hg.mozilla.org/try/rev/61e7018869737e1dd0ab299d1dc7ee7d8d8476c2

Thanks!
Pete
Attachment #8905030 - Attachment is obsolete: true
Attachment #8906034 - Flags: review?(dustin)
Comment on attachment 8906034 [details] [diff] [review]
bug1382204_gecko_v4.patch

Review of attachment 8906034 [details] [diff] [review]:
-----------------------------------------------------------------

Shame on me for not seeing that :(
Attachment #8906034 - Flags: review?(dustin) → review+
In order to roll this fix out to all platforms, it means upgrading our gecko-t-win7-32 workers from generic-worker 8.2.0 to generic-worker 10.2.2.

The last time I attempted upgrading win7 to the latest generic-worker version, we hit these failures:
https://bugzilla.mozilla.org/show_bug.cgi?id=1373722#c7

One reason the upgrade has such an impact, is that in generic-worker 8.2.0, we configured tasks to run as an administrator user (GenericWorker user account). For generic-worker 10.2.2 we run tasks as a non-administrator as a dedicated task user.

Upgrading generic-worker on Windows 7 introduced some test failures:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5a44969023132e08ad6bf350684e4a2f5f34db1a&group_state=expanded

I am now trying one of those failed tasks again, putting the task user in the Administrators group, to see if that resolves the issue:
https://tools.taskcluster.net/groups/QAEVRNiTSJOu-4WjirweOQ/tasks/QAEVRNiTSJOu-4WjirweOQ/details

If this fixes things, I can make a patch to run known failing test suites as Administrator, and raise bugs for each task that needs to be run as an Administrator on Windows 7, to be fixed, so we can make the task user a regular non-privileged user again.

If that doesn't fix the failures, we'll have to see why.

Note, failures could also possibly be related to the task starting before other logon-associated processes have not fully completed (we can test this by introducing a delay before running task after login completes).
(In reply to Pete Moore [:pmoore][:pete] from comment #53)
> I am now trying one of those failed tasks again, putting the task user in
> the Administrators group, to see if that resolves the issue:
> https://tools.taskcluster.net/groups/QAEVRNiTSJOu-4WjirweOQ/tasks/QAEVRNiTSJOu-4WjirweOQ/details

Now with scopes fixed:
  https://tools.taskcluster.net/groups/GCRLVqFHS7aqib7Su6TtjA/tasks/GCRLVqFHS7aqib7Su6TtjA/details
(In reply to Pete Moore [:pmoore][:pete] from comment #54)

> Now with scopes fixed:
>  
> https://tools.taskcluster.net/groups/GCRLVqFHS7aqib7Su6TtjA/tasks/
> GCRLVqFHS7aqib7Su6TtjA/details

That didn't fix it, so trying to disable failing test:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=e0aa199e35150168d555ec3f87559ba936a69afe

The choice to disable test_ext_notifications in tc-M-e10s(5) was based on this screenshot:
  https://public-artifacts.taskcluster.net/GCRLVqFHS7aqib7Su6TtjA/0/public/test_info/mozilla-test-fail-screenshot_x50_pb.png
In summary, the win7 failures[1]:

Windows 7 opt:
  tc-M-e10s(5 bc1)

Windows 7 debug:
  tc-M-e10s(5 bc5) tc-M(5 bc1 bc7)

windows7-32-stylo-disabled opt:
  tc-M-e10s(5 bc2)

windows7-32-stylo-disabled debug:
  tc-M-e10s(5 bc2)

----
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=5a44969023132e08ad6bf350684e4a2f5f34db1a&group_state=expanded
It looks like there might be a fix for the tc-M[-e10s](5) issue in bug 1396362.

Patching with fix from bug 1396362 and trying again:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0a49f8aa40c0b504ef69021c879296e92d862cc&group_state=expanded
Depends on: 1398748
Summary: Enable coalescing for osx workers on inbound/autoland → Enable coalescing for scm level 1,2 test tasks
The win10 failures[1]:

Windows 10 x64 opt:
  tc-X(X)

Windows 10 x64 debug:
  tc-X(X) tc-M-e10s(5)

windows10-64-stylo-disabled opt:
  tc-M-e10s(5)

windows10-64-stylo-disabled debug:
  tc-M-e10s(5)

----
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=173fe3b37004f9d7f67add716b5ef43af73aef79&group_state=expanded
I'm going to make this bug just about rolling out coalescing on macOS and win10 gpu, and deal with the remaining win7/win10 in a separate bug, in order that the windows issues don't block the roll out of coalescing on macOS worker pool.
Summary: Enable coalescing for scm level 1,2 test tasks → Enable coalescing for scm level 1,2 test tasks on macOS and win10 gpu
This patch upgrades windows worker types that experienced no test failures due to upgrade.
Attachment #8907448 - Flags: review?(rthijssen)
Attachment #8907448 - Flags: review?(mcornmesser)
Attachment #8907448 - Flags: review?(rthijssen) → review+
Note, I downloaded /data/repos/EXEs/generic-worker-v10.2.2-darwin-amd64 onto releng-puppet2.srv.releng.scl3.mozilla.com earlier today, and double checked the md5 was correct:

> [root@releng-puppet2.srv.releng.scl3.mozilla.com EXEs]# ls -ltr generic-worker-v*
> -rw-r--r-- 1 puppetsync puppetsync 11632208 25. Mär 08:29 generic-worker-v8.0.1-darwin-amd64
> -rw-r--r-- 1 puppetsync puppetsync 11810992  4. Apr 05:43 generic-worker-v8.1.1-darwin-amd64
> -rw-r--r-- 1 puppetsync puppetsync 11477792  4. Mai 07:05 generic-worker-v8.5.0-darwin-amd64
> -rw-r--r-- 1 puppetsync puppetsync 12476624  4. Aug 21:31 generic-worker-v10.2.0-darwin-amd64
> -rw-r--r-- 1 puppetsync puppetsync 12520912 13. Sep 01:54 generic-worker-v10.2.2-darwin-amd64

> [pmoore@releng-puppet2.srv.releng.scl3.mozilla.com ~]$ md5sum /data/repos/EXEs/generic-worker-v10.2.2-darwin-amd64
> 2c94a4d27bfee22c6dbc92190500ab51  /data/repos/EXEs/generic-worker-v10.2.2-darwin-amd64
Attachment #8907479 - Flags: review?(dustin)
See Also: → 1399401
Hey Dustin, me again!

This is the same as the previous gecko patch, with a minor change to taskcluster/taskgraph/transforms/coalesce.py which now has a blacklist of worker types that don't yet support coalescing (see the if statement that references bug 1399401).

Thanks!
Attachment #8906034 - Attachment is obsolete: true
Attachment #8907514 - Flags: review?(dustin)
Removing dependency, as win7 is now out-of-scope for this bug.
No longer depends on: 1398748
Attachment #8894537 - Flags: checked-in+
Attachment #8907448 - Flags: checked-in+
Attachment #8905035 - Flags: checked-in+
Attachment #8903213 - Flags: checked-in+
Dustin,

Note, I've patched taskcluster-admin locally with *both* the old and new coalesce scopes:

      'queue:route:coalesce.v1.builds.<project>.*',
      'queue:route:coalesce.v1.<project>.*',

and then ran ./tcadmin make-gecko-branch-role against the various projects to make sure they all had both scopes. For projects where scopes might have got removed (like esr52) I manually patched the roles, to make sure I don't break anything.

Once this bug lands, we can run it manually again with an unpatched version (i.e. the version that landed in attachment 8905005 [details] [review]) to remove the old scopes that are no longer needed.

I wanted to enable both old and new in order to serve the transition.
Attachment #8905005 - Flags: checked-in+
Awesome.  An upstream push of tc-admin with both scopes included would not be a problem.
Attachment #8907479 - Flags: review?(dustin) → review+
Comment on attachment 8907514 [details] [diff] [review]
Enable coalescing on macOS/win10-gpu tests and disable on linux builds

Review of attachment 8907514 [details] [diff] [review]:
-----------------------------------------------------------------

::: taskcluster/taskgraph/transforms/coalesce.py
@@ +33,5 @@
> +            #   https://bugzilla.mozilla.org/show_bug.cgi?id=1399401
> +            'aws-provisioner-v1/gecko-t-win7-32',
> +            'aws-provisioner-v1/gecko-t-win7-32-gpu',
> +            'aws-provisioner-v1/gecko-t-win10-64',
> +        ]:

This was the change since last r? (for my future reference)
Attachment #8907514 - Flags: review?(dustin) → review+
Comment on attachment 8907479 [details] [diff] [review]
Upgrade generic-worker from 10.2.0 to 10.2.2 and add metadata

This is checked in, and merged from default -> production.

The change looks to be ok, e.g. see a completed task:

  * https://tools.taskcluster.net/groups/ZOi_NzkbRsGqOsvImUJZuQ/tasks/IbJNaq9dQw6WVvfrDuXSbQ/details
Attachment #8907479 - Flags: checked-in+
Summary: Enable coalescing for scm level 1,2 test tasks on macOS and win10 gpu → Enable coalescing for scm level 2,3 test tasks on macOS and win10 gpu
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/641b46e5f932
enable coalescing on macOS/win10-gpu tests and disable on linux builds,r=dustin
Attachment #8907448 - Flags: review?(mcornmesser) → review+
Summary: Enable coalescing for scm level 2,3 test tasks on macOS and win10 gpu → Enable coalescing for scm level 2,3 test tasks on macOS and win10 gpu and linux
https://hg.mozilla.org/mozilla-central/rev/641b46e5f932
Status: ASSIGNED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
I'm adding the old scope back to tcadmin, because it's not clear that it's now unused.  I've marked it deprecated, pointing to this bug.
(it's not clear because the scope is still present for all repos)
(In reply to Dustin J. Mitchell [:dustin] from comment #70)
> I'm adding the old scope back to tcadmin, because it's not clear that it's
> now unused.  I've marked it deprecated, pointing to this bug.

You did the right thing - it looks like it is still used in e.g. esr52:

https://hg.mozilla.org/releases/mozilla-esr52/file/f9df5238dca13e40b8128faba317df25e2f69249/taskcluster/taskgraph/transforms/task.py#l330
You need to log in before you can comment on or make changes to this bug.