Closed Bug 1562991 Opened 5 years ago Closed 5 years ago

Slave loan request for nalexander/tarek/rwood

Categories

(Infrastructure & Operations :: RelOps: Hardware, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nalexander, Assigned: markco)

References

Details

(Whiteboard: [buildduty][capacity][buildslaves][loaner])

Hi folks, me, :tarek, and :rwood are going to be working on a bunch of Raptor-esque changes in Q3. To support that work, it will really help to have either dedicated hardware or a way to "jump the queue" for testing jobs. The number of such testing jobs is strictly "human scale", i.e., we'd be pushing a handful of jobs manually to process ASAP.

See https://bugzilla.mozilla.org/show_bug.cgi?id=1324744#c14 for a bit more context.

The most important platform is Win 10 hardware, which I guess is the "-qr" or "-ux" machine configuration. A loaner might make the most sense here.

We'll also want to test on Android devices managed by bitbar. Here a loaner probably doesn't make sense, unless that cow path already exists. Can we get some ability to jump the queue? My colleagues are reporting 2-3 DAY turn-around times from try for these devices, which sounds like a bug (see Bug 1562988), but even if it's fixed we'll want some focused priority so that we can make progress quickly.

(In reply to Nick Alexander :nalexander [he/him] from comment #0)

Hi folks, me, :tarek, and :rwood are going to be working on a bunch of Raptor-esque changes in Q3. To support that work, it will really help to have either dedicated hardware or a way to "jump the queue" for testing jobs. The number of such testing jobs is strictly "human scale", i.e., we'd be pushing a handful of jobs manually to process ASAP.

See https://bugzilla.mozilla.org/show_bug.cgi?id=1324744#c14 for a bit more context.

The most important platform is Win 10 hardware, which I guess is the "-qr" or "-ux" machine configuration. A loaner might make the most sense here.

rwood: can you confirm exactly the configuration the browsertime MVP is to run against?

We'll also want to test on Android devices managed by bitbar. Here a loaner probably doesn't make sense, unless that cow path already exists. Can we get some ability to jump the queue? My colleagues are reporting 2-3 DAY turn-around times from try for these devices, which sounds like a bug (see Bug 1562988), but even if it's fixed we'll want some focused priority so that we can make progress quickly.

jmaher: can you suggest what can be done here?

Flags: needinfo?(rwood)
Flags: needinfo?(jmaher)

the regular windows hardware machines would be ideal. Do you want to push to try using a special hardware type? we could have a mini pool that you could use.

Otherwise, with loaners you would remote desktop to the machine and run the job manually, which is not always straightforward. I have done that a lot before and once set up it is ok.

If you want 10 machines set up with a different hardware ID that taskcluster would use given that your try pushes have a custom hack in taskcluster (here is an example for testing windows 1903 on the VMs: https://hg.mozilla.org/try/rev/e45cad8128b62308ad4a7196c851e8f66ec0cd97 ), ask for that instead.

Otherwise, let's get a machine reserved for each of you and we can set up scripts that reduce the typing needed for testing new builds/tooling.

Flags: needinfo?(jmaher)

(In reply to Nick Alexander :nalexander [he/him] from comment #1)

rwood: can you confirm exactly the configuration the browsertime MVP is to run against?

Yes, the browsertime MVP (Bug 1561939) is to be run on just one platform (Win 10) on the same hardware that Raptor currently runs now (using the same mitmproxy/mozbase package and recordings etc). We basically want to get as close to the existing Raptor setup in CI as we can except with Browsertime, so that the data will be as comparable as possible. Thanks!

Flags: needinfo?(rwood)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #2)

the regular windows hardware machines would be ideal. Do you want to push to try using a special hardware type? we could have a mini pool that you could use.

Otherwise, with loaners you would remote desktop to the machine and run the job manually, which is not always straightforward. I have done that a lot before and once set up it is ok.

If you want 10 machines set up with a different hardware ID that taskcluster would use given that your try pushes have a custom hack in taskcluster (here is an example for testing windows 1903 on the VMs: https://hg.mozilla.org/try/rev/e45cad8128b62308ad4a7196c851e8f66ec0cd97 ), ask for that instead.

Otherwise, let's get a machine reserved for each of you and we can set up scripts that reduce the typing needed for testing new builds/tooling.

Hi Joel -- thanks for being so responsive. As per https://bugzilla.mozilla.org/show_bug.cgi?id=1562991#c3, we want exactly the same Win 10 configuration as Raptor currently runs. So hopefully no difficult imaging/configuration changes are needed.

In terms of loaners vs. a small pool, my personal preference would be a tiny pool (like 2 machines, say) rather than loaners. If we were "guaranteed" to have jobs start in minutes, then it's a lot less hassle to push things to try for testing (which is what rwood and other experienced folks are basically always doing) than to have to work through the non-standard "loaner" configuration. That is, it's the slow decision task -> job completion that really drags down the progress loop. (Although iterating at a thing that could be done easily with interactive access can be frustrating too.) It seems like it would be much cheaper and easier to have a "Fast Pass" system where designated developers can hop the queue for a short time and trust the honor system for them to keep the number of such jobs down rather than stand up a tiny pool for focused effort of this sort, but that's your call to make.

Thanks again!

Flags: needinfo?(jmaher)

we can create a small pool for you and the only way the jobs won't start is if you have too many jobs scheduled (either yourself or others).

:markco, what would it take to have 5 windows10 hardware machines pulled from the main pool and set up with a different taskcluster token (instead of t-win10-64-hw we would reference t-win10-64-hw-dev from taskcluster: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/tests.py#58)?

Flags: needinfo?(jmaher) → needinfo?(mcornmesser)

:markco, what would it take to have 5 windows10 hardware machines pulled from the main pool and set up with a different taskcluster token (instead of t-win10-64-hw we would reference t-win10-64-hw-dev from taskcluster: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/tests.py#58)?

It would require the creation of a new role within Puppet, and the creation of a new workerType and clientID within taskcluster. The new workerType would be gecko-t-win10-64-hw-dev. However, I am not clear on how t-win10-64-hw is pointed at the gecko-t-win10-64-hw taskcluster workerType, so there may be more that needs to be done there to add a new workerType.

What is the timeline for it?

Flags: needinfo?(mcornmesser)

timeline is relatively soon, possibly once the bitbar laptops are moved over next week?

If there are others who are better suited to do this or other information needed, please let me know.

:markco- when they are done with the dev work, I would like to use the pool for testing 1903 updates to our hardware :) That timeline can be more flexible- goal would be before end of the year have all our windows hardware running 1903 :)

Depends on: 1563314

The part I wasn't clear on became clear after looking at the link in comment 6.

This is relatively straightforward. I should have the pool up by next week.

we can either require all try pushes to have code to use the specific hardware type, or we can land it in tree- thanks for getting the new bug filed and an estimate of when it will be online :)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #9)

we can either require all try pushes to have code to use the specific hardware type, or we can land it in tree- thanks for getting the new bug filed and an estimate of when it will be online :)

A little code required in try pushes seems best here. Thanks, all.
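For readers following along, the "little code in try pushes" idea can be sketched roughly as below. This is only an illustration, not the actual in-tree code at taskcluster/taskgraph/transforms/tests.py: the table name HARDWARE_WORKER_TYPES, the function pick_worker_type, and the use_dev_pool flag are all hypothetical. Only the worker type names gecko-t-win10-64-hw and gecko-t-win10-64-hw-dev, and the releng-hardware provisioner prefix, come from this bug.

```python
# Hypothetical sketch: route a Windows 10 hardware test to the dev pool.
# The real mapping lives in taskcluster/taskgraph/transforms/tests.py and
# may be shaped differently; the names below are illustrative assumptions.

HARDWARE_WORKER_TYPES = {
    # test platform prefix -> "provisioner/workerType"
    "windows10-64": "releng-hardware/gecko-t-win10-64-hw",
}

DEV_SUFFIX = "-dev"


def pick_worker_type(test_platform, use_dev_pool=False):
    """Return the worker type for a test platform such as 'windows10-64/opt'.

    When use_dev_pool is True (the try-push-only hack), the dedicated
    '-dev' pool is selected instead of the main hardware pool.
    """
    base_platform = test_platform.split("/")[0]
    worker_type = HARDWARE_WORKER_TYPES[base_platform]
    if use_dev_pool:
        worker_type += DEV_SUFFIX
    return worker_type


if __name__ == "__main__":
    print(pick_worker_type("windows10-64/opt"))
    print(pick_worker_type("windows10-64/opt", use_dev_pool=True))
```

Keeping the switch in the try push itself (rather than landing it in tree) means the main pool's routing is untouched for everyone else.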

Component: CIDuty → RelOps: Hardware
QA Contact: dlabici
Assignee: nobody → mcornmesser

This is mostly ready to deploy. Is VNC or RDP access needed?

I understand try server is the intended use, :nalexander, vnc/rdp access?

Flags: needinfo?(nalexander)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #12)

I understand try server is the intended use, :nalexander, vnc/rdp access?

Try server should be fine for now. Thanks, Mark, thanks, Joel.

Flags: needinfo?(nalexander)

5 nodes will be deployed tonight.

t-w1064-ms-281.wintest.releng.mdc1.mozilla.com 10.49.40.182
t-w1064-ms-282.wintest.releng.mdc1.mozilla.com 10.49.40.183
t-w1064-ms-316.wintest.releng.mdc2.mozilla.com 10.51.40.2
t-w1064-ms-317.wintest.releng.mdc2.mozilla.com 10.51.40.3
t-w1064-ms-318.wintest.releng.mdc2.mozilla.com 10.51.40.4

This has the standard loaner root (administrator in this case) and vnc password. RDP is not set up, but it probably won't be very useful with Generic-worker since the task user gets a new password for each task.

The configuration will mirror the production configuration. However, if need be I can lock the configuration or make changes unique to this pool. Let me know if you all have any questions, or if there is anything else I can do to help.
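As a convenience for anyone scripting against this pool, the host list above can be grouped by datacenter with a few lines of Python. This is a minimal sketch: the function name group_by_datacenter is hypothetical, and it assumes the fourth DNS label (mdc1, mdc2) identifies the datacenter, which is inferred from the hostnames rather than stated in this bug.

```python
from collections import defaultdict

# Host listing copied from the comment above: "<fqdn> <ip>" per line.
HOSTS = """\
t-w1064-ms-281.wintest.releng.mdc1.mozilla.com 10.49.40.182
t-w1064-ms-282.wintest.releng.mdc1.mozilla.com 10.49.40.183
t-w1064-ms-316.wintest.releng.mdc2.mozilla.com 10.51.40.2
t-w1064-ms-317.wintest.releng.mdc2.mozilla.com 10.51.40.3
t-w1064-ms-318.wintest.releng.mdc2.mozilla.com 10.51.40.4
"""


def group_by_datacenter(listing):
    """Group (short hostname, ip) pairs by datacenter label.

    Assumption: the fourth dot-separated label of the FQDN names the
    datacenter, e.g. 'mdc1' in t-w1064-ms-281.wintest.releng.mdc1...
    """
    groups = defaultdict(list)
    for line in listing.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fqdn, ip = line.split()
        labels = fqdn.split(".")
        groups[labels[3]].append((labels[0], ip))
    return dict(groups)


if __name__ == "__main__":
    for datacenter, hosts in sorted(group_by_datacenter(HOSTS).items()):
        print(datacenter, hosts)
```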

Also the Generic-worker workerType is gecko-t-win10-64-hw-dev.

(In reply to Mark Cornmesser [:markco] from comment #15)

Also the Generic-worker workerType is gecko-t-win10-64-hw-dev.

Awesome possum! I will try them out as I iterate on my patch stack. Thanks, Markco! Thanks, Joel, for suggesting a good path!

(In reply to Nick Alexander :nalexander [he/him] from comment #16)

(In reply to Mark Cornmesser [:markco] from comment #15)

Also the Generic-worker workerType is gecko-t-win10-64-hw-dev.

Awesome possum! I will try them out as I iterate on my patch stack. Thanks, Markco! Thanks, Joel, for suggesting a good path!

And look at that:

[taskcluster 2019-07-09T17:26:08.938Z] Worker Type (releng-hardware/gecko-t-win10-64-hw-dev) settings:

(From https://treeherder.mozilla.org/#/jobs?repo=try&revision=fd7781af352f525c70e78159e2be6f155f53fb16&selectedJob=255524867.)

Thanks, all!
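To check from a task log that a job really landed on the dev pool, one could pull the provisioner and workerType out of the "Worker Type (...) settings:" line quoted above. The sketch below assumes that line format stays as shown in this bug; the regular expression and the function name parse_worker_type are illustrative and not part of generic-worker or Treeherder.

```python
import re

# Matches e.g. "Worker Type (releng-hardware/gecko-t-win10-64-hw-dev) settings:"
# The format is taken from the log line quoted in this bug and is assumed stable.
WORKER_TYPE_LINE = re.compile(
    r"Worker Type \((?P<provisioner>[^/\s)]+)/(?P<worker_type>[^)\s]+)\) settings:"
)


def parse_worker_type(log_line):
    """Return (provisioner, worker_type) from a log line, or None if absent."""
    match = WORKER_TYPE_LINE.search(log_line)
    if match is None:
        return None
    return match.group("provisioner"), match.group("worker_type")


if __name__ == "__main__":
    line = (
        "[taskcluster 2019-07-09T17:26:08.938Z] Worker Type "
        "(releng-hardware/gecko-t-win10-64-hw-dev) settings:"
    )
    print(parse_worker_type(line))
```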

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED