Slave loan request for nalexander/tarek/rwood
Categories
(Infrastructure & Operations :: RelOps: Hardware, task)
Tracking
(Not tracked)
People
(Reporter: nalexander, Assigned: markco)
References
Details
(Whiteboard: [buildduty][capacity][buildslaves][loaner])
Hi folks, me, :tarek, and :rwood are going to be working on a bunch of Raptor-esque changes in Q3. To support that work, it will really help to have either dedicated hardware or a way to "jump the queue" for testing jobs. The number of such testing jobs is strictly "human scale", i.e., we'd be pushing a handful of jobs manually to process ASAP.
See https://bugzilla.mozilla.org/show_bug.cgi?id=1324744#c14 for a bit more context.
The most important platform is Win 10 hardware, which I guess is the "-qr" or "-ux" machine configuration. A loaner might make the most sense here.
We'll also want to test on Android devices managed by bitbar. Here a loaner probably doesn't make sense, unless that cow path already exists. Can we get some ability to jump the queue? My colleagues are reporting 2-3 DAY turn-around times from try for these devices, which sounds like a bug (see Bug 1562988), but even if it's fixed we'll want some focused priority so that we can make progress quickly.
Reporter
Comment 1•5 years ago
(In reply to Nick Alexander :nalexander [he/him] from comment #0)
> Hi folks, me, :tarek, and :rwood are going to be working on a bunch of Raptor-esque changes in Q3. To support that work, it will really help to have either dedicated hardware or a way to "jump the queue" for testing jobs. The number of such testing jobs is strictly "human scale", i.e., we'd be pushing a handful of jobs manually to process ASAP.
> See https://bugzilla.mozilla.org/show_bug.cgi?id=1324744#c14 for a bit more context.
> The most important platform is Win 10 hardware, which I guess is the "-qr" or "-ux" machine configuration. A loaner might make the most sense here.
rwood: can you confirm exactly the configuration the browsertime MVP is to run against?
> We'll also want to test on Android devices managed by bitbar. Here a loaner probably doesn't make sense, unless that cow path already exists. Can we get some ability to jump the queue? My colleagues are reporting 2-3 DAY turn-around times from try for these devices, which sounds like a bug (see Bug 1562988), but even if it's fixed we'll want some focused priority so that we can make progress quickly.
jmaher: can you suggest what can be done here?
Comment 2•5 years ago
the regular windows hardware machines would be ideal. Do you want to push to try using a special hardware type? we could have a mini pool that you could use.
Otherwise, with loaners you would remote desktop to the machine and run the job manually, which is not always straightforward. I have done that a lot before, and once set up it is OK.
If you would rather have 10 machines set up with a different hardware ID, which taskcluster would use as long as your try pushes carry a custom hack in taskcluster (here is an example for testing windows 1903 on the VMs: https://hg.mozilla.org/try/rev/e45cad8128b62308ad4a7196c851e8f66ec0cd97 ), ask for that instead.
Otherwise, let's get a machine reserved for each of you and we can set up scripts that reduce the typing needed for testing new builds/tooling.
Comment 3•5 years ago
(In reply to Nick Alexander :nalexander [he/him] from comment #1)
> rwood: can you confirm exactly the configuration the browsertime MVP is to run against?
Yes, the browsertime MVP (Bug 1561939) is to be run on just one platform (Win 10) on the same hardware that Raptor currently runs now (using the same mitmproxy/mozbase package and recordings etc). We basically want to get as close to the existing Raptor setup in CI as we can except with Browsertime, so that the data will be as comparable as possible. Thanks!
Reporter
Comment 4•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #2)
> the regular windows hardware machines would be ideal. Do you want to push to try using a special hardware type? we could have a mini pool that you could use.
> Otherwise the loaners would be remote desktop to the machine, run the job manually which is not always straightforward. I have done that a lot before and once setup it is ok.
> If you want 10 machines setup with a different hardware ID that taskcluster would use given that your try pushes have a custom hack in taskcluster (here is an example for testing windows 1903 on the VMs: https://hg.mozilla.org/try/rev/e45cad8128b62308ad4a7196c851e8f66ec0cd97 ), ask for that instead.
> Otherwise, lets get a machine reserved for each of you and we can set up scripts that reduce the typing needed for testing new builds/tooling.
Hi Joel -- thanks for being so responsive. As per https://bugzilla.mozilla.org/show_bug.cgi?id=1562991#c3, we want exactly the same Win 10 configuration as Raptor currently runs. So hopefully no difficult imaging/configuration changes are needed.
In terms of loaners vs. a small pool, my personal preference would be a tiny pool (like 2 machines, say) rather than loaners. If we were "guaranteed" to have jobs start in minutes, then it's a lot less hassle to push things to try for testing (which is what rwood and other experienced folks are basically always doing) than to have to work through the non-standard "loaner" configuration. That is, it's the slow decision task -> job completion that really drags down the progress loop. (Although iterating on something that could be done easily with interactive access can be frustrating too.) It seems like it would be much cheaper and easier to have a "Fast Pass" system, where designated developers can hop the queue for a short time and are trusted on the honor system to keep the number of such jobs down, than to stand up a tiny pool for focused effort of this sort, but that's your call to make.
Thanks again!
Comment 5•5 years ago
we can create a small pool for you and the only way the jobs won't start is if you have too many jobs scheduled (either yourself or others).
:markco, what would it take to have 5 windows10 hardware machines pulled from the main pool and set up with a different taskcluster token (instead of t-win10-64-hw we would reference t-win10-64-hw-dev from taskcluster: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/tests.py#58)?
Assignee
Comment 6•5 years ago
> :markco, what would it take to have 5 windows10 hardware machines pulled from the main pool and setup with a different taskcluster token (instead of t-win10-64-hw we would reference t-win10-64-hw-dev from taskcluster: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/tests.py#58)
It would require the creation of a new role within Puppet, and the creation of a new workerType and clientID within taskcluster. The new workerType would be gecko-t-win10-64-hw-dev. However, I am not clear on how t-win10-64-hw is pointed at the gecko-t-win10-64-hw taskcluster workerType, so there may be more that needs to be done there to add a new workerType.
What is the timeline for it?
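For readers following along, the relationship between the alias and the workerType described in this comment can be pictured with a minimal sketch. This is not the in-tree code from tests.py: the table `WORKER_TYPE_BY_ALIAS`, the function `resolve_worker_type`, and the `use_dev_pool` switch are hypothetical names chosen for illustration. Only the two alias names and the two workerType names come from this bug.

```python
# Hypothetical illustration only; the real mapping lives in
# taskcluster/taskgraph/transforms/tests.py and may look quite different.
WORKER_TYPE_BY_ALIAS = {
    "t-win10-64-hw": "gecko-t-win10-64-hw",          # main hardware pool
    "t-win10-64-hw-dev": "gecko-t-win10-64-hw-dev",  # proposed dev pool
}


def resolve_worker_type(alias: str, use_dev_pool: bool = False) -> str:
    """Return the workerType for an alias, optionally redirecting to the dev pool."""
    if use_dev_pool and not alias.endswith("-dev"):
        dev_alias = alias + "-dev"
        # Keep the regular alias when no dev pool is defined for it.
        if dev_alias in WORKER_TYPE_BY_ALIAS:
            alias = dev_alias
    return WORKER_TYPE_BY_ALIAS[alias]
```

A try push that opts in would then resolve `t-win10-64-hw` to the dev workerType, while every other push keeps using the main pool.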
Comment 7•5 years ago
timeline is relatively soon, possibly once the bitbar laptops are moved over next week?
If there are others who are better suited to do this or other information needed, please let me know.
:markco- when they are done with the dev work, I would like to use the pool for testing 1903 updates to our hardware :) That timeline can be more flexible- goal would be before end of the year have all our windows hardware running 1903 :)
Assignee
Comment 8•5 years ago
The part I wasn't clear on became clear after looking at the link in comment 6.
This is relatively straightforward. I should have the pool up by next week.
Comment 9•5 years ago
we can either require all try pushes to have code to use the specific hardware type, or we can land it in tree- thanks for getting the new bug filed and an estimate of when it will be online :)
Reporter
Comment 10•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #9)
> we can either require all try pushes to have code to use the specific hardware type, or we can land it in tree- thanks for getting the new bug filed and an estimate of when it will be online :)
A little code required in try pushes seems best here. Thanks, all.
Assignee
Comment 11•5 years ago
This is mostly ready to deploy. Is VNC or RDP access needed?
Comment 12•5 years ago
I understand try server is the intended use. :nalexander, do you need VNC/RDP access?
Reporter
Comment 13•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #12)
> I understand try server is the intended use, :nalexander, vnc/rdp access?
Try server should be fine for now. Thanks, Mark, thanks, Joel.
Assignee
Comment 14•5 years ago
5 nodes will be deployed tonight.
t-w1064-ms-281.wintest.releng.mdc1.mozilla.com 10.49.40.182
t-w1064-ms-282.wintest.releng.mdc1.mozilla.com 10.49.40.183
t-w1064-ms-316.wintest.releng.mdc2.mozilla.com 10.51.40.2
t-w1064-ms-317.wintest.releng.mdc2.mozilla.com 10.51.40.3
t-w1064-ms-318.wintest.releng.mdc2.mozilla.com 10.51.40.4
These nodes have the standard loaner root (administrator in this case) and VNC password. RDP is not set up, but it probably won't be very useful with Generic-worker since the task user gets a new password for each task.
The configuration will mirror the production configuration. However, if need be I can lock the configuration or make changes unique to this pool. Let me know if you all have any questions, or if there is anything else I can do to help.
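As a convenience for anyone scripting against this pool, here is a minimal sketch that parses the node list above and groups the hosts by datacenter. The names `NODES` and `group_by_datacenter` are hypothetical, and the code assumes the datacenter is always the fourth dot-separated label of the hostname (mdc1 or mdc2), which holds for these five hosts but is not a documented convention.

```python
import ipaddress
from collections import defaultdict

# Hostnames and addresses copied from the list in this comment.
NODES = """\
t-w1064-ms-281.wintest.releng.mdc1.mozilla.com 10.49.40.182
t-w1064-ms-282.wintest.releng.mdc1.mozilla.com 10.49.40.183
t-w1064-ms-316.wintest.releng.mdc2.mozilla.com 10.51.40.2
t-w1064-ms-317.wintest.releng.mdc2.mozilla.com 10.51.40.3
t-w1064-ms-318.wintest.releng.mdc2.mozilla.com 10.51.40.4
"""


def group_by_datacenter(inventory: str) -> dict:
    """Group hostnames by datacenter label, validating each IP address."""
    groups = defaultdict(list)
    for line in inventory.splitlines():
        if not line.strip():
            continue
        hostname, address = line.split()
        ipaddress.ip_address(address)  # raises ValueError if malformed
        # Assumption: the fourth dotted label names the datacenter.
        datacenter = hostname.split(".")[3]
        groups[datacenter].append(hostname)
    return dict(groups)


if __name__ == "__main__":
    for datacenter, hosts in sorted(group_by_datacenter(NODES).items()):
        print(datacenter, len(hosts))
```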
Assignee
Comment 15•5 years ago
Also the Generic-worker workerType is gecko-t-win10-64-hw-dev.
Reporter
Comment 16•5 years ago
(In reply to Mark Cornmesser [:markco] from comment #15)
> Also the Generic-worker workerType is gecko-t-win10-64-hw-dev.
Awesome possum! I will try them out as I iterate on my patch stack. Thanks, Markco! Thanks, Joel, for suggesting a good path!
Reporter
Comment 17•5 years ago
(In reply to Nick Alexander :nalexander [he/him] from comment #16)
> (In reply to Mark Cornmesser [:markco] from comment #15)
> > Also the Generic-worker workerType is gecko-t-win10-64-hw-dev.
> Awesome possum! I will try them out as I iterate on my patch stack. Thanks, Markco! Thanks, Joel, for suggesting a good path!
And look at that:
[taskcluster 2019-07-09T17:26:08.938Z] Worker Type (releng-hardware/gecko-t-win10-64-hw-dev) settings:
Thanks, all!
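To confirm from a task log that a job landed on the dev pool, a check along these lines might help. It is only a sketch: the pattern is derived from the single log line quoted above, the names `WORKER_TYPE_RE` and `parse_worker_type` are hypothetical, and reading the segment before the slash as the provisioner is an assumption.

```python
import re

# The log line quoted in this comment.
LOG_LINE = (
    "[taskcluster 2019-07-09T17:26:08.938Z] Worker Type "
    "(releng-hardware/gecko-t-win10-64-hw-dev) settings:"
)

# Assumed shape: "Worker Type (<provisioner>/<workerType>)".
WORKER_TYPE_RE = re.compile(
    r"Worker Type \((?P<provisioner>[^/()]+)/(?P<worker_type>[^/()]+)\)"
)


def parse_worker_type(line: str):
    """Return (provisioner, worker_type) from a log line, or None if absent."""
    match = WORKER_TYPE_RE.search(line)
    if match is None:
        return None
    return match.group("provisioner"), match.group("worker_type")


if __name__ == "__main__":
    print(parse_worker_type(LOG_LINE))
```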