Re-balance some Win64 production machines to the try pool

RESOLVED FIXED

Status

Release Engineering
Buildduty
4 years ago
4 years ago

People

(Reporter: armenzg, Assigned: armenzg)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [buildduty][capacity][buildslaves])

Attachments

(1 attachment)

(Assignee)

Description

4 years ago
We've been having bad wait times on the try server for Windows builds for a while.

# of win64 build/try production machines: 111/37=3 [1]

If I look at the wait times for the last month I can see how many jobs each pool took.

Win32/64 jobs (build/try):      (5601+953)/(4777+220)=6554/4997=1.31 [2][3]

This means that, going by job counts alone, our pools should look more like this:
# of win64 build/try production machines: 85/63=1.35

Unfortunately, the wait times report hides L10n repacks and perhaps other jobs (blurry mind now) that are lower priority for developers, so those jobs cannot be accounted for.
After the hyperlinks below, see how I was trying to account for all of this with another report, but buildapi got messed up.

For now, I would like to use this ratio instead:
# of win64 build/try production machines: 100/48=2.08

I would like us to move 11 Win64 machines from the build pool to the try one (instead of 26 machines). This would be close to a 30% capacity increase for the try pool.
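
The arithmetic above can be double-checked with a short Python sketch. The machine and job counts are the ones quoted in this description; the function and variable names are only illustrative and are not part of any releng tool.

```python
# Machine counts from slavealloc [1]; job counts from the reports [2][3].
BUILD_MACHINES, TRY_MACHINES = 111, 37
BUILD_JOBS = 5601 + 953
TRY_JOBS = 4777 + 220


def ratio(build, try_):
    """Return the build/try ratio rounded to two decimals."""
    return round(build / try_, 2)


def after_move(moved):
    """Pool sizes (build, try) after moving `moved` machines from build to try."""
    return BUILD_MACHINES - moved, TRY_MACHINES + moved


current_ratio = ratio(BUILD_MACHINES, TRY_MACHINES)    # machines today
job_ratio = ratio(BUILD_JOBS, TRY_JOBS)                # jobs over the last month
proposed = after_move(11)                              # the move requested here
proposed_ratio = ratio(*proposed)
try_increase_pct = round(100 * 11 / TRY_MACHINES, 1)   # growth of the try pool

print(current_ratio, job_ratio, proposed, proposed_ratio, try_increase_pct)
```

Calling after_move(26) gives the 85/63 split that the job counts alone would suggest.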

We should gather a list of hosts for relops to move from one network to the other one.
No renaming is needed.
We would need patches for production_config.py, update statements for slavealloc, and the SSH keys on the hosts replaced.


[1] https://secure.pub.build.mozilla.org/slavealloc/ui/#silos
[2] https://secure.pub.build.mozilla.org/buildapi/reports/waittimes/trybuildpool?int_size=86400&starttime=1376712000&endtime=1379476800
[3] https://secure.pub.build.mozilla.org/buildapi/reports/builders?starttime=1376712000&endtime=1379476800

#############################

If I look at the total CPU time (SUM) for each pool, I can get an idea of how much CPU each pool used:
Win32/64 *build* pool - total CPU SUM:  2116h 13m 56s + 673h 19m 18s
Win32/64 *try* pool   - total CPU SUM:              ? + ?

total CPU SUM ratio (build/try): 2789h 33m 14s / ? = ?
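
A small Python sketch for adding up the "Nh Nm Ns" totals shown above. The helper names are hypothetical, and the try pool figures are left out because they were never retrieved.

```python
import re


def parse_hms(text):
    """Convert a string like '2116h 13m 56s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds):
    """Convert a number of seconds back to the 'Nh Nm Ns' form."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


# The two build pool values quoted above.
build_total = parse_hms("2116h 13m 56s") + parse_hms("673h 19m 18s")
print(format_hms(build_total))
```

Once the try pool totals are available, dividing the two second counts gives the build/try CPU ratio.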

For [4] and [5], I checked "platform" (do this first as it improves responsiveness) and unchecked every platform except win2k3 and win64.
WARNING: Loading these reports is terribly slow and often fails with Service Unavailable.
[4] https://secure.pub.build.mozilla.org/buildapi/reports/builders?starttime=1376712000&endtime=1379476800
[5]
All of the windows machines in scl1 are on the same VLAN (winbuild.scl1.mozilla.com), so there wouldn't be any movement or renaming required.
(Assignee)

Comment 2

4 years ago
oh cool! Unlike the Windows test infra; makes sense.

Thanks for the info!
When we move to scl3, build and try will be different VLANs for security reasons (just like they are for OSX), but in scl1, things are just all sitting on the same VLAN.
(Assignee)

Updated

4 years ago
Component: Platform Support → Buildduty
QA Contact: coop → armenzg
Whiteboard: [buildduty][capacity][buildslaves]
(Assignee)

Updated

4 years ago
Assignee: nobody → armenzg
(Assignee)

Comment 4

4 years ago
w64-ix-slave64
w64-ix-slave65
w64-ix-slave66
w64-ix-slave67
w64-ix-slave68
w64-ix-slave69
w64-ix-slave70
w64-ix-slave71
w64-ix-slave72
w64-ix-slave73
w64-ix-slave74
(Assignee)

Comment 5

4 years ago
Created attachment 808868 [details] [diff] [review]
win64.try.diff
Attachment #808868 - Flags: review?(jhopkins)
(Assignee)

Comment 6

4 years ago
I will be following (and updating) these instructions:
https://wiki.mozilla.org/ReleaseEngineering/How_To/Create_new_slaves_or_move_them_to_other_pools
https://wiki.mozilla.org/ReleaseEngineering/How_To/Adjust_SSH_keys_on_a_slave#Try
Attachment #808868 - Flags: review?(jhopkins) → review+
(Assignee)

Updated

4 years ago
Attachment #808868 - Flags: checked-in+
(Assignee)

Comment 7

4 years ago
Live on production.

Time to update slavealloc, remove the keys and clean up the hosts.
(Assignee)

Comment 8

4 years ago
I'm going to reboot w64-ix-slave64 into production and see how it behaves.
The steps to move it into the try pool were very out-of-date.
(Assignee)

Comment 9

4 years ago
I've added the remaining hosts; I will check on them tomorrow.
(Assignee)

Comment 10

4 years ago
To help us find issues:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave64
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave65
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave66
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave67
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave68
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave69
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave70
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave71
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave72
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave73
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave74
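
For reference, the list above can be regenerated with a short Python sketch. The URL pattern is copied from the links in this comment; the function name and its default arguments are only illustrative.

```python
BASE = "https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html"


def slave_health_url(name, slave_class="build", slave_type="w64-ix"):
    """Build a slave_health link using the query pattern of the links above."""
    return f"{BASE}?class={slave_class}&type={slave_type}&name={name}"


# The 11 hosts moved in this bug: w64-ix-slave64 through w64-ix-slave74.
hosts = [f"w64-ix-slave{n}" for n in range(64, 75)]
urls = [slave_health_url(host) for host in hosts]

for url in urls:
    print(url)
```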
(Assignee)

Comment 11

4 years ago
I had actually forgotten to reboot the following hosts:
w64-ix-slave65
w64-ix-slave66
w64-ix-slave67
w64-ix-slave68
w64-ix-slave69
w64-ix-slave71
w64-ix-slave73

The rm -rf /e/builds/moz2_slave takes forever.
As soon as it is done I will put them into the pool.

So far I see green runs; however, I also see some intermittent make check failures:
TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/try-w32_g-00000000000000000000/build/python/mozbuild/mozbuild/test/frontend/test_sandbox_symbols.py | line 40, test_documentation_formatting: u'List of manifest files defining WebRTC signalling tests.' != ''
TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\basic\bug710947.js | --ion-eager --ion-check-range-analysis
(Assignee)

Updated

4 years ago
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Blocks: 926730