Closed Bug 917923 Opened 8 years ago Closed 8 years ago

Re-balance some Win64 production machines to the try pool

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

References

Details

(Whiteboard: [buildduty][capacity][buildslaves])

Attachments

(1 file)

We've been having bad wait times on the try server for Windows builds for a while.

# of win64 build/try production machines: 111/37=3 [1]

If I look at the wait times for the last month I can see how many jobs each pool took.

Win32/64 jobs (build/try):      (5601+953)/(4777+220)=6554/4997=1.31 [2][3]

This means that our pool should look more like this:
# of win64 build/try production machines: 85/63=1.35

Unfortunately, the wait times report hides L10n repacks and perhaps other jobs (blurry mind now) that have lower priority for developers, so those jobs cannot be accounted for.
See below the hyperlinks for how I was trying to account for all of this with another report, but buildapi got messed up.

For now, I would like to use this ratio instead:
# of win64 build/try production machines: 100/48=2.08

I would like us to move 11 Win64 machines from the build pool to the try one (instead of 26 machines). This would be close to a 30% capacity increase for the try pool.
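
The arithmetic above can be sanity-checked with a short script. This is only a sketch: the function and variable names are made up for illustration, and the inputs are just the numbers quoted in this comment.

```python
# Numbers quoted in the comment above; all names are illustrative only.
BUILD_MACHINES, TRY_MACHINES = 111, 37
BUILD_JOBS = 5601 + 953   # build-pool jobs as quoted above
TRY_JOBS = 4777 + 220     # try-pool jobs as quoted above


def rebalance(build, try_, moved):
    """Return (build, try, build/try ratio) after moving `moved` machines to try."""
    new_build, new_try = build - moved, try_ + moved
    return new_build, new_try, new_build / new_try


job_ratio = BUILD_JOBS / TRY_JOBS
print(f"current machine ratio: {BUILD_MACHINES / TRY_MACHINES:.2f}")
print(f"job ratio (build/try): {job_ratio:.2f}")

for moved in (26, 11):
    build, try_, ratio = rebalance(BUILD_MACHINES, TRY_MACHINES, moved)
    increase = moved / TRY_MACHINES * 100
    print(f"move {moved}: {build}/{try_}={ratio:.2f}, try capacity +{increase:.1f}%")
```

Moving 11 machines gives 100/48 and grows the try pool by roughly 30%, which matches the figures above.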

We should gather a list of hosts for relops to move from one network to the other one.
No renaming is needed.
We would need patches for production_config.py, update statements for slavealloc, and to replace the keys.


[1] https://secure.pub.build.mozilla.org/slavealloc/ui/#silos
[2] https://secure.pub.build.mozilla.org/buildapi/reports/waittimes/trybuildpool?int_size=86400&starttime=1376712000&endtime=1379476800
[3] https://secure.pub.build.mozilla.org/buildapi/reports/builders?starttime=1376712000&endtime=1379476800

#############################

If I look at the SUM of CPU time for each pool, I can get an idea of how much CPU each pool used:
Win32/64 *build* pool - total CPU SUM:  2116h 13m 56s + 673h 19m 18s
Win32/64 *try* pool   - total CPU SUM:              ? + ?

total CPU SUM ratio (build/try): 2789h 33m 14s / ? = ?
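
Adding up the "h m s" totals by hand is error-prone, so here is a small sketch of how the build-pool sum above could be checked. The helper names are hypothetical, and the try-pool figures are still unknown, so the ratio itself is not computed.

```python
import re


def parse_hms(text):
    """Convert a string like '2116h 13m 56s' into a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)h\s+(\d+)m\s+(\d+)s\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds):
    """Format a number of seconds back into the 'Nh Nm Ns' form used above."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h {minutes}m {seconds}s"


# The two build-pool figures quoted above.
build_total = parse_hms("2116h 13m 56s") + parse_hms("673h 19m 18s")
print(format_hms(build_total))
```

Once the try-pool totals can be loaded from the report, the same two helpers would give the build/try CPU ratio by dividing the two sums in seconds.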

For 4 & 5, I checked "platform" (do this first as it improves responsiveness) and unchecked every platform except win2k3 and win64.
WARNING: Loading these reports is terribly slow and often returns Service Unavailable.
[4] https://secure.pub.build.mozilla.org/buildapi/reports/builders?starttime=1376712000&endtime=1379476800
[5]
All of the windows machines in scl1 are on the same VLAN (winbuild.scl1.mozilla.com), so there wouldn't be any movement or renaming required.
oh cool! Unlike the Windows test infra; makes sense.

Thanks for the info!
When we move to scl3, build and try will be different VLANs for security reasons (just like they are for OSX), but in scl1, things are just all sitting on the same VLAN.
Component: Platform Support → Buildduty
QA Contact: coop → armenzg
Whiteboard: [buildduty][capacity][buildslaves]
Assignee: nobody → armenzg
w64-ix-slave64
w64-ix-slave65
w64-ix-slave66
w64-ix-slave67
w64-ix-slave68
w64-ix-slave69
w64-ix-slave70
w64-ix-slave71
w64-ix-slave72
w64-ix-slave73
w64-ix-slave74
Attached patch win64.try.diff
Attachment #808868 - Flags: review?(jhopkins)
Attachment #808868 - Flags: review?(jhopkins) → review+
Attachment #808868 - Flags: checked-in+
Live on production.

Time to update slavealloc, remove the keys and clean up the hosts.
I'm going to reboot w64-ix-slave64 into production and see how it behaves.
The steps to move it into the try pool were very out-of-date.
I've added the remaining hosts; I will check tomorrow.
I actually had forgotten to reboot the following hosts:
w64-ix-slave65
w64-ix-slave66
w64-ix-slave67
w64-ix-slave68
w64-ix-slave69
w64-ix-slave71
w64-ix-slave73

The rm -rf /e/builds/moz2_slave takes forever.
As soon as it is done I will put them into the pool.

So far I see green runs; however, I also see some intermittent make check failures:
TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/try-w32_g-00000000000000000000/build/python/mozbuild/mozbuild/test/frontend/test_sandbox_symbols.py | line 40, test_documentation_formatting: u'List of manifest files defining WebRTC signalling tests.' != ''
TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\basic\bug710947.js | --ion-eager --ion-check-range-analysis
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 926730
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard