Closed Bug 1317723 Opened 5 years ago Closed 5 years ago

Rebalance the Win8 machine pool

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: aselagea)

Attachments

(4 files)

Over in bug 1317434, I'd like to finally enable Win8 e10s tests in production. Now that WinXP tests are no longer running on 53+, we should be able to move a large fraction of those machines over to Win8.

If my reading of Slave Health is correct, we currently have 211 WinXP testers. I'd like to propose moving 151 over to Win8, leaving us with a pool of 60 WinXP test machines to cover Aurora/Beta/Release/ESR45 and ~370 Win8 testers, which is pretty close to what we've got for 10.10 and where backlog seems pretty reasonable these days. 60 machines for WinXP is a bit light, but release branches also aren't as high-volume nor as risk-prone with respect to coalescing, so I don't think a bit of backlog there is a big deal. And buildbot branch prioritization will ensure that mozilla-release still gets first dibs should we find ourselves in a chemspill situation where turnaround time is paramount.

Amy, do those numbers sound reasonable? If so, any reason this can't proceed whenever?
Flags: needinfo?(arich)
Component: General Automation → Buildduty
QA Contact: catlee → bugspam.Callek
Assignee: nobody → aselagea
There are actually 222 XP machines in the pool, but I noticed that some of them are having issues and will likely need a re-image (see the attachments). Those are:

t-xp32-ix-143
t-xp32-ix-144
t-xp32-ix-145
t-xp32-ix-146
t-xp32-ix-147
t-xp32-ix-148
t-xp32-ix-149
t-xp32-ix-150
t-xp32-ix-151
t-xp32-ix-152
t-xp32-ix-154
I disabled those 11 machines in slavealloc. If we stick to keeping 60 XP machines, that would mean moving 162 to the Windows 8 pool.
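For reference, the pool arithmetic behind the two proposals can be sketched as below. This is only an illustration of the subtraction, using the counts quoted in this bug (211 from Slave Health, 222 from the actual pool, 60 to keep on XP). The function name `machines_to_move` is made up for the example and is not part of any releng tooling.

```python
def machines_to_move(total_xp: int, keep_xp: int) -> int:
    """Return how many XP machines can be moved if keep_xp stay on XP."""
    if keep_xp > total_xp:
        raise ValueError("cannot keep more machines than the pool contains")
    return total_xp - keep_xp


# Original estimate, based on the Slave Health count of 211 XP testers.
print(machines_to_move(211, 60))  # 151

# Revised figure, based on the 222 XP machines actually in the pool.
print(machines_to_move(222, 60))  # 162
```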
Attached image xp_issues.PNG
Attached image xp_issues_2.PNG
I'm going to defer load calculations to the releng folks since they have a better idea of their application load and wait on them for a request to move specific machines around.
Flags: needinfo?(arich)
Alin and I talked about that this morning in our standup. We both looked at the load and think that Ryan's proposal is a sensible way forward.
buildbot-configs patch
Attachment #8811338 - Flags: review?(kmoir)
slavealloc new entries for win8
Attachment #8811339 - Flags: review?(kmoir)
I disabled the t-xp32-ix machines in range [061-222].
Comment on attachment 8811338 [details] [diff] [review]
bug_1317723.patch

Alin: Is there a bug for relops to reimage the machines as win8?  I didn't see a dependent bug referenced.
Attachment #8811338 - Flags: review?(kmoir) → review+
Comment on attachment 8811339 [details]
bug_1317723_slavealloc.csv

There's an extra space in the CSV file after the second column that should be removed.

Substitute "win8," for "win8 ,".

needinfo for my question in the previous review
Flags: needinfo?(aselagea)
Attachment #8811339 - Flags: review?(kmoir) → review+
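A stray space like the one flagged in this review could be caught with a small cleanup pass before import. The sketch below is an assumption about how one might do it, not the fix that was actually applied. The helper name `strip_csv_fields` and the sample row are hypothetical, since the real column layout of bug_1317723_slavealloc.csv is not shown in this bug.

```python
import csv
import io


def strip_csv_fields(text: str) -> str:
    """Return CSV text with leading/trailing whitespace removed from every field."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([field.strip() for field in row])
    return out.getvalue()


# Hypothetical row: only the "win8 ," fragment comes from the review comment.
sample = "example-host,win8 ,example-value\n"
print(strip_csv_fields(sample), end="")  # example-host,win8,example-value
```

Stripping every field rather than only the second column is a deliberate simplification; it would also remove whitespace that is intentional, so it is only appropriate if no field is expected to carry padding.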
Depends on: 1318275
Comment on attachment 8811339 [details]
bug_1317723_slavealloc.csv

Fixed and added the entries to slavealloc.
Attachment #8811339 - Flags: checked-in+
Attachment #8811338 - Flags: checked-in+
(In reply to Kim Moir [:kmoir] from comment #10)

> Alin: Is there a bug for relops to reimage the machines as win8?  I didn't
> see a dependent bug referenced.

I filed bug 1318275 for that.
Flags: needinfo?(aselagea)
64 machines were enabled as a first step:

t-w864-ix-236
t-w864-ix-237
..............
t-w864-ix-299

Query OK, 64 rows affected (0.01 sec)
Rows matched: 64  Changed: 64  Warnings: 0

We will monitor these, and if everything is fine and the jobs are green, we will enable the remaining ones.
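The host ranges quoted in this bug (061-222 for the XP machines that were disabled, 236-299 for the first Win8 batch) can be expanded and counted with a short script. This is a minimal sketch that assumes the three-digit, zero-padded suffix seen in the host names above. `host_range` is a hypothetical helper, not a slavealloc command.

```python
def host_range(prefix: str, first: int, last: int, width: int = 3) -> list[str]:
    """Expand an inclusive numeric range into zero-padded host names."""
    return [f"{prefix}-{n:0{width}d}" for n in range(first, last + 1)]


xp_disabled = host_range("t-xp32-ix", 61, 222)
win8_first_batch = host_range("t-w864-ix", 236, 299)

print(xp_disabled[0], xp_disabled[-1], len(xp_disabled))
# t-xp32-ix-061 t-xp32-ix-222 162
print(win8_first_batch[0], win8_first_batch[-1], len(win8_first_batch))
# t-w864-ix-236 t-w864-ix-299 64
```

The counts line up with the figures in the comments: 162 XP machines to move, and 64 rows changed for the first Win8 batch.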
The remaining ones were enabled by Andrei before leaving for the day with the understanding that we'd be watching the results. Things were quite rocky for a while - a lot of the machines didn't start taking jobs until being force-rebooted, leading to a significant backlog in the meantime.

Once they started taking jobs, a not-insignificant number (~20%) had resolution issues that were causing widespread test failures. Many screenshots showed GeForce Experience-related notifications on the screen in addition to being at the wrong resolution. Interestingly enough, rebooting those misbehaving machines was all it took to get the majority acting nicely.

At this point, ix-308 is disabled for ongoing problems that rebooting wasn't fixing. Additionally, there were a few machines that refused to connect to a master even after multiple reboot attempts. Tracking bugs for those machines have been filed. At this point, we have 383 working machines and I'm calling this fixed.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Fixed ix-308 and the remaining machines that refused to connect to a master after being re-imaged.
Depends on: 1319942
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard