Closed Bug 1088839 Opened 10 years ago Closed 10 years ago

Add new Windows test slaves to production

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: coop)

References

Details

Attachments

(1 file)

We're adding a bunch of Windows test capacity in bug 1064404.

Specifically, we're getting the following new slaves:

* t-w732-ix-1[31-62]
* t-w864-ix-1[31-70]
* t-xp32-ix-1[31-62] 

These are already being added to nagios in bug 1088594. 

We will also need to add them to our buildbot-configs, and add them to the graphserver dbs.
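For reference, a minimal Python sketch that expands the hostname ranges above into full slave lists; the dictionary layout here is only an illustration, not the actual buildbot-configs structure:

    # Illustrative only: expand the new slave name ranges listed in this bug.
    # The real buildbot-configs change adds these names to the existing
    # per-platform slave lists; the dict below is an assumed layout.
    def expand(prefix, start, end):
        """Return hostnames prefix-<start> .. prefix-<end>, inclusive."""
        return ["%s-%d" % (prefix, n) for n in range(start, end + 1)]

    NEW_SLAVES = {
        "win7":  expand("t-w732-ix", 131, 162),
        "win8":  expand("t-w864-ix", 131, 170),
        "winxp": expand("t-xp32-ix", 131, 162),
    }

    if __name__ == "__main__":
        for platform, names in sorted(NEW_SLAVES.items()):
            print(platform, len(names), names[0], "...", names[-1])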
Attachment #8511278 - Flags: review?(bugspam.Callek)
I've updated graphserver, both prod and staging.
Comment on attachment 8511278 [details] [diff] [review]
[buildbot-configs] Add new Windows test slaves

Review of attachment 8511278 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't confirm these match the new slaves, but the code change is fine, so stamp+
Attachment #8511278 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8511278 [details] [diff] [review]
[buildbot-configs] Add new Windows test slaves

Review of attachment 8511278 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/buildbot-configs/rev/650bb01f907b
Attachment #8511278 - Flags: checked-in+
I've rebooted all the new slaves via slaveapi. They should start taking jobs shortly.
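For context, a mass reboot through SlaveAPI amounts to one request per host; this is a rough sketch, and the SlaveAPI base URL and the exact /actions/reboot path are assumptions rather than values taken from this bug:

    # Hypothetical sketch: reboot the new win8 slaves via SlaveAPI's REST interface.
    # The base URL and the /actions/reboot endpoint are assumptions here.
    import requests

    SLAVEAPI_URL = "http://slaveapi.example.mozilla.org:8080/slaves"  # hypothetical host

    def reboot(slave):
        # Each action is requested with a POST; SlaveAPI handles it asynchronously.
        resp = requests.post("%s/%s/actions/reboot" % (SLAVEAPI_URL, slave))
        resp.raise_for_status()
        return resp.json()

    # The t-w732-ix and t-xp32-ix ranges would be rebooted the same way.
    for n in range(131, 171):
        reboot("t-w864-ix-%d" % n)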
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/81629636
There was an issue with path escaping in the basedir entries in slavealloc, so none of the new slaves were starting up. This has been fixed, and the slaves rebooted again. They should start taking jobs shortly.

t-w864-ix-163 seems to be missing from DNS. Filed bug 1090577.
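To illustrate the kind of path-escaping problem mentioned above (the actual slavealloc basedir values aren't recorded in this bug), a short Python example of how unescaped backslashes can silently corrupt a Windows basedir:

    # Illustrative only: how backslash escaping can mangle a Windows basedir.
    # The real slavealloc entries involved are not recorded in this bug.
    broken = "C:\slave\test"   # "\t" is interpreted as a tab; the path is silently mangled
    fixed = r"C:\slave\test"   # raw string keeps the backslashes literal

    print(repr(broken))  # 'C:\\slave\test'  -- contains an embedded tab character
    print(repr(fixed))   # 'C:\\slave\\test' -- the path the buildslave actually needs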
The win8 slaves appear to all be running at a too-small resolution, like every single win8 slave that has been reimaged since July (all the rest of which are hanging as dependencies off bug 1067062). So far only a few have hit the honeypot bugs, bug 1003614 and bug 977561, but looking at their results, they have all failed 50%-90% of the jobs they've taken, and the failures I've looked at individually on the few slaves that might possibly not be broken are just new too-small-resolution failures that we haven't yet filed and turned into honeypots.

I'm about to disable every one of them.
Van: In your spot checks, I thought you found that the resolution was correct? Philor: do we have a list of slaves that are impacted?
Flags: needinfo?(vle)
Flags: needinfo?(q)
Based on having actually been starred in either one of the two old test failure bugs that indicate busted resolution, or one of the five new ones we filed without realizing what we were filing yesterday, the following are affected for sure: t-w864-ix-131, t-w864-ix-133, t-w864-ix-134, t-w864-ix-136, t-w864-ix-137, t-w864-ix-140, t-w864-ix-141, t-w864-ix-142, t-w864-ix-146, t-w864-ix-147, t-w864-ix-149, t-w864-ix-151, t-w864-ix-152, t-w864-ix-154, t-w864-ix-155, t-w864-ix-156, t-w864-ix-159, t-w864-ix-160, t-w864-ix-162, t-w864-ix-164, t-w864-ix-165, t-w864-ix-166, t-w864-ix-168, t-w864-ix-169, t-w864-ix-170. I'm 95% sure the list of affected slaves is actually "all of them", t-w864-ix-1[31-70]: all but one have a failure rate far higher than normal, and the one exception, which only failed two jobs, failed in new, unfiled ways that sound like resolution problems.

It's worth keeping in mind, though, that "running at a too-small resolution" is handwaving on my part: the only time I get to directly know the resolution of a slave is when a test hangs, the test harness takes and uploads a screenshot, and I can see the dimensions of the image. Some of the honeypot indicator bugs say that hardware acceleration isn't enabled, either directly (like bug 1090643) or indirectly through "webgl isn't working" (like bug 1090639), while others, like bug 1003614 and probably the new bug 1090640, are more directly about resolution: tests that expect parts of the page to be visible and possible to interact with when they aren't.
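A direct way to confirm the resolution, rather than inferring it from hang screenshots, is to query the primary display size on the slave itself; a minimal sketch for Windows, assuming it is run locally on the host:

    # Minimal sketch: report the primary display resolution on a Windows host.
    # Run locally on the slave; 1024x768 here would match the failures above.
    import ctypes

    user32 = ctypes.windll.user32
    SM_CXSCREEN, SM_CYSCREEN = 0, 1  # primary display width/height constants

    width = user32.GetSystemMetrics(SM_CXSCREEN)
    height = user32.GetSystemMetrics(SM_CYSCREEN)
    print("%dx%d" % (width, height))
    if (width, height) == (1024, 768):
        print("WARNING: still at the too-small 1024x768 resolution")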
We did spot check and they were correct. It seems the issue is with the win8 hosts. Perhaps they didn't reboot into the correct resolution? To confirm, these w8 hosts should be set to onboard video, correct?

>The win8 slaves appear to all be running at a too-small resolution, like every single win8 slave which has been reimaged since July

I'm in PHX, so I'll needinfo Sal to spot check these for us. Is this a driver issue?
Flags: needinfo?(vle) → needinfo?(sespinoza)
Seems like they are all running 1024x768; even after a reboot they show the same resolution.
Would this be a driver issue?
Flags: needinfo?(sespinoza)
Flags: needinfo?(q)
Removed a needinfo from comment 10, sorry about that.
Flags: needinfo?(q)
There are some very odd things here. For 160, the machine believes there is a Dell monitor plugged in that resets the resolution and primary monitor on boot:

http://postimg.org/image/nnox5afi5/

For 134 there is a driver error that was just corrected.

I am looking at the rest right now

Q
Flags: needinfo?(q)
Still diving into root cause. 147, 133, and 134 are all re-enabled and have correct resolutions, the latter two without any manual intervention. I will keep an eye on these tonight and follow suit with the rest.
Blocks: 1091708
Moving troubleshooting to a new issue, bug 1091708.
Depends on: 1091708
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard