Closed Bug 656042 (releng-snapshots) Opened 13 years ago Closed 13 years ago

Create new images of all ref machines

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zandr, Assigned: zandr)

References

Details

We seem to have some confusion about the state of various ref images.

Best to make new images across the board, to avoid confusion.

The machines to clone are:

talos-r3-xp-ref
talos-r3-w7-ref
talos-r3-w764-ref
talos-r3-fed-ref
talos-r3-fed64-ref
talos-r3-leopard-ref
talos-r3-snow-ref

win32-ix-ref
linux-ix-ref
linux64-ix-ref

When creating these images, let's use a datestamp instead of a version number (a good suggestion I haven't been following).

So, for example, name the image:

win32-ix-ref-20110510
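
A minimal sketch of the datestamp naming scheme described above, assuming the suffix is the imaging date formatted as YYYYMMDD. The helper name `image_name` is hypothetical and not part of any existing tooling:

```python
from datetime import date


def image_name(machine, day=None):
    """Return '<machine>-YYYYMMDD'; defaults to today's date if none is given.

    Hypothetical helper, for illustration only.
    """
    day = day or date.today()
    return f"{machine}-{day:%Y%m%d}"


# Example from the comment above: win32-ix-ref imaged on 2011-05-10.
print(image_name("win32-ix-ref", date(2011, 5, 10)))  # win32-ix-ref-20110510
```

Unlike a version number, the datestamp sorts chronologically and records when the image was taken.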
[mashed return too soon]

This is assigned to Dustin to confirm both the list of machines to image (win64 is intentionally excluded) and to verify that the ref machines are in the expected state before we make new images.

Assign to me when you're done, and I'll get some help around here to make the images.
I checked everything but talos-r3-w7-ref and talos-r3-w764-ref, to which I cannot connect.  I fixed the VNC password on fed and fed64, but everything else was correct.

If the w7 systems just need to be turned on, then ping me in IRC when they are turned on and I can quickly check that they are set up correctly.
Assignee: dustin → zandr
I'll need this bug back briefly to install runslave.py on talos-r3-w7-ref, before snapshotting.  It only takes a second (see bug 629692 if you want to do it).
w7 is on, but I'm very suspicious of its current state.

w764 won't boot: What's our recovery path here? Install a new machine from the last refimage and figure out how to update it? Pull an image from a slave in scl1?
w7 is updated with the new runslave.py - ready to roll.

As for w764, I think your suggestion is best - find a suitable spare mini and install the latest refimage on it.  Open up a new bug for that - Armen's probably the best guy to know what's been done since the last refimage.  If you can find the date of that refimage that will be helpful.

I don't need to do anything to w764 for runslave.py (yet), so no need to block on me at that point.

As the blocker list for this bug suggests, it's blocking a lot of activity, so let's get those interns crackin' on it!
Blocks: 639499
To clarify, zandr was poking t-r3-w764-ref.  talos-r3-w764-ref is in DNS as being in scl1, but it doesn't exist.  I'll include that in the DNS overhaul.
Blocks: 651271
Bug 617105 indicates that some reference machines were missing in comment 0.  The whole list is

talos-r3-xp-ref
talos-r3-w7-ref
t-r3-w764-ref
talos-r3-fed-ref
talos-r3-fed64-ref
talos-r3-leopard-ref
talos-r3-snow-ref
win32-ix-ref
linux-ix-ref
linux64-ix-ref
bm-mini-build-ref
moz2-darwin10-ref

I've futzed with all of these boxes, and they're ready to be snapshotted.
> I've futzed with all of these boxes, and they're ready to be snapshotted.
 -- all of them except t-r3-w764-ref, sorry
As for the w764, let's be sure to keep the current latest refimage, and also take a new image from t-r3-w764-011, named
  t-r3-w764-011-201105xx
Alias: releng-snapshots
I just updated the XP and W7 reference machines - hopefully not too late?
Assignee: zandr → jberry
Assignee: jberry → zandr
(In reply to comment #13)
> As for the w764, let's be sure to keep the current latest refimage, and also
> take a new image from t-r3-w764-011, named
>   t-r3-w764-011-201105xx

Make that t-r3-w764-009, named
  t-r3-w764-009-201105xx (xx being whatever day you image it)

There are only three working w764 testers: 009, 005, and 006.
(In reply to comment #15)
 
> There are only three working w764 testers: 009, 005, and 006.

And yet, I have only bug 651272, which is a reimage not for failure but coming back from a developer loan.

Could someone with more info file a bug about this?
We only recently noticed that the rest are down, and the general consensus seems to be that we don't care.  I suppose it'd be nice to figure out why, but that may not be so helpful in the end - I suspect we'll need to re-build these testers once we have a Windows admin onboard.
Blocks: 659933
Blocks: 659934
Blocks: 659935
Blocks: 659936
Blocks: 659937
Blocks: 659938
(In reply to comment #17)
> We only recently noticed that the rest are down, and the general consensus
> seems to be that we don't care.  I suppose it'd be nice to figure out why,
> but that may not be so helpful in the end - I suspect we'll need to re-build
> these testers once we have a Windows admin onboard.
The machines are down because they were not adjusted to point to valid masters.
After a long idle period, the SSH and VNC servers stop working properly (even RDP in some cases). A modified buildbot.tac and a reboot get them back into shape.

As Dustin mentions, we have decided that this pool has not been in use for several quarters because no one has cared enough.
We still want Win64 builders at some point, but no one is pushing for them right now. We can figure out a pool of Win64 testers once we get there (rev2 as an emergency measure and the new testing pool as the right long-term solution).
No longer blocks: 659118
I'm told that the W7 reference machine may also be bum, so take an image of talos-r3-w7-005 instead.  I've disabled this in slavealloc, but please
 1. ping in #build before imaging so we can make sure it's not still running a build
 2. change its machine name back to talos-r3-w7-ref and reboot before imaging it, lest newly-imaged machines all think they are named talos-r3-w7-005.
(In reply to comment #19)

>  2. change its machine name back to talos-r3-w7-ref and reboot before
> imaging it, lest newly-imaged machines all think they are named
> talos-r3-w7-005.

Actually, my intention here (and on the w764-ref) is to write the image back to the ref machine and fix it there. This way I can return w7-005 and w764-009 to service quickly.
Sounds good.  Then for w7-005, no need to do anything except to let us know when to re-enable it.  For w764-009, if you can edit c:\talos-slave\buildbot.tac and change the slave hostname to t-r3-w764-ref, that should prevent it from burning builds when it comes back up on the ref machine.
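
The buildbot.tac edit suggested above could be scripted roughly as follows. This is a sketch only: it assumes the file contains a plain `slavename = '...'` assignment, which has not been checked against the actual file on these machines, and the helper name `set_slavename` is hypothetical:

```python
import re


def set_slavename(tac_text, new_name):
    """Return tac_text with the value of the `slavename` assignment replaced.

    Assumes a line of the form  slavename = '<name>'  (single or double quotes).
    """
    pattern = re.compile(r"^([ \t]*slavename[ \t]*=[ \t]*)(['\"]).*?\2", re.MULTILINE)
    # A function replacement keeps new_name from being parsed as a regex template.
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{m.group(2)}{new_name}{m.group(2)}", tac_text
    )
    if count == 0:
        raise ValueError("no slavename assignment found")
    return new_text


# Illustrative file contents, not copied from a real machine.
sample = "slavename = 't-r3-w764-009'\npasswd = 'secret'\n"
print(set_slavename(sample, "t-r3-w764-ref"))
```

On the machine itself, the text would be read from and written back to c:\talos-slave\buildbot.tac before the slave is allowed back on the network.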
I have made images from w764-009 and w7-005 and they're coming back up.

When I write the images back to the ref machines, I'll keep them off the network until I can fix up the hostnames.
Status: NEW → UNCONFIRMED
Ever confirmed: false
This bug is a giant frayed tapestry that makes useful work look blocked. Tearing out the remaining work into new bugs and marking fixed.
Status: UNCONFIRMED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
No longer blocks: 656086
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations