Closed Bug 683228 Opened 13 years ago Closed 12 years ago

issues cloning win32-ix-ref

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
Windows Server 2003
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: dividehex)

Details

This will help us determine whether buildbot-0.8 and the latest NSC.ini are being passed to the cloned slave from the latest win32-ix-ref snapshot.

I had previously made snapshots of win32-ix-ref (in mtv1) at the beginning of the month and again earlier last week. Both of those snapshots worked when building a server in mtv1, but the image that was copied over to scl1 caused every machine it was used on to fail on reboot with filesystem corruption on the C: drive.

To try to narrow down the cause, I have performed the following tests:

* (success) built a machine in mtv1 with 20110803
* (success) built a machine in mtv1 with 20110830
* (success) copied the 20110803, 20110826, and 20110830 images from mtv1 to scl1 and ran MD5 checksums against all files
* (success) ran diskutil info /Volumes/Deploy on pxe1 and confirmed that the SMART status was reported as Verified
* (success) ran dmesg on pxe1 and did not see any disk errors

* (fail) tried the 20110803, 20110826, and 20110830 images in scl1 on hardware that had not yet been successfully returned to active service
* (fail) tried the 20110830 image in scl1 on hardware pulled directly from active duty (known working)
* (fail) switched one machine in scl1 from IDE to AHCI mode and tried an install of 20110830
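The checksum test above can be sketched as a tree comparison. This is a minimal Python sketch, not the procedure that was actually used; the function names are hypothetical, and it assumes both copies of an image directory are reachable from one machine. It also compares permission bits, because matching MD5 sums say nothing about file modes.

```python
import hashlib
import stat
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks (segments can be ~2 GB)."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_image_dirs(src: Path, dst: Path) -> dict:
    """Compare two copies of an image directory by relative path (hypothetical helper).

    Reports files missing on either side, files whose MD5 differs, and files
    whose permission bits differ (a content checksum cannot catch the latter).
    """
    src_files = {p.relative_to(src) for p in src.rglob("*") if p.is_file()}
    dst_files = {p.relative_to(dst) for p in dst.rglob("*") if p.is_file()}
    report = {
        "missing_in_dst": sorted(src_files - dst_files),
        "missing_in_src": sorted(dst_files - src_files),
        "content_differs": [],
        "mode_differs": [],
    }
    for rel in sorted(src_files & dst_files):
        a, b = src / rel, dst / rel
        if md5_of(a) != md5_of(b):
            report["content_differs"].append(rel)
        if stat.S_IMODE(a.stat().st_mode) != stat.S_IMODE(b.stat().st_mode):
            report["mode_differs"].append(rel)
    return report
```

For a faithful copy, every list in the report returned by compare_image_dirs(Path(source_copy), Path(target_copy)) should be empty; both paths are placeholders for wherever the mtv1 and scl1 copies are mounted, and the script needs read access to every segment.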

At this point I speculate that something is wrong with:

* the disk in pxe1 (but we haven't seen errors)
* the copy from mtv1 to scl1 (but the checksums match)
* deploystudio on pxe1 (but older images install fine)

I can't think of any other likely culprits at this point.
Assignee: server-ops-releng → arich
Summary: please clone w32-ix-slave34 off win32-ix-ref → issues cloning win32-ix-ref
w32-ix-ref was updated this morning to buildbot-0.8.4p1, so at least one more snapshot will need to be taken.
I've taken another snapshot, w32-ix-ref-20110906.
We took this machine to scl1 to see if we could pull a good image off of it there. It booted into DeployStudio, and Matt chose the option to take an image and provided a name and description; the machine then started taking the image and failed immediately. I'm not sure we've ever taken an image of an IX server in scl1, so this might just be due to how the DS server there is configured (i.e., it's failing because of the DS side, not because of win32-ix-ref). We didn't have time for a deeper investigation, so the machine is in Matt's car waiting to be returned to mtv1.
I'm going to assign this to Matt in the hope that he can figure out what the issue with the w32 image is. I don't know whether it's the transfer method, different versions of DS, or something else, but the image copied over to scl1 still does not work.
Assignee: arich → mlarrain
Assignee: mlarrain → jwatkins
If the DSPC server in scl1 is readily removable (e.g., in a Sonnet rack on top), maybe bringing it to mtv1 to test would help? The obvious conclusion seems to be that DSPC/mtv1 is producing images incompatible with DSPC/scl1 (both the pxe1 and installN versions).

The other possibility I can imagine is that some metadata is stored elsewhere in DSPC's internals, and simply copying the data files over is no longer sufficient.

Finally, you could always fire up FOG. Cruddy, but it works.
I have a feeling this is due to incorrect file permissions. I came across these errors while moving the image repo from install to install2:

rsync: send_files failed to open "/Volumes/deploy/Masters/PC/w32-ix-ref-20110830/sda1.gz.001": Permission denied (13)
rsync: send_files failed to open "/Volumes/deploy/Masters/PC/w32-ix-ref-20110830/sda1.gz.002": Permission denied (13)
rsync: send_files failed to open "/Volumes/deploy/Masters/PC/w32-ix-ref-20110830/sda5.gz.001": Permission denied (13)
rsync: send_files failed to open "/Volumes/deploy/Masters/PC/w32-ix-ref-20120226/sda1.gz.001": Permission denied (13)
rsync: send_files failed to open "/Volumes/deploy/Masters/PC/w32-ix-ref-20120226/sda1.gz.002": Permission denied (13)
rsync: send_files failed to open "/Volumes/deploy/Masters/PC/w32-ix-ref-20120226/sda5.gz.001": Permission denied (13)

pxe1:PC administrator$ ls -l w32-ix-ref-20110830/
total 14820800
-rw-r--r--+ 1 root  staff         361 Aug 30  2011 StructDisk
-rw-r--r--+ 1 root  staff         512 Aug 30  2011 mbr
-rw-r--r--+ 1 root  staff  2136806717 Aug 30  2011 sda1.gz.000
-rw-------+ 1 root  staff  2136798379 Aug 30  2011 sda1.gz.001
-rw-------+ 1 root  staff   706034818 Aug 30  2011 sda1.gz.002
-rw-r--r--+ 1 root  staff  2136809281 Aug 30  2011 sda5.gz.000
-rw-------+ 1 root  staff   470626575 Aug 30  2011 sda5.gz.001
-rw-r--r--+ 1 root  staff     1151097 Aug 30  2011 sda6.gz.000
pxe1:PC administrator$
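The listing shows three segments with mode -rw-------, which only root can read. A minimal sketch for finding files like these, assuming a missing world-read bit is the condition worth flagging; the function name is hypothetical, and the macOS ACLs indicated by the trailing + in ls -l are not inspected:

```python
import stat
from pathlib import Path


def unreadable_by_others(root: Path) -> list:
    """List regular files under root whose mode lacks the world-read bit (o+r).

    Only POSIX mode bits are checked; macOS ACLs (the '+' in `ls -l`) are ignored.
    Listing and stat'ing needs directory access only, not read access to the files.
    """
    hits = []
    for p in sorted(root.rglob("*")):
        if p.is_file() and not (p.stat().st_mode & stat.S_IROTH):
            hits.append(p)
    return hits
```

Run against /Volumes/deploy/Masters/PC on pxe1, this would be expected to report, among others, the sda1.gz.001, sda1.gz.002, and sda5.gz.001 segments of w32-ix-ref-20110830, the same files rsync could not open.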
Wow. Just... wow. You'd think that DSPC would give you SOME KIND OF ERROR if this were the case, instead of indicating that everything installed swimmingly. Easy enough to test, though. I wonder whether they need rx or just r. I've added r on pxe1 for now, and I see that +r has already been added on install.
Confirmed this was a permissions issue. I set the POSIX permissions to 777, since the runtime script actually tries to apply that mode to all files in the Masters/PC directory when it loads. I tested w32-ix-ref-20110830 from pxe1 with no issues.
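The fix described here amounts to chmod -R 777 on the image repository. A hedged Python equivalent, with a hypothetical function name, that skips symlinks and leaves ACLs alone; it has to run as root (or the file owner), since the segments are owned by root:

```python
import os
import stat
from pathlib import Path


def chmod_tree(root: Path, mode: int = 0o777) -> int:
    """Set `mode` on root and on every file and directory beneath it (hypothetical helper).

    Mirrors `chmod -R 777` by default and returns how many entries changed.
    macOS ACLs (the '+' in `ls -l`) are not modified.
    """
    changed = 0
    # Materialize the walk first so chmod calls cannot disturb the iteration.
    for p in [root, *root.rglob("*")]:
        if p.is_symlink():
            continue  # skip symlinks so os.chmod does not alter their targets
        if stat.S_IMODE(p.stat().st_mode) != mode:
            os.chmod(p, mode)
            changed += 1
    return changed
```

777 matches what the runtime script applies on load; whether read access alone would have been enough was not established in this bug.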
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations