Closed Bug 690951 Opened 13 years ago Closed 13 years ago

Image 70 more rev4 slaves as talos-r4-snow-XXX slaves

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: mlarrain)

References

Details

Once we are confident that these testing slaves are valid, we will need to image 78 more slaves with the same image.  This bug will eventually be moved to the serverops::releng component when we are ready for work to begin.  Filing this way to give a heads up.

My aim is to have my testing completed by Wednesday Oct 5, 2011.

Amy, not sure who to CC on this bug for the heads up so please add anyone that should know about this.
Depends on: 671415
No need for Amy to do anything with this - when the rev4 minis test good, assign this bug to Server Operations: RelEng and then it will become actionable by relops
Assignee: nobody → jhford
Summary: [filing in advance] Image 78 more rev4 slaves as talos-r4-snow-XXX slaves → Image 78 more rev4 slaves as talos-r4-snow-XXX slaves
70 of these are already ready to get imaged - DNS, DHCP, and inventory are finished.  These have been selected rather carefully to put 20 in each rack, and as far as I know this reflects the original desire to have 80 of each OS.

I know you mentioned in the parent bug that you want a static number, but given that I've already done the legwork, can we do 70 instead of 78?  That leaves 89 free machines: 80 for lion, two refs, and a few to support development of better imaging techniques and eventually fold into the lion and snow pools (or make into foopies?)
The total number in scl1 also reflects the 22 machines that were bought for momo, so releng needs to decide how those are going to be brought into the fold.  It may be that we have fewer than 80 minis per OS strictly for firefox testing.  I suspect that releng directors/managers need to make this decision.
I overstated the case in comment 2 -- as far as *my* work is concerned, these 10+70=80 are ready to go.  I didn't mean to imply that power/network was done, nor that imaging is known to work now.
If you'd prefer 80, that number is OK with me.  I just want a number that isn't constantly changing.  If we need to move the goalposts later, let's do that.  Updating bug summary to reflect the change.
Summary: Image 78 more rev4 slaves as talos-r4-snow-XXX slaves → Image 70 more rev4 slaves as talos-r4-snow-XXX slaves
I have an image that looks like it is ready to be deployed.  I need to make two small file changes to the refimage.  I would also like to test the post image process on a single machine.

fwiw, the two things I would like to change are:
-point machines to scl-production-puppet.build.scl1.mozilla.com
-remove ssl cert from staging-puppet.build.mozilla.org (rm -rf /etc/puppet/ssl/certs/*)
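
Roughly, assuming the client config lives at /etc/puppet/puppet.conf with a plain "server =" key (adjust if the ref image keeps it elsewhere), those two tweaks are just:

# 1) point the agent at the production puppet master
sudo sed -i.bak -e 's/^\( *server *=\).*/\1 scl-production-puppet.build.scl1.mozilla.com/' /etc/puppet/puppet.conf
# 2) drop the cert issued by staging-puppet so the host requests a fresh one on first boot
sudo rm -rf /etc/puppet/ssl/certs/*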
(In reply to John Ford [:jhford] from comment #6)
> I have an image that looks like it is ready to be deployed.  I need to make
> two small file changes to the refimage.  I would also like to test the post
> image process on a single machine.

...before imaging the other 69 machines.
On irc you said that you had uploaded an image to fs2, but you didn't move this bug over to the relops queue or provide a path to the image name, so I'm guessing we still aren't good to go here.  Please keep in mind that it will likely take hours to transfer the image to scl1 after you are done with it, so we'll need a day (to transfer the image and get someone onsite in the datacenter) to do the first image after you assign this bug to the relops queue and provide the image.  

How long does your test of the first deployed image need?  Is this something that's going to take a few minutes (and we can do this live with you while someone is in the datacenter), or should we be planning on a trip to the datacenter for one image, and then not returning till the next day (or week, or whatever timeframe you need) to start imaging the rest of the machines after you've done a verification?
(In reply to Amy Rich [:arich] [:arr] from comment #8)
> On irc you said that you had uploaded an image to fs2, but you didn't move
> this bug over to the relops queue or provide a path to the image name, so
> I'm guessing we still aren't good to go here.  Please keep in mind that it
> will likely take hours to transfer the image to scl1 after you are done with
> it, so we'll need a day (to transfer the image and get someone onsite in the
> datacenter) to do the first image after you assign this bug to the relops
> queue and provide the image.  

I transferred this image to an scl1-based machine instead, as discussed on irc with Dustin.

> How long does your test of the first deployed image need?  Is this something
> that's going to take a few minutes (and we can do this live with you while
> someone is in the datacenter), or should we be planning on a trip to the
> datacenter for one image, and then not returning till the next day (or week,
> or whatever timeframe you need) to start imaging the rest of the machines
> after you've done a verification?

The test should be about 15-30 minutes and is just to test syncing up with puppet.
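
For reference, the sync test is just a one-off agent run against the production master; the exact binary depends on which puppet version is on the ref image, so treat this as a sketch:

sudo puppetd --test --server scl-production-puppet.build.scl1.mozilla.com
# or, with newer puppet:
sudo puppet agent --test --server scl-production-puppet.build.scl1.mozilla.com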
(In reply to John Ford [:jhford] from comment #9)
> (In reply to Amy Rich [:arich] [:arr] from comment #8)
> > On irc you said that you had uploaded an image to fs2, but you didn't move
> > this bug over to the relops queue or provide a path to the image name, so
> > I'm guessing we still aren't good to go here.  Please keep in mind that it
> > will likely take hours to transfer the image to scl1 after you are done with
> > it, so we'll need a day (to transfer the image and get someone onsite in the
> > datacenter) to do the first image after you assign this bug to the relops
> > queue and provide the image.  
> 
> I transferred this image to an scl1-based machine instead, as discussed on irc
> with Dustin.


Great, we need to know the machine name and file path to wherever you put the image.

> > How long does your test of the first deployed image need?  Is this something
> > that's going to take a few minutes (and we can do this live with you while
> > someone is in the datacenter), or should we be planning on a trip to the
> > datacenter for one image, and then not returning till the next day (or week,
> > or whatever timeframe you need) to start imaging the rest of the machines
> > after you've done a verification?

> The test should be about 15-30 minutes and is just to test syncing up with
> puppet.

Cool, digipengi can schedule a time with you to do this tomorrow, then.
My tests are showing that the rev4 10.6 minis are good to go.

Please do the following:
1) restore my image [1] to a Mini
2) give me access to this mini to make two changes to the puppet configs (see comment 7).  I will post in this bug when I finish these items (~5 minutes)
3) when I have reported back from (2), please take an image of the machine with whatever tool you will deploy with
4) restore this image to a single slave and communicate hostname to me so I can test that it is syncing with puppet and connecting to buildbot.  I will report back to this bug about the status of my test (~15-30m).  This machine does not need to be reimaged and should be considered production once it connects to buildbot.

Once I have reported from (4) and everything is ready to go, I will be explicit in my request for the remaining 69 machines to be imaged.

Timing:
I have a hard stop today at 5:30pm PDT.  I will be available tomorrow, Monday, and Tuesday, but am on PTO next Wednesday to Friday.

Thanks!


[1] linux-ix-slave03.build.mozilla.org:/builds/slave/talos-r4-snow-ref-v2-puppeted-2010-09-28.dmg
(md5) 7902e231a8bfdcc00ab033f6c9773011 talos-r4-snow-ref-v2-puppeted-2010-09-28.dmg
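
For whoever does the restore, a rough sketch of pulling and sanity-checking the image is below.  asr is only an example restore tool (use whatever relops normally deploys with), and /dev/disk0s2 is a placeholder for the mini's target volume.

scp linux-ix-slave03.build.mozilla.org:/builds/slave/talos-r4-snow-ref-v2-puppeted-2010-09-28.dmg .
md5 talos-r4-snow-ref-v2-puppeted-2010-09-28.dmg
# should print 7902e231a8bfdcc00ab033f6c9773011
sudo asr restore --source talos-r4-snow-ref-v2-puppeted-2010-09-28.dmg --target /dev/disk0s2 --erase --noprompt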
and... a midair collision caused my changes to the component to be lost.  Please read comment 11; it is the request to do a test of the imaging process, followed by imaging the remaining machines.


Quick clarification, we do need dongles for the puppet tests to work properly.  Please make sure that whichever machines are used for the work requested in comment 11 have a dongle for the duration of the work on them.  These machines should not go into production without a dongle.  I'll make sure not to point them to production masters in slavealloc.
Assignee: jhford → server-ops-releng
Severity: normal → major
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
jhford says we can repurpose rev4-testing as he doesn't need it anymore.  digipengi, please take that machine and the dongle from it with you to scl1.  The dongle will be required for tomorrow's imaging test.
And a FW800 cable, if one isn't already there.
linux-ix-slave03 is not in scl1, so Matt's capturing the image to his laptop now, while he's in mtv1.  For reference, you can tell where a machine is with 'host':

[dustin@boris ~]$ host linux-ix-slave03.build.mozilla.org
linux-ix-slave03.build.mozilla.org is an alias for linux-ix-slave03.build.mtv1.mozilla.com.
Depends on: 692421
Assignee: server-ops-releng → mlarrain
Severity: major → critical
I have made the modifications and Matt has captured that hard disk to an image called "talos-r4-snow-ref-20111007"
Matt imaged talos-r4-snow-079 for me using "talos-r4-snow-ref-20111007".  Aside from needing to clean a stray cert on the puppet master, the machine ran puppet cleanly and attached to my staging master with no client side intervention.

On the basis of the above test, I hereby declare that "talos-r4-snow-ref-20111007" is a good image and should be deployed to 69 more rev4 minis.

Since there aren't dongles ready for these machines, please:

1) image talos-r4-snow-011 - talos-r4-snow-078 and talos-r4-snow-080. talos-r4-snow-079 should not be disturbed.
2) check that DNS and DHCP work
3) check that the machine can be reached by ssh
4) update this bug stating the hostname(s) that have been imaged but lack dongles
  e.g. "talos-r4-snow-011 imaged and pingable, lacking dongle"

These machines will talk to scl-production-puppet to exchange certs on boot, but we don't expect them to attach to buildbot because of missing dongles.

When dongles arrive, they should be plugged in.  Releng confirmation is not needed to plug in the dongles.  Once the dongle is plugged in, it should not be unplugged.  Please update this bug with hostnames when a dongle is attached so I know when I can set the slave to report to production and reboot the slave.  Example: "talos-r4-snow-011 imaged and pingable with dongle"
Tested another slave, talos-r4-snow-075, and it is connecting to buildbot after cleaning the puppet cert on the master.
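
For anyone else hitting this: "cleaning the cert" just means removing the stale signed client cert on the puppet master so the re-imaged host can submit a new one.  The exact command depends on the puppet version, and the FQDN below is a guess:

puppetca --clean talos-r4-snow-075.build.scl1.mozilla.com
# or, with newer puppet:
puppet cert clean talos-r4-snow-075.build.scl1.mozilla.com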
It looks like there is a contiguous block of slaves imaged between 031 and 080.  I assume that none have dongles aside from 079 and 075; is this correct?  I am having trouble connecting to 046 and 077 by hostname.  I am also seeing two machines connecting to the puppet master using their asset tag based hostnames.  I wonder if they are the same machines.

r4-mini-06055.build.scl1.mozilla.com
r4-mini-06186.build.scl1.mozilla.com
These should not have been imaged:

r4-mini-06055.build.scl1.mozilla.com
r4-mini-06186.build.scl1.mozilla.com

These don't have MAC addresses in DHCP, so they won't come up (Matt's going to get me this data so I can add it to inventory and DHCP).

talos-r4-snow-017
talos-r4-snow-046

As for talos-r4-snow-077, that will need Matt to take a look.
From my tests, it looks like the following host is missing a dongle:
-talos-r4-snow-072

I am unable to ssh/vnc into:
-talos-r4-snow-024
-talos-r4-snow-028
-talos-r4-snow-030
-talos-r4-snow-077

The following hosts are consistently not responding to ping:
-talos-r4-snow-017
-talos-r4-snow-023
-talos-r4-snow-025
-talos-r4-snow-026
-talos-r4-snow-027
-talos-r4-snow-029
-talos-r4-snow-045
-talos-r4-snow-046

Would you prefer that I file new bugs for these hosts or should this be tracked here?
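
A sweep like the following is enough to reproduce these checks; the FQDN suffix and ssh user are assumptions on my part:

for n in 017 023 024 025 026 027 028 029 030 045 046 072 077; do
  h="talos-r4-snow-${n}.build.scl1.mozilla.com"
  if ! ping -c 1 "$h" >/dev/null 2>&1; then
    echo "$h: no ping"
  elif ! ssh -o ConnectTimeout=5 "cltbld@$h" true >/dev/null 2>&1; then
    echo "$h: pings but no ssh"
  else
    echo "$h: ok"
  fi
done
# (add your platform's ping timeout flag if the dead hosts make this slow)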
I'll head over there in a few and check on them soon.
Matt confirmed that 072 has a dongle.  Oddly, it is now showing as having two monitors installed:

talos-r4-snow-072:~ cltbld$ screenresolution get
Display 0: 1600x1200x32
Display 1: 800x600x32
(In reply to John Ford [:jhford] from comment #23)
> Matt confirmed that 072 has a dongle.  Oddly it is now showing as having two
> monitors installed
> 
> talos-r4-snow-072:~ cltbld$ screenresolution get
> Display 0: 1600x1200x32
> Display 1: 800x600x32

A crash cart was plugged into the machine.  Removing the crash cart and rebooting showed that the machine automatically set the correct resolution.
024, 028, and 030 were all in rack 201, which I was still in the process of imaging/dongling.  They should be good now.  As for talos-r4-snow-077, it was giving me a weird display that looked like static on an old TV.  Upon rebooting I found that the machine was at the OS X setup screen.  Reimaged per jhford's request.
017 is still not pingable for me.  This is the only host that I have had trouble with.  Should we keep this bug open for 017 or should we file a new bug for that host?
I added 017 to https://bugzilla.mozilla.org/show_bug.cgi?id=693910 (the reboots bug).

Aside from that, is this bug finished?
Closing bug - the machines have been imaged and are in production; anything else can become new bugs
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations