Closed Bug 683718 (Opened 13 years ago; Closed 13 years ago)

Prepare 10 rev4 minis for a medium scale test

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: jhford)

References

Details

With the exception of a puppet-specific issue, our puppet manifests are looking good in testing.  I would like to start a medium scale test with 10 Rev4 minis.

Please prepare 10 Rev4 minis with the hostnames:

talos-r4-snow-001
talos-r4-snow-002
talos-r4-snow-003
talos-r4-snow-004
talos-r4-snow-005
talos-r4-snow-006
talos-r4-snow-007
talos-r4-snow-008
talos-r4-snow-009
talos-r4-snow-010

These machines must have (a scripted sketch of steps 2 and 4-6 follows this list):
1) Mac OS X 10.6 installed
2) 10.6.8 v1.1 update applied, from:
   curl -LO http://support.apple.com/downloads/DL1399/en_US/MacOSXUpdCombo10.6.8.dmg
3) User created with the following details:
   Full Name: Client Builder
   User Name: cltbld
   Password to be communicated to Release Engineering
4) VNC and SSH sharing enabled.  This can be done by
   launching System Preferences, going to 'Sharing' and ticking
   the 'Screen Sharing' and 'Remote Login' settings
5) 'cltbld' set to automatically log into a console session.
   This can be done by launching System Preferences, going to
   'Accounts', clicking the padlock to unlock the preference pane,
   pressing 'Login Options', then selecting "Client Builder" from
   the "Automatic Login" list
6) Puppet v 0.24.8 installed. This can be done by running
   curl -LO http://downloads.puppetlabs.com/gems/facter-1.5.6.gem
   curl -LO http://projects.puppetlabs.com/attachments/download/584/puppet-0.24.8.gem
   sudo gem install facter-1.5.6.gem puppet-0.24.8.gem
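
For anyone doing this over SSH rather than clicking through System
Preferences, here is a rough sketch of steps 2 and 4-6 as shell commands.
This is untested on these exact machines; the volume and package names
inside the combo updater DMG and the screensharing launchd path are
assumptions, so verify them before running.
---
# Step 2: apply the 10.6.8 v1.1 combo update (package path is an assumption)
curl -LO http://support.apple.com/downloads/DL1399/en_US/MacOSXUpdCombo10.6.8.dmg
hdiutil attach MacOSXUpdCombo10.6.8.dmg
sudo installer -pkg "/Volumes/Mac OS X Update Combined/MacOSXUpdCombo10.6.8.pkg" -target /
hdiutil detach "/Volumes/Mac OS X Update Combined"

# Step 4: enable Remote Login (SSH) and Screen Sharing (VNC)
sudo systemsetup -setremotelogin on
sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.screensharing.plist

# Step 5: auto-login for cltbld.  A complete setup also needs the
# obfuscated password written to /etc/kcpassword; that step is omitted here.
sudo defaults write /Library/Preferences/com.apple.loginwindow autoLoginUser cltbld

# Step 6: install the gems, then sanity-check what landed
curl -LO http://downloads.puppetlabs.com/gems/facter-1.5.6.gem
curl -LO http://projects.puppetlabs.com/attachments/download/584/puppet-0.24.8.gem
sudo gem install facter-1.5.6.gem puppet-0.24.8.gem
facter --version && gem list | grep -E 'puppet|facter'
---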
Summary: Prepare 10 minis for a medium scale test → Prepare 10 rev4 minis for a medium scale test
Blocks: 683720
Per meeting with IT yesterday:

* Erica said these machines were waiting on cabling and should be online today. They'll need to be imaged following that, obviously.
(In reply to Chris Cooper [:coop] from comment #1)
> Per meeting with IT yesterday:
> 
> * Erica said these machines were waiting on cabling and should be online
> today. They'll need to be imaged following that, obviously.

Are these machines online?
As mentioned in the overarching mini bug, all of the hardware work on the minis had to be redone this week.  

At this point, we have 10 minis with the base os install that comes with the system and a cltbld user (same passwd as all other machines) with screen sharing and ssh enabled.  We still need to reboot these minis so that they obtain their correct IP addresses, since we didn't have time to do that before we had to leave the datacenter tonight.  These minis do not yet have dongles on them, though zandr says we do have 20 of them on hand (just not at scl1).

jhford, do you still want the updates applied, considering your comments about rev-testing2 and not wanting the updates applied there?
Assignee: server-ops-releng → arich
DNS also isn't fixed for these minis, although the IPs are in DHCP.  I can get to that tomorrow (Friday).
These minis are all now responding to ssh via IP.
(In reply to Amy Rich [:arich] from comment #3)
> As mentioned in the overarching mini bug, all of the hardware work on the
> minis had to be redone this week.  

:S

> At this point, we have 10 minis with the base os install that comes with the
> system and a cltbld user (same passwd as all other machines) with screen
> sharing and ssh enabled.  We still need to reboot these minis so that they
> obtain their correct IP addresses, since we didn't have time to do that
> before we had to leave the datacenter tonight.  These minis do not yet have
> dongles on them, though zandr says we do have 20 of them on hand (just not
> at scl1).

comment 5 in this bug suggests that these were rebooted; is that correct?

> jhford, do you still want the updates applied, considering your comments
> about rev-testing2 and not wanting the updates applied there?

Comment 0 is still correct for these minis; please install the 10.6.8 v1.1 update.  I asked for slightly different requirements in the rev4-testing2 bug because I wanted to save time there.
They were rebooted, and are now in DNS.  I'll take care of the updates.
Assignee: arich → dustin
For future reference, I made a copy of the updater in fs2:/IT/Apple.

I downloaded the updater and ran it by hand; softwareupdate -i didn't seem to want to do it.
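
For reference, the softwareupdate attempt would have looked something like
the following.  The label is a guess; it comes from softwareupdate -l output,
and a locally downloaded combo updater may not be offered in the catalog at
all, which would explain why it balked:
---
softwareupdate -l                        # list available update labels
sudo softwareupdate -i 'MacOSXUpd10.6.8' # install by label (label assumed)
---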

I ran the requested gem install.

talos-r4-snow-010 isn't responding to ssh or vnc, although it responds to ping.
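
A quick way to separate a network problem from dead services on that box
(nc ships with the OS; these are the standard ssh and vnc ports):
---
nc -z talos-r4-snow-010 22     # ssh
nc -z talos-r4-snow-010 5900   # vnc / screen sharing
---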

talos-r4-snow-006 failed while installing puppet:
---
Installing RDoc documentation for facter-1.5.6...
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rdoc/template.rb:137: [BUG] Segmentation fault
ruby 1.8.7 (2009-06-12 patchlevel 174) [universal-darwin10.0]

Abort trap
---
Re-running the install worked.  Hard to say whether that's hardware or software.

So everything but talos-r4-snow-010 is up and running.  John, I assume that "9" is close enough to "10" for a medium scale test?  I'll file a separate bug for -010, but deal with it at a more leisurely pace, if that's OK.
Assignee: dustin → jhford
Yes, missing that machine while it gets fixed up is fine
(In reply to John Ford [:jhford] from comment #10)
> Yes, missing that machine while it gets fixed up is fine

Resolving based on comment #10.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Actually, these machines are missing dongles. Please close this bug when the dongles are installed on the 9 functional machines.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: jhford → server-ops-releng
Assignee: server-ops-releng → mlarrain
Severity: normal → critical
I just spoke to zandr, and he said that emux will be handling this tonight.  Reassigning.
Assignee: mlarrain → emuxlow
Hey, emux, did these get installed last night?  jhford needs them by 9:00 today.  Thanks!
The dongles are attached and the machines are ready to go.
Assignee: emuxlow → arich
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Reopening bug.  These machines, as installed, have iLife on them.  Because we don't know what impact having iLife installed might have, we need to have these machines set up again without it.

Because addenda are scattered across comments in this and other bugs, here are the consolidated, updated requirements.

These machines must have (a scripted spot-check sketch follows this list):
1) The hard drive with pre-installed Mac OS X erased
2) Mac OS X 10.6 installed from the OS recovery DVD included with the hardware.  Note that the
   software on the accompanying Applications DVD should not be installed.
3) 10.6.8 v1.1 update applied, from:
   curl -LO http://support.apple.com/downloads/DL1399/en_US/MacOSXUpdCombo10.6.8.dmg
4) User created with the following details:
   Full Name: Client Builder
   User Name: cltbld
   Password to be communicated by Release Engineering
5) VNC and SSH sharing enabled, with the VNC password set to the 'cltbld' password.
   This can be done by launching System Preferences, going to 'Sharing' and ticking
   the 'Screen Sharing' and 'Remote Login' settings.  To set the VNC password, select
   the 'Screen Sharing' item from the checkbox list and press 'Computer Settings'.  On
   the sheet that drops down, tick the 'VNC viewers may control screen with password:'
   checkbox, then enter the communicated 'cltbld' user password.
6) 'cltbld' set to automatically log into a console session.  
   This can be done by launching System Preferences, going to 
   'Accounts', clicking the padlock to unlock the preference pane,
   pressing 'Login Options' then selecting "Client Builder" from
   the "Automatic Login" list
7) Puppet v 0.24.8 installed. This can be done by running
   curl -LO http://downloads.puppetlabs.com/gems/facter-1.5.6.gem
   curl -LO http://projects.puppetlabs.com/attachments/download/584/puppet-0.24.8.gem
   sudo gem install facter-1.5.6.gem puppet-0.24.8.gem
8) Hardware dongle installed that allows the display mode to be set to 1600x1200x32
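
Once the machines are rebuilt, a rough way to spot-check items 3, 7, and 8
across all ten hosts from a shell prompt.  This assumes the hostnames resolve
and the cltbld credentials or keys are set up; adjust as needed:
---
for i in 001 002 003 004 005 006 007 008 009 010; do
  h=talos-r4-snow-$i
  echo "== $h =="
  ssh cltbld@$h 'sw_vers -productVersion;
    system_profiler SPDisplaysDataType | grep Resolution;
    gem list | grep -E "facter|puppet"'
done
---
sw_vers should report 10.6.8, the Resolution line should show 1600 x 1200
once the dongle is attached, and the gem list confirms item 7.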
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
We're resource-constrained this week and are relying on infra for hands-on work, but we can get to this next week when we have a CA person back in the office.  I spoke with mrz about getting resources from infra, but they are also strapped this week since it's the end of Q3, and we've already borrowed someone for the thunderbird migration emergency. Since we don't want to stand in the way if this is an urgent blocker, he offered to get jhford access to scl1 (if he doesn't currently have it) so he can come perform the work himself.
(In reply to Amy Rich [:arich] from comment #17)
> We're resource-constrained this week and are relying on infra for hands-on
> work, but we can get to this next week when we have a CA person back in the
> office.  I spoke with mrz about getting resources from infra, but they are
> also strapped this week since it's the end of Q3, and we've already borrowed
> someone for the thunderbird migration emergency. Since we don't want to
> stand in the way if this is an urgent blocker, he offered to get jhford
> access to scl1 (if he doesn't currently have it) so he can come perform the
> work himself.

Per discussions with mrz, jhford and myself:
* mrz is setting up colo access for aki, jhford, lsblakk and myself
* jhford will be driving to scl1 first thing in the morning to image these 10 minis.

More info as we have it.
> Per discussions with mrz, jhford and myself:
> * mrz is setting up colo access for aki, jhford, lsblakk and myself

I had no trouble getting into the DC.  I am not sure if that means I have colo access, or if that means that Erica spoke to the security guard.

> * jhford will be driving to scl1 first thing in the morning to image these
> 10 minis.

This is done; the machines are in staging for the medium scale test.

Marking this bug as fixed.  If there are further issues, I will file a new bug.
Assignee: arich → jhford
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Did you reimage talos-r4-snow-010 as well?

Also, I can only ping 2, 4, and 6.  The rest of them appear to be unreachable.
(In reply to Amy Rich [:arich] [:arr] from comment #20)
> Did you reimage talos-r4-snow-010 as well?

Yes

> Also, I can only ping 2, 4, and 6.  The rest of them appear to be
> unreachable.

I just noticed that myself.  They were reachable yesterday, and all of them were talking to their master.  Each of the 10 slaves has done at least one test job.

I am going to file a bug for nagios monitoring of these 10 slaves because, at this point, it is a problem if they go down.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Silly Bugzilla.  I cleared the cache when I refreshed the page, but the old form values stuck around.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
The odd-numbered hosts are not available because, I think, the switch they're in doesn't have its uplink configured correctly.  Erica should be onsite to fix that quickly later today.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations