Closed Bug 760077 Opened 12 years ago Closed 12 years ago

mac-signing1.srv.releng.scl3.mozilla.com reporting down

Categories

(Infrastructure & Operations :: DCOps, task)

x86
macOS
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: sal)

References

Details

(Whiteboard: scl3)

I'm guessing that this machine powered itself off again.  Could someone from dcops power it on and run hardware diagnostics on it, please?  This is one of the two servers that signs all the mac builds, so it's important that it be up, especially in the wee hours of the CA morning when all of the nightly builds are being done.

If we don't already have it, we need to get N+1 redundancy for the mac signing servers, since we have no way to power them back on remotely when they power themselves off (usually when they're most needed, while CA is still asleep).
Sal, could you check on this one today?

:arr, mac-signing1 and mac-signing2 are in the same Sonnet chassis. That means we'll need to take both out of service if we need to physically remove mac-signing1. I think we should fix that, so please think about when we can schedule a downtime.
Scheduling a downtime (currently) would close the tree. Coincidentally, I just filed https://bugzilla.mozilla.org/show_bug.cgi?id=759759 about getting another machine. We'll need to resolve that before we can take mac-signing1 and 2 down.
Depends on: 759759
(In reply to Derek Moore from comment #1)
> Sal, could you check on this one today?

Yeah, I can check on this when I get to scl3 today.
Assignee: server-ops → sespinoza
On dividehex's advice, I had a look at the logs.  There are no smoking guns - logs just stop at 5:20am.

14:23 < dividehex> system_profiler | grep "Boot ROM"
14:23 < dustin>       Boot ROM Version: MM41.0042.B03
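
For anyone repeating the Boot ROM check above across several minis, here is a minimal sketch of pulling the version out of `system_profiler` output. The function names and the sample text are assumptions for illustration only; `SPHardwareDataType` is the standard hardware data type for `system_profiler`, and `read_boot_rom` only works on a Mac.

```python
import re
import subprocess
from typing import Optional

# Matches a line such as "      Boot ROM Version: MM41.0042.B03"
BOOT_ROM_RE = re.compile(r"^\s*Boot ROM Version:\s*(\S+)\s*$", re.MULTILINE)


def parse_boot_rom(profiler_output: str) -> Optional[str]:
    """Return the Boot ROM version found in system_profiler text, or None."""
    match = BOOT_ROM_RE.search(profiler_output)
    return match.group(1) if match else None


def read_boot_rom() -> Optional[str]:
    """Run system_profiler locally (macOS only) and parse its output."""
    result = subprocess.run(
        ["system_profiler", "SPHardwareDataType"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_boot_rom(result.stdout)


# Hypothetical sample shaped like the IRC paste above.
sample = "Hardware:\n\n      Boot ROM Version: MM41.0042.B03\n"
print(parse_boot_rom(sample))
```

Parsing is kept separate from the subprocess call so the same function can be pointed at output collected from several machines and the versions compared.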

Sal's going to try a diagnostic DVD.
Running diagnostic.
Failed the diagnostic test.



Status:

ERROR-Data match failed
-T E S T  F A I L  E D-

[31 May 12[21:09:43 GTM]] •TESTING FAILED•
Severity: critical → major
Whiteboard: scl3
I brought back up the instances on this machine for now, because mac-signing2 is starting to die under the load. Please don't bring this machine down again without notice. Once mac-signing3 and 4 are up (bug 759759) we should be alright again without this one.
This works. Once mac-signing1 and 2 are down and ready to go back up, I would like to swap the chassis they're in. The back plastic is different from the rest of the Mac mini Sonnet chassis we have at scl3.
Okay. mac-signing3 and 4 are up now. We can take down mac-signing1 and 2 without downtime now. Please let me know in advance though, so I can shut down the daemons as gracefully as possible.
mac-signing1 and 2 are down; sending mac-signing1 to desktop for repairs.

Once it comes back we can bring mac-signing1 and 2 back online.
A new Sonnet chassis will be installed for these two once they're ready to go back online.
Severity: major → normal
Sal and I chatted in a privmsg about this. He suggested getting mac-signing2 checked out since it was already down. Now that we have mac-signing3 & 4 we can live without either of mac-signing1 or 2 for now, so we may as well (considering mac-signing2 would have to come back down again anyways when mac-signing1 is back from repair). So, the way I understand things is:
mac-signing3 & 4 are our production machines for now.
mac-signing1 is off for repair; mac-signing2 is headed to desktop for diagnostics.
When mac-signing1 is back from repair, we'll bring mac-signing1 and 2 back into the production pool.
Severity: normal → major
Per IRC, Henry is going to wipe and install Lion prior to sending this off for repair to get rid of all the sensitive data. It'll need 10.6 re-installed when it comes back.
Mac-signing2 has been racked and powered on.  Waiting for someone to take "install.build.releng.scl3" offline so that I can rack "mac-signing1" with that chassis.
"install.build.releng.scl3" is back online.  "mac-signing1" is now re-imaging.
This machine appears to be back online but with the wrong passwords. Any ETA on finishing up the re-imaging process?
Per IRC conversation with Jake Watkins:

4:59 PM < dividehex> i don't seem to have copy of the mac-signing base image in scl3.  I'll have to cp it over and re-image it
I don't think we have a base image for this other than a base OS.
I copied over the correct image and re-imaged mac-signing1 with it (on 7/18).  Make sure you are using the signing and mac-signing users and passwords.  These are different than the releng slave usernames and passwords.
(In reply to Ben Hearsum [:bhearsum] from comment #18)
> I don't think we have a base image for this other than a base OS.

That's correct.  The base image consists only of OS X 10.6.8 with the signing users and passwords (as opposed to the typical releng slave username and password).
Hi,
Is it ok to close out this bug ticket? Is there anything else DCOps can help with?
(In reply to Vinh Hua [:vinh] from comment #21)
> Hi,
> Is it ok to close out this bug ticket? Is there anything else DCOps can help
> with?

I can log in with the correct password now, so I think we're done here - thanks!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations