Closed
Bug 760077
Opened 12 years ago
Closed 12 years ago
mac-signing1.srv.releng.scl3.mozilla.com reporting down
Categories
(Infrastructure & Operations :: DCOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Assigned: sal)
References
Details
(Whiteboard: scl3)
I'm guessing that this machine powered itself off again. Could someone from dcops power it on and run hardware diagnostics on it, please? This is one of the two servers that sign all the mac builds, so it's important that it be up, especially in the wee hours of the CA morning when all of the nightly builds are being done. If we don't already have it, we need to get N+1 redundancy for the mac signing servers, since we have no way to power them back on remotely when they power themselves off (usually when they're most needed, while CA is still asleep).
Comment 1•12 years ago
Sal, could you check on this one today? :arr, mac-signing1 and mac-signing2 are in the same Sonnet chassis. That means we'll need to take both out of service if we need to physically remove mac-signing1. I think we should fix that, so please think about when we can schedule a downtime.
Comment 2•12 years ago
Scheduling a downtime (currently) would close the tree. Coincidentally, I just filed https://bugzilla.mozilla.org/show_bug.cgi?id=759759 about getting another machine. We'll need to resolve that before we can take mac-signing1 and 2 down.
Comment 4•12 years ago
(In reply to Derek Moore from comment #1)
> Sal, could you check on this one today?

Yeah, I can check on this when I get to scl3 today.
Assignee: server-ops → sespinoza
Comment 5•12 years ago
On dividehex's advice, I had a look at the logs. There are no smoking guns - logs just stop at 5:20am.
14:23 < dividehex> system_profiler | grep "Boot ROM"
14:23 < dustin> Boot ROM Version: MM41.0042.B03
Sal's going to try a diagnostic DVD.
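For anyone who wants to collect the Boot ROM version from several minis without eyeballing grep output, here is a minimal sketch of the same check in Python. The function names are hypothetical, and the `Boot ROM Version:` label is taken only from the output pasted above; other OS X releases may label the field differently. Restricting the query to `SPHardwareDataType` is an assumption made to keep the run short; the command in the IRC log runs the full report.

```python
import re
import subprocess

# Matches a line such as "Boot ROM Version: MM41.0042.B03" (label as seen in this bug).
BOOT_ROM_RE = re.compile(r"^\s*Boot ROM Version:\s*(\S+)\s*$", re.MULTILINE)


def extract_boot_rom_version(profiler_output):
    """Return the Boot ROM version found in system_profiler text, or None."""
    match = BOOT_ROM_RE.search(profiler_output)
    return match.group(1) if match else None


def read_boot_rom_version():
    """Run system_profiler on the local Mac and return its Boot ROM version.

    Assumption: only the hardware section is queried, not the full report.
    """
    output = subprocess.run(
        ["system_profiler", "SPHardwareDataType"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return extract_boot_rom_version(output)
```

`extract_boot_rom_version` works on any captured text, so it can also be fed output gathered over ssh; `read_boot_rom_version` only makes sense when run on the Mac itself.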
Comment 6•12 years ago
Running diagnostic.
Comment 7•12 years ago
Failed the diagnostic test.
Status: ERROR-Data match failed
-T E S T F A I L E D- [31 May 12[21:09:43 GTM]]
•TESTING FAILED•
Severity: critical → major
Updated•12 years ago
Whiteboard: scl3
Comment 8•12 years ago
I brought the instances on this machine back up for now, because mac-signing2 is starting to die under the load. Please don't bring this machine down again without notice. Once mac-signing3 and 4 are up (bug 759759) we should be alright again without this one.
Comment 9•12 years ago
This works. Once mac-signing1 and 2 are down and ready to go back up, I would like to swap the chassis they're in. The back plastic is different from the rest of the Mac mini Sonnet chassis we have at scl3.
Comment 10•12 years ago
Okay. mac-signing3 and 4 are up now. We can take down mac-signing1 and 2 without downtime now. Please let me know in advance though, so I can shut down the daemons as gracefully as possible.
Comment 11•12 years ago
mac-signing1 and 2 are down; sending mac-signing1 to desktop for repairs. Once it comes back we can bring mac-signing1 and 2 back online. A new Sonnet chassis will be installed for these two once they're ready to go back online.
Severity: major → normal
Comment 12•12 years ago
Sal and I chatted in a privmsg about this. He suggested getting mac-signing2 checked out since it was already down. Now that we have mac-signing3 & 4 we can live without either of mac-signing1 or 2 for now, so we may as well (considering mac-signing2 would have to come back down again anyways when mac-signing1 is back from repair).

So, the way I understand things is:
mac-signing3 & 4 are our production machines for now.
mac-signing1 is off for repair, mac-signing2 is headed to desktop for diagnostics.
When mac-signing1 is back from repair we'll bring mac-signing1 and 2 back into the production pool.
Severity: normal → major
Comment 13•12 years ago
Per IRC, Henry is going to wipe and install Lion prior to sending this off for repair to get rid of all the sensitive data. It'll need 10.6 re-installed when it comes back.
Comment 14•12 years ago
Mac-signing2 has been racked and powered on. Waiting for someone to take "install.build.releng.scl3" offline so that I can rack "mac-signing1" with that chassis.
Comment 15•12 years ago
"install.build.releng.scl3" is back online. "mac-signing1" is now re-imaging.
Comment 16•12 years ago
This machine appears to be back online but with the wrong passwords. Any ETA on finishing up the re-imaging process?
Comment 17•12 years ago
Per IRC conversation with Jake Watkins:
dividehex 4:59 PM
i don't seem to have copy of the mac-signing base image in scl3. I'll have to cp it over and re-image it
Comment 18•12 years ago
I don't think we have a base image for this other than a base OS.
Comment 19•12 years ago
I copied over the correct image and re-imaged mac-signing1 with it (on 7/18). Make sure you are using the signing and mac-signing users and passwords. These are different than the releng slave usernames and passwords.
Comment 20•12 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #18)
> I don't think we have a base image for this other than a base OS.

That's correct. The base image consists only of OS X 10.6.8 with the signing users and passwords (as opposed to the typical releng slave username and password).
Comment 21•12 years ago
Hi, Is it ok to close out this bug ticket? Is there anything else DCOps can help with?
Comment 22•12 years ago
(In reply to Vinh Hua [:vinh] from comment #21)
> Hi,
> Is it ok to close out this bug ticket? Is there anything else DCOps can help
> with?

I can log in with the correct password now, so I think we're done here - thanks!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → Infrastructure & Operations