Closed Bug 1376584 Opened 3 years ago Closed 3 years ago

OS X Yosemite machines no longer working after reboot

Categories

(Infrastructure & Operations :: CIDuty, task, P1, blocker)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Assigned: kmoir)

References

Details

Starting between 6:30am and 7:30am PT, machines completed tasks, rebooted, and then never came back to claim tasks.

After digging in, it seems the machines didn't autologin. Talking with kmoir, this might be related to a password rotation done around the same time this morning.
looking at 

https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/users#Automatic_Login

It appears that the value of
base64 < /etc/kcpassword

and the value stored in hiera match (the hiera password secret is deployed via puppet).

continuing to investigate
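
For reference, here is a minimal sketch of that comparison as run on a worker (the hiera_value placeholder stands in for the builder_pw_kcpassword_base64 secret from hiera, which is not reproduced here):

import base64

# Hypothetical placeholder for the builder_pw_kcpassword_base64 value from hiera.
hiera_value = "PASTE_BASE64_FROM_HIERA"

# Roughly equivalent to running `base64 < /etc/kcpassword` on the worker
# (minus any line wrapping the shell tool adds).
with open("/etc/kcpassword", "rb") as f:
    on_disk = base64.b64encode(f.read()).decode("ascii")

print("match" if on_disk == hiera_value.strip() else "mismatch")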
Depends on: 1366019
AFAIK, after we change the cltbld password, we need to regenerate the content of /etc/kcpassword. The most straightforward way I know is to log in to a machine through VNC, set up autologin in System Preferences, and run base64 /etc/kcpassword. Then apply the output to the builder_pw_kcpassword_base64 hiera config.
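
If we ever want to skip the VNC round trip, here is a rough sketch of generating that blob directly. It assumes the widely documented 11-byte XOR key macOS uses for /etc/kcpassword; padding rules differ between OS X releases, so the System Preferences approach above is still the safer route:

import base64
import itertools

# The 11-byte XOR key macOS has historically used for /etc/kcpassword
# (widely documented, but treat this as an untested sketch).
KEY = bytes([0x7D, 0x89, 0x52, 0x23, 0xD2, 0xBC, 0xDD, 0xEA, 0xA3, 0xB9, 0x1F])

def kcpassword(password):
    data = password.encode("utf-8")
    # Pad with NULs to a multiple of 12 bytes; exact padding varies by release,
    # which is why reading the file back after System Preferences is more reliable.
    data += b"\x00" * (12 - len(data) % 12)
    return bytes(b ^ k for b, k in zip(data, itertools.cycle(KEY)))

# base64 output to paste into the builder_pw_kcpassword_base64 hiera config
# ("new-cltbld-password" is a placeholder, not a real credential).
print(base64.b64encode(kcpassword("new-cltbld-password")).decode("ascii"))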
Severity: normal → blocker
Priority: -- → P1
None of this worked

However, we were able to revert the password changes to the ones that were in place this morning, and now the machines are puppetizing and able to start generic-worker.

I will advise the sheriffs when we have all machines up; they will be able to determine when we have cleared enough backlog to reopen the trees.
As a note, we have to reboot all the machines twice:
1) once to pick up the puppet change
2) once to get autologin working again
Everything should be back in action again now, except for a few misbehaving workers which I'll chase down as slave health catches them. There's a backlog of 14250 jobs on taskcluster to burn down, and another 475 on buildbot, but this is back in the hands of the sheriffs at this point.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
TIL the taskcluster pending count can be found at https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010. Down to 13600 but bouncing around - try is open while other trees are closed.
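
For anyone else watching the backlog, a quick sketch of polling that endpoint (this assumes the JSON response exposes the count as a pendingTasks field):

import json
import urllib.request

# Endpoint from the comment above.
URL = "https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# Fall back to printing the whole payload if the field name turns out to differ.
print(data.get("pendingTasks", data))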
I'm not sure if 0340 and 0350 have had some problems, but it's been 16 hours and 3 hours, respectively, since they last resolved a task.
Those two both got re-imaged, so they may not have come back up cleanly. I reopened bug 1342221 for 0340, and have rebooted 0350 to see if it can stick at doing work.

TC pending down to 8700.
Will reopen trees once production trees have caught up, but will monitor the tests.
I talked to Andrei about this issue this morning. He borrowed a loaner, pinned it to his puppet env and hiera env, made the password change, and confirmed that autologin still works (bug 1366019).

We also need to fix the nagios alerts for tc pending (bug 1373289).

There is also work remaining to fix machine health (bug 1367448).

Maybe we should have alerts for autologin failures in papertrail, or perhaps the pending count alerts in nagios should take care of this.
I have an alert set up on our side related to pending wait times (not pending count) that I can add you to if that helps.

I can also help hack on the nagios script if someone wants to help me with that. I have a copy of our existing nagios check, but I have never actually written a nagios check, nor am I completely aware of what releng needs out of this check other than checking pending counts for the worker types it cares about.
Here is the code for the nagios checks:

Kims-MacBook-Pro:braindump kmoir$ hg path
default = ssh://hg.mozilla.org/build/braindump/
try = ssh://hg.mozilla.org/try
Kims-MacBook-Pro:braindump kmoir$ cd nagios-related/
Kims-MacBook-Pro:nagios-related kmoir$ ls
README				check_backlog_age.py		check_bouncer			check_bouncer.rst		check_pending_builds.py		test_check_pending_builds.py

Alin has modified this before and would probably have advice on this work.
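
For discussion, here is a rough sketch of what a taskcluster pending-count check could look like alongside check_pending_builds.py; the worker type and thresholds are placeholders, not what releng actually wants, and the pendingTasks field name is assumed from the queue endpoint above:

#!/usr/bin/env python
# Sketch of a nagios-style check for taskcluster pending counts.
# Exit codes follow the nagios plugin convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
import json
import sys
import urllib.request

# Placeholder worker type and thresholds; real values would come from releng.
QUEUE = "https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010"
WARN, CRIT = 2000, 5000

def main():
    try:
        with urllib.request.urlopen(QUEUE, timeout=30) as resp:
            pending = json.load(resp).get("pendingTasks")
    except Exception as e:
        print("UNKNOWN: could not query queue: %s" % e)
        return 3
    if pending is None:
        print("UNKNOWN: unexpected response from queue")
        return 3
    if pending >= CRIT:
        print("CRITICAL: %d pending tasks" % pending)
        return 2
    if pending >= WARN:
        print("WARNING: %d pending tasks" % pending)
        return 1
    print("OK: %d pending tasks" % pending)
    return 0

if __name__ == "__main__":
    sys.exit(main())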
Blocks: 1366019
No longer depends on: 1366019
The pending counts are down to around 3000. We'll work on fixing the underlying issues that caused this outage.
Assignee: nobody → kmoir
Product: Release Engineering → Infrastructure & Operations