Closed Bug 1376807 Opened 7 years ago Closed 7 years ago

Autologin is failing on the mac machines

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aobreja, Assigned: dividehex)

References

Details

Attachments

(6 files)

Tested today with the two methods mentioned in "Automatic login" [1]:
- with the script
- by logging in to a machine through VNC and setting up autologin in System Preferences

In both cases I tested on two machines that were loaned to my puppet environment, in which I used the new password.
After puppet ran and the password was changed, the VNC console remained in the same state, so Builder is still set for automatic login (see builder.png).

So I'm not sure why the OS X jobs are not running after the password is changed, given that automatic login apparently remains enabled. The builder passwords and builder_pw_kcpassword_base64 were changed, but we still have issues.



[1] https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/users#Linux
Did some new tests today: loaned some machines and captured some logs of what happens after the autologin password is changed.
It seems that after changing the password we get some failures from loginwindow because the keychain password is not changed. I'm not sure if this should normally happen or whether we need to change something else here. I added two files containing logs from before and after changing the password.
Chris, if you check the logs, can you advise what may be wrong, or whether there is something else we missed when changing the autologin password on OS X? This issue doesn't impact build machines.
Flags: needinfo?(catlee)
I really don't know :(
Flags: needinfo?(catlee)
Dustin, I see that you hit this issue in the past and solved it (https://bugzilla.mozilla.org/show_bug.cgi?id=1036980#c7).
I followed the steps in the documentation and made sure that builder_pw_kcpassword_base64 was changed at the same time as the other OS X builder passwords, and to the same password, but I still hit the autologin problem.
You said that you and Amy logged in and fixed everything by hand; can you tell me exactly what you did?
- Log in to each machine and reboot?
- Was there anything else that needed to be changed? In our case it seems that autologin is not working after the reboot.
Flags: needinfo?(dustin)
Amy and I wouldn't have fixed things manually, so I'm sure it was just a reboot to let puppet run.

My guess would be that there was an issue generating the kcpasswd form of the password so that it didn't match the user's password.  Did you use the perl script linked from https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/users or just manually set up autologin and copy the password from there?
Flags: needinfo?(dustin)
> Did you use the perl script linked from
> https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/users or
> just manually set up autologin and copy the password from there?

I used both methods (the script and the manual setup). In both cases I verified by decrypting the generated kcpassword to be sure it was the right one. Also, when the trees were down for several hours, Kim tried to regenerate the new password, with the same result.
I also added some logs from before and after the password change and puppet run. I'm not sure, but I think I caught some lines in the system log where autologin fails, something related to a keychain reset (see after_changing_pass.png).
If you have any advice, or a better way we can test that autologin is working, please let us know. Thank you.
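
For reference, the decryption check can be done with a short shell sketch like the following (a minimal sketch, assuming the standard 11-byte kcpassword XOR key; run on the mini with root access):

  # decode /etc/kcpassword by XOR-ing it against the well-known magic key;
  # the output should match the new cltbld password exactly
  key=(7D 89 52 23 D2 BC DD EA A3 B9 1F)
  hex=$(sudo xxd -p /etc/kcpassword | tr -d '\n')
  i=0
  while [ $((i * 2)) -lt ${#hex} ]; do
    byte=$((16#${hex:$((i * 2)):2}))
    p=$((byte ^ 16#${key[i % 11]}))
    [ "$p" -eq 0 ] && break        # the XOR'd padding decodes to NUL
    printf "\\$(printf '%03o' "$p")"
    i=$((i + 1))
  done; echo

Any mismatch between this output and the cltbld password would mean the builder_pw_kcpassword_base64 hiera value was generated incorrectly.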
Flags: needinfo?(dustin)
Ah, good catch -- if the keychain is generated and does not match kcpasswd, autologin will fail.  I *think* there's support in puppet to delete that keychain (we don't store anything in it, so deleting is the easy option), but perhaps it's not working correctly?

Looking a bit further, I see the rm in modules/users/manifests/builder/account.pp, so looking at how or whether that worked might be a good place to start.
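
A quick manual check on an affected host (the cltbld home path and keychain file name here are assumptions for 10.10):

  # after a puppet run, see whether the rm in account.pp actually fired;
  # a surviving login keychain means autologin will keep failing
  ls -l /Users/cltbld/Library/Keychains/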
Flags: needinfo?(dustin)
Did some investigation but with no results: the deletion from modules/users/manifests/builder/account.pp is happening, but autologin is still not working.
Amy's advice was to ask Jake for an opinion. To summarize #c0:
- cltbld passwords were changed for OS X
- the autologin password was also changed (using both the script and the console method; see [1])
- the resulting output hash was added to hiera (note: all changes for cltbld and autologin were done at the same time)

The issue is that even though the autologin and cltbld passwords were changed, and we could see the new password in /etc/kcpassword, autologin was failing and the jobs were not running, even after a reboot.
after_changing_pass.png shows that there seems to be an issue with the keychain reset.



[1] https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/users#Darwin
Flags: needinfo?(jwatkins)
:aobreja Do you have a host I can poke at?
Flags: needinfo?(jwatkins)
Sure Jake, use any of these machines for your tests: t-yosemite-r7-(0388,0389,0390).
Found it!

https://hg.mozilla.org/build/puppet/rev/e0e4c33fd12c#l6.151

This is fallout from the big style refactor (bug 1368935).  In this case, the keychain is not being deleted because the single-quoted string fails to interpolate $home.  And going back, it looks like I missed this one in the review. :-(
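
For illustration, shell follows the same single-versus-double quote interpolation rule as Puppet (this is not the literal line from account.pp):

  home=/Users/cltbld

  # broken: single quotes suppress interpolation, so this tries to remove a
  # literal path named '$home/Library/Keychains' and leaves the real one alone
  rm -rf '$home/Library/Keychains'

  # fixed: double quotes interpolate, so the real keychain directory is removed
  rm -rf "$home/Library/Keychains"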
Assignee: nobody → jwatkins
Attachment #8887271 - Flags: review?(nthomas)
Attachment #8887271 - Flags: review?(nthomas) → review+
Updated today:

root_pw_paddedsha1!low-security
builder_pw_paddedsha1
root_pw_saltedsha512!low-security 
builder_pw_saltedsha512
root_pw_pbkdf2_iterations!low-security
builder_pw_pbkdf2_iterations
root_pw_pbkdf2_salt!low-security
builder_pw_pbkdf2_salt
builder_pw_kcpassword_base64
root_pw_pbkdf2!low-security
builder_pw_pbkdf2

But we rolled back the changes, as we got lots of pending jobs and no finished jobs.
This patch causes the OS X builder password change and the keychain removal to trigger the reboot semaphore.  This should prevent the need for a second manual reboot when changing the password.

:dustin, I try not to flag you for review on puppet matters, but since you are still an SME on this and it affects tc, it seems appropriate. :-)
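
Roughly, the intended effect, expressed as shell (the semaphore path here is hypothetical; the real one lives in the puppet repo):

  home=/Users/cltbld
  SEMAPHORE=/var/lib/puppet/reboot_semaphore   # hypothetical path

  # when the builder password changes, remove the stale keychain and ask
  # the at-boot wrapper to reboot the host once more
  if [ -d "$home/Library/Keychains" ]; then
    rm -rf "$home/Library/Keychains"
    touch "$SEMAPHORE"
  fi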
Attachment #8888115 - Flags: review?(dustin)
Attachment #8888115 - Attachment is patch: true
Comment on attachment 8888115 [details] [diff] [review]
Make sure reboot flags get put up when changing builder passwd and killing keychain

Review of attachment 8888115 [details] [diff] [review]:
-----------------------------------------------------------------

Does this need to be done for any of the other modules/user/*/account.pp?
Attachment #8888115 - Flags: review?(dustin) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #18)
> Comment on attachment 8888115 [details] [diff] [review]
> Make sure reboot flags get put up when changing builder passwd and killing
> keychain
> 
> Review of attachment 8888115 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Does this need to be done for any of the other modules/user/*/account.pp?

I would assume so. :-)
I did some tests on Friday with the new changes. I loaned some machines from buildbot and taskcluster to my environment to test the autologin, but it seems that the problem is not solved after changing the password: no jobs are running after the change, and on the taskcluster machines no generic-worker is running. Forcing a reboot a couple of times also did not change the state of these machines.
Did some new tests and got some results. The strange part is that on some machines generic-worker is present after pinning to my environment and on some it is not, but this could be caused by the low number of pending jobs.

I have tasks that ran fine after the changes, like:
-t-yosemite-r7-0190 (https://tools.taskcluster.net/groups/KO-5qvbwSvqzJgYt5KGlMg/tasks/YmawkbQTSwaKR5gy1K48Eg/runs/0)
-t-yosemite-r7-0150 (tasks IboBK8H6Ro2rr2YXZz9kmw, YnA4zznyRxGrquaIJIY7OQ)

And machines where generic-worker is not running:
-t-yosemite-r7-0030
I pinned these machines to my environment: t-yosemite-r7-(0030,0190,0150,0191,0020,0021), and I will check again tomorrow for new results. So far the situation looks promising, so if everything is working maybe we can make the changes in production.
Attached image generic-worker.PNG
The results are somewhat confusing. I was hoping that by today more of the machines pinned to my environment would begin taking jobs, but this didn't happen.
While machines like t-yosemite-r7-(0190,0159) that are in taskcluster continue taking jobs, along with the buildbot ones t-yosemite-r7-(0020,0021), the rest are blocked, and since no jobs run on them they remain in the same state.

The machines that do take jobs also have generic-worker present in the process list (see generic-worker.PNG), while the others don't.

I tested and checked the logs, but I don't know why these machines don't start taking jobs when we have around 2000 pending, and I don't know what triggers those jobs. Amy, if you or Jake have any more ideas, please have a look at the logs below and update the bug if you find something; you can also run any tests on these machines:

https://papertrailapp.com/systems/178509233/events
https://papertrailapp.com/systems/178504023/events
https://papertrailapp.com/systems/118426583/events?focus=826180397720305716
https://papertrailapp.com/systems/156827283/events?q=generic&focus=826476780092293123
https://papertrailapp.com/systems/156863723/events?focus=826482816299728905
https://papertrailapp.com/systems/156863723/events?focus=826485957204279344
Flags: needinfo?(jwatkins)
Flags: needinfo?(arich)
It doesn't look like generic-worker is running, and you need to debug why. In order to do that, make sure that puppet has completed successfully by going through the script /usr/local/bin/run-puppet.sh (it looks like it had on t-yosemite-r7-0176, since the semaphore file was there). If the semaphore file is there at the end, then it should run the script that starts either buildbot or the generic-worker (depending on the machine).

I did a grep -r for the semaphore file name in the puppet code and found the script it calls to start the worker. If you do the same, you will find modules/generic_worker/templates/generic-worker.plist.erb, and if you read that you will find the script it then tries to run on this machine, /usr/local/bin/run-generic-worker.sh, along with the arguments it's called with. Work from that to debug why generic-worker is not running. Reach out to the taskcluster team on #taskcluster to help you debug if you get stuck.
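
Something like the following, run from a puppet checkout and on one of the stuck minis (the exact command forms are suggestions, not taken from this bug):

  # in the puppet repo: find the semaphore name and the worker startup pieces
  grep -r semaphore modules/ | grep -i generic

  # on the mini: is launchd loading the generic-worker job at all?
  sudo launchctl list | grep -i generic

  # is the worker process alive, and what has it logged via syslog?
  ps -ef | grep generic-worker
  grep generic-worker /var/log/system.log | tail -n 50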
Flags: needinfo?(jwatkins)
Flags: needinfo?(arich)
Blocks: 1382360
Tests revealed that additional reboots can trigger "run-generic-worker.sh", and then generic-worker begins to work.
I suspect the issue is related to the "reboot_semaphore": when I ran puppet agent to pin a host to my environment, the host was never restarted; I remained connected to it.
This is what should happen:
- the password is changed in hiera -> the next time a mini reboots, it runs puppet -> during the puppet run, it removes the cltbld keychain file, then tells the machine to reboot -> the machine reboots and runs puppet again (successfully) and cltbld autologs in -> after the puppet run, it creates a semaphore file that buildbot/taskcluster look for to start the worker -> the CI system starts
This is what is happening:
- the password is changed in hiera -> the next time a mini reboots, it runs puppet -> during the puppet run, it removes the cltbld keychain file, but the machine is not rebooted -> so it needs an additional reboot -> the machine reboots and runs puppet again (successfully) and cltbld autologs in -> after the puppet run, it creates a semaphore file that buildbot/taskcluster look for to start the worker -> the CI system starts

Jake, can you please check the "reboot_semaphore", or should we add a "reboot" action at the end of run-puppet.sh?
It is also possible that, when I test with my environment, the first run does not reboot the machine after puppet ran, and the actual state could just work, but we can't risk another bustage.
Flags: needinfo?(jwatkins)
We had a mantra a while back: "how do you run puppet? sudo reboot."  The same applies for pinned environments.  When you're working on a module and looking for syntax errors, running it manually is OK, but when you want to verify how something will work on workers in production, always reboot.
Dustin is correct.  You are going to need to initiate a reboot manually after updating the passwords in hiera.  Understand that, by design, puppet is not intended to actually trigger the reboot itself.  The reboot semaphore logic is executed during boot.

This is the logic of rebooting, running puppet at boot, and using the semaphore:

manually reboot host -> run puppet in a loop until it successfully completes -> if puppet created a reboot semaphore, remove the semaphore and reboot; otherwise continue booting

There is already a reboot action at the end of run-puppet.sh. That is the entire purpose of the reboot semaphore.
https://hg.mozilla.org/build/puppet/file/tip/modules/puppet/templates/puppet-atboot-common.erb#l77
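
Paraphrased as shell (the real logic is in the template linked above; the semaphore path here is hypothetical):

  SEMAPHORE=/var/lib/puppet/reboot_semaphore   # hypothetical path

  # at boot: run puppet in a loop until it completes successfully
  until /usr/bin/puppet agent --onetime --no-daemonize; do
    sleep 60
  done

  # if that run raised the reboot semaphore, clear it and reboot now;
  # otherwise continue booting and let launchd start buildbot/generic-worker
  if [ -f "$SEMAPHORE" ]; then
    rm -f "$SEMAPHORE"
    /sbin/reboot
  fi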


So, assuming you are rebooting after the hiera change, does the host fail to set the reboot semaphore after the keychain is deleted, and/or fail to observe the reboot semaphore's request to reboot?
Flags: needinfo?(jwatkins)
The hiera change shouldn't be picked up until the machine is rebooted (it won't run puppet till then), so there has to be a reboot in there by definition for this to have broken, I think. This may not be true for Andrei's test cases, but it should be true in production. Andrei, in your test case, pin it to your env, then make the hiera change, then reboot the machine without running puppet manually. Does that work or fail?
I tested differently today: first re-imaged t-yosemite-r7-(0190,0160,0030,0191,0150,0020,0021), then pinned them to my environment while the environment had no changes yet. I monitored them and saw some jobs start; after that I changed my environment to use the new hiera secrets with all the changes, then continued monitoring without doing any manual reboot.
After finishing the existing job, the machines rebooted themselves, puppet ran, and after that generic-worker began running too (for the machines on taskcluster).
ex.
[root@t-yosemite-r7-0191.test.releng.scl3.mozilla.com ~]# ps -ef | grep generic
   0:00.00 /bin/bash /usr/local/bin/run-generic-worker.sh run --config /etc/generic-worker.config
 0:00.21 /usr/local/bin/generic-worker run --config /etc/generic-worker.config
  0:00.01 logger -t generic-worker -s

Also checked some jobs and they seem to work fine, and the process continued after each finished job:

ex. https://tools.taskcluster.net/groups/I2PwyVT-T4ePqreuAwkBig/tasks/LWrzSQJiQhGRsDaI-UGtqw/runs/0

All this suggests that we can make the changes in production; everything should work fine.
Amy, if there is nothing else that you think may go wrong, I can make the changes in production.
Flags: needinfo?(arich)
You should coordinate with the taskcluster team and make sure there's a clear understanding of what success/failure looks like and how to monitor for that. I suspect that the taskcluster team will need to be onhand to help monitor/troubleshoot, so the time that you choose to make this change will be partially dependent on availability of someone from that team (we need to make sure we have a designated person). There should be constant monitoring for failure after the change is made to identify any failure markers early on. You should also have a rollback plan (and understand when to enact that plan) in the event that it does not succeed.

You should then communicate the intention to make this change and the time at which we plan to do it to firefox-ci@mozilla.com.
Flags: needinfo?(arich)
We plan to do this change again tomorrow in the European morning, since we don't have too many jobs running then. I spoke with :wcosta and he will help me after the change to make sure that the machines are taking jobs and everything is working as expected.

We will keep an eye on the pending jobs for OS X tests ([1]) and on the pool ([2]) right after the change; a polling sketch follows the links.

[1] https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010
[2] https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-yosemite-r7
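
A simple way to watch [1] is to poll it (the endpoint just returns a small JSON document with the pending count):

  # re-fetch the pending count for the osx test pool every minute
  watch -n 60 'curl -s https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010'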

Mihai, if the timing is too close to the release and you think this change may affect production, please post here and we will discuss a later schedule.
Flags: needinfo?(mtabara)
Please verify with releaseduty to make sure there are no critical releases in flight before making changes tomorrow.
(In reply to Amy Rich [:arr] [:arich] from comment #32)
> Please verify with releaseduty to make sure there are no critical releases
> in flight before making changes tomorrow.

Spoke with Sylvestre and the rest of the releaseduty quorum, and we decided to postpone this until next week, when the waters will hopefully be calmer.
Really sorry for the short notice to everyone involved. Andrei will be on PTO next week, but buildduty can cover this, or we can wait until his return; we'll see how it goes.

Thank you and sorry again.
Flags: needinfo?(mtabara)
Status update on this bug: Andrei is on PTO this week. He left instructions on how to do this with both :spacurar and :aselagea, but I'd personally prefer to postpone to next week. We still have 56.0b3, 56.0b5 and 55.0.2 in the pipeline this week. Since this is internally driven, I think we can safely postpone and attempt this next Tuesday.
The change was rescheduled for tomorrow; wcosta and I will focus on the things mentioned in #c31.
Password was changed for:

root_pw_paddedsha1
root_pw_saltedsha512
root_pw_pbkdf2_iterations
root_pw_pbkdf2_salt
root_pw_pbkdf2!low-security
builder_pw_pbkdf2
builder_pw_pbkdf2_salt
builder_pw_pbkdf2_iterations
builder_pw_paddedsha1
builder_pw_saltedsha512
builder_pw_kcpassword_base64

The machines began taking jobs after being rebooted. We kept monitoring the jobs and so far all seems well.
Everything seems to be fine after yesterday's changes, so I'll mark this bug as resolved.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
No longer depends on: 1393007
See Also: → 1393007
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard