Closed Bug 1376807 Opened 7 years ago Closed 7 years ago

Autologin is failing on the mac machines

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aobreja, Assigned: dividehex)

References

Details

Attachments

(6 files)

Tested today with the two methods mentioned in "Automatic login" [1]:
- with the script
- by logging in to a machine through VNC and setting up autologin in System Preferences

In both cases I tested on two machines that were loaned to my puppet environment, in which I used the new password.
After puppet ran and the password was changed, the VNC console remained in the same state, so Builder is still set for automatic login (see builder.png).

So I'm not sure why the OS X jobs are not running after the password is changed, given that automatic login apparently remains enabled. The builder passwords and builder_pw_kcpassword_base64 were changed, but we still have issues.



[1] https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/users#Linux
Did some new tests today: loaned some machines and captured some logs of what happens after the autologin password is changed.
It seems that after changing the password we get some failures from loginwindow because the keychain password is not changed. I'm not sure if this should normally happen or whether we need to change something else here. I added two files containing logs from before and after changing the password.
Chris, if you check the logs, can you advise what may be wrong, or whether there is something else we missed when changing the autologin password on OS X? This issue doesn't impact build machines.
Flags: needinfo?(catlee)
I really don't know :(
Flags: needinfo?(catlee)
Dustin, I see that you hit this issue in the past and solved it (https://bugzilla.mozilla.org/show_bug.cgi?id=1036980#c7).
I followed the steps in the documentation and made sure that builder_pw_kcpassword_base64 was changed at the same time as the other OS X builder passwords, and to the same password, but I still hit the autologin problem.
You said that you and Amy logged in and fixed everything by hand; can you tell me exactly what you did?
- Log in to each machine and reboot?
- Was there anything else that needed to be changed? In our case it seems that autologin is not working after the reboot.
Flags: needinfo?(dustin)
Amy and I wouldn't have fixed things manually, so I'm sure it was just a reboot to let puppet run.

My guess would be that there was an issue generating the kcpasswd form of the password so that it didn't match the user's password.  Did you use the perl script linked from https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/users or just manually set up autologin and copy the password from there?
Flags: needinfo?(dustin)
> Did you use the perl script linked from
> https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/users or
> just manually set up autologin and copy the password from there?

I used both methods (the script and the manual setup). In both cases I verified by decrypting the generated kcpassword to be sure it was the right one. Also, when the trees were down for several hours, Kim tried to regenerate the new password, with the same result.
I also added some logs from before and after the password change and puppet run. I'm not sure, but I think I caught some lines in the system log where autologin fails, something related to a keychain reset (see after_changing_pass.png).
If you have any advice, or a better way we can test that autologin is working, please let us know. Thank you.
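
For reference, the decryption check can be done with a short shell sketch like the following (a minimal sketch, assuming the standard 11-byte kcpassword XOR key; run on the mini with root access):

  # decode /etc/kcpassword by XOR-ing it against the well-known magic key;
  # the output should match the new cltbld password exactly
  key=(7D 89 52 23 D2 BC DD EA A3 B9 1F)
  hex=$(sudo xxd -p /etc/kcpassword | tr -d '\n')
  i=0
  while [ $((i * 2)) -lt ${#hex} ]; do
    byte=$((16#${hex:$((i * 2)):2}))
    p=$((byte ^ 16#${key[i % 11]}))
    [ "$p" -eq 0 ] && break        # the XOR'd padding decodes to NUL
    printf "\\$(printf '%03o' "$p")"
    i=$((i + 1))
  done; echo

Any mismatch between this output and the cltbld password would mean the builder_pw_kcpassword_base64 hiera value was generated incorrectly.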
Flags: needinfo?(dustin)
Ah, good catch -- if the keychain is generated and does not match kcpasswd, autologin will fail.  I *think* there's support in puppet to delete that keychain (we don't store anything in it, so deleting is the easy option), but perhaps it's not working correctly?

Looking a bit further, I see the rm in modules/users/manifests/builder/account.pp, so looking at how or whether that worked might be a good place to start.
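
A quick manual check on an affected host (the cltbld home path and keychain file name here are assumptions for 10.10):

  # after a puppet run, see whether the rm in account.pp actually fired;
  # a surviving login keychain means autologin will keep failing
  ls -l /Users/cltbld/Library/Keychains/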
Flags: needinfo?(dustin)
Did some investigation but with no results: the deletion from modules/users/manifests/builder/account.pp is happening, but autologin is still not working.
Amy's advice was to ask Jake for an opinion. To summarize #c0:
- cltbld passwords were changed for OS X
- the autologin password was also changed (using both the script and the console method; see [1])
- the resulting output hash was added to hiera (note: all changes for cltbld and autologin were done at the same time)

The issue is that even though the autologin and cltbld passwords were changed, and we could see the new password in /etc/kcpassword, autologin was failing and the jobs were not running, even after a reboot.
after_changing_pass.png shows that there seems to be an issue with the keychain reset.



[1] https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/users#Darwin
Flags: needinfo?(jwatkins)
:aobreja Do you have a host I can poke at?
Flags: needinfo?(jwatkins)
Sure Jake, use any of these machines for your tests: t-yosemite-r7-(0388,0389,0390).
Found it!

https://hg.mozilla.org/build/puppet/rev/e0e4c33fd12c#l6.151

This is fallout from the big style refactor (bug 1368935).  In this case, the keychain is not being deleted because the single-quoted string fails to interpolate $home.  And going back, it looks like I missed this one in the review. :-(
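
For illustration, shell follows the same single-versus-double quote interpolation rule as Puppet (this is not the literal line from account.pp):

  home=/Users/cltbld

  # broken: single quotes suppress interpolation, so this tries to remove a
  # literal path named '$home/Library/Keychains' and leaves the real one alone
  rm -rf '$home/Library/Keychains'

  # fixed: double quotes interpolate, so the real keychain directory is removed
  rm -rf "$home/Library/Keychains"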
Assignee: nobody → jwatkins
Attachment #8887271 - Flags: review?(nthomas)
Attachment #8887271 - Flags: review?(nthomas) → review+
Updated today:

root_pw_paddedsha1!low-security
builder_pw_paddedsha1
root_pw_saltedsha512!low-security 
builder_pw_saltedsha512
root_pw_pbkdf2_iterations!low-security
builder_pw_pbkdf2_iterations
root_pw_pbkdf2_salt!low-security
builder_pw_pbkdf2_salt
builder_pw_kcpassword_base64
root_pw_pbkdf2!low-security
builder_pw_pbkdf2

But we rolled back the changes, as we got lots of pending jobs and no finished jobs.
This patch causes the OS X builder password change and the keychain removal to trigger the reboot semaphore.  This should prevent the need for a second manual reboot when changing the password.

:dustin, I try not to flag you for review on puppet matters, but since you are still an SME on this and it affects tc, it seems appropriate. :-)
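
Roughly, the intended effect, expressed as shell (the semaphore path here is hypothetical; the real one lives in the puppet repo):

  home=/Users/cltbld
  SEMAPHORE=/var/lib/puppet/reboot_semaphore   # hypothetical path

  # when the builder password changes, remove the stale keychain and ask
  # the at-boot wrapper to reboot the host once more
  if [ -d "$home/Library/Keychains" ]; then
    rm -rf "$home/Library/Keychains"
    touch "$SEMAPHORE"
  fi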
Attachment #8888115 - Flags: review?(dustin)
Attachment #8888115 - Attachment is patch: true
Comment on attachment 8888115 [details] [diff] [review]
Make sure reboot flags get put up when changing builder passwd and killing keychain

Review of attachment 8888115 [details] [diff] [review]:
-----------------------------------------------------------------

Does this need to be done for any of the other modules/user/*/account.pp?
Attachment #8888115 - Flags: review?(dustin) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #18)
> Comment on attachment 8888115 [details] [diff] [review]
> Make sure reboot flags get put up when changing builder passwd and killing
> keychain
> 
> Review of attachment 8888115 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Does this need to be done for any of the other modules/user/*/account.pp?

I would assume so. :-)
I did some tests on Friday with the new changes. I loaned some machines from buildbot and taskcluster to my environment to test the autologin, but it seems that the problem is not solved after changing the password: no jobs are running after the change, and on the taskcluster machines no generic-worker is running. Forcing a reboot a couple of times also did not change the state of these machines.
Did some new tests and got some results. The strange part is that on some machines generic-worker is present after pinning to my environment and on some it is not, but this could be caused by the low number of pending jobs.

I have tasks that ran fine after the changes, like:
-t-yosemite-r7-0190 (https://tools.taskcluster.net/groups/KO-5qvbwSvqzJgYt5KGlMg/tasks/YmawkbQTSwaKR5gy1K48Eg/runs/0)
-t-yosemite-r7-0150 (tasks IboBK8H6Ro2rr2YXZz9kmw, YnA4zznyRxGrquaIJIY7OQ)

And machines where generic-worker is not running:
-t-yosemite-r7-0030
I pinned these machines to my environment: t-yosemite-r7-(0030,0190,0150,0191,0020,0021), and I will check again tomorrow for new results. So far the situation looks promising, so if everything is working maybe we can make the changes in production.
Attached image generic-worker.PNG
The results are somewhat confusing. I was hoping that by today more of the machines pinned to my environment would begin taking jobs, but this didn't happen.
While machines like t-yosemite-r7-(0190,0159) that are in taskcluster continue taking jobs, along with the buildbot ones t-yosemite-r7-(0020,0021), the rest are blocked, and since no jobs run on them they remain in the same state.

The machines that do take jobs also have generic-worker present in the process list (see generic-worker.PNG), while the others don't.

I tested and checked the logs, but I don't know why these machines don't start taking jobs when we have around 2000 pending, and I don't know what triggers those jobs. Amy, if you or Jake have any more ideas, please have a look at the logs below and update the bug if you find something; you can also run any tests on these machines:

https://papertrailapp.com/systems/178509233/events
https://papertrailapp.com/systems/178504023/events
https://papertrailapp.com/systems/118426583/events?focus=826180397720305716
https://papertrailapp.com/systems/156827283/events?q=generic&focus=826476780092293123
https://papertrailapp.com/systems/156863723/events?focus=826482816299728905
https://papertrailapp.com/systems/156863723/events?focus=826485957204279344
Flags: needinfo?(jwatkins)
Flags: needinfo?(arich)
It doesn't look like generic-worker is running, and you need to debug why. In order to do that, make sure that puppet has completed successfully by going through the script /usr/local/bin/run-puppet.sh (it looks like it had on t-yosemite-r7-0176, since the semaphore file was there). If the semaphore file is there at the end, then it should run the script that starts either buildbot or the generic-worker (depending on the machine).

I did a grep -r for the semaphore file name in the puppet code and found the script it calls to start the worker. If you do the same, you will find modules/generic_worker/templates/generic-worker.plist.erb, and if you read that you will find the script it then tries to run on this machine, /usr/local/bin/run-generic-worker.sh, along with the arguments it's called with. Work from that to debug why generic-worker is not running. Reach out to the taskcluster team on #taskcluster to help you debug if you get stuck.
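
Something like the following, run from a puppet checkout and on one of the stuck minis (the exact command forms are suggestions, not taken from this bug):

  # in the puppet repo: find the semaphore name and the worker startup pieces
  grep -r semaphore modules/ | grep -i generic

  # on the mini: is launchd loading the generic-worker job at all?
  sudo launchctl list | grep -i generic

  # is the worker process alive, and what has it logged via syslog?
  ps -ef | grep generic-worker
  grep generic-worker /var/log/system.log | tail -n 50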
Flags: needinfo?(jwatkins)
Flags: needinfo?(arich)
Blocks: 1382360
Tests revealed that additional reboots can trigger "run-generic-worker.sh", and then generic-worker begins to work.
I suspect the issue is related to the "reboot_semaphore": when I ran puppet agent to pin a host to my environment, the host was never restarted; I remained connected to it.
This is what should happen:
- the password is changed in hiera -> the next time a mini reboots, it runs puppet -> during the puppet run, it removes the cltbld keychain file, then tells the machine to reboot -> the machine reboots and runs puppet again (successfully) and cltbld autologs in -> after the puppet run, it creates a semaphore file that buildbot/taskcluster look for to start the worker -> the CI system starts
This is what is happening:
- the password is changed in hiera -> the next time a mini reboots, it runs puppet -> during the puppet run, it removes the cltbld keychain file, but the machine is not rebooted -> so it needs an additional reboot -> the machine reboots and runs puppet again (successfully) and cltbld autologs in -> after the puppet run, it creates a semaphore file that buildbot/taskcluster look for to start the worker -> the CI system starts

Jake, can you please check the "reboot_semaphore", or should we add a "reboot" action at the end of run-puppet.sh?
It is also possible that, when I test with my environment, the first run does not reboot the machine after puppet ran, and the actual state could just work, but we can't risk another bustage.
Flags: needinfo?(jwatkins)
We had a mantra a while back: "how do you run puppet? sudo reboot."  The same applies for pinned environments.  When you're working on a module and looking for syntax errors, running it manually is OK, but when you want to verify how something will work on workers in production, always reboot.
Dustin is correct.  You are going to need to initiate a reboot manually after updating the passwords in hiera.  Understand that, by design, puppet is not intended to actually trigger the reboot itself.  The reboot semaphore logic is executed during boot.

This is the logic of rebooting, running puppet at boot, and using the semaphore:

manually reboot host -> run puppet in a loop until it successfully completes -> if puppet created a reboot semaphore, remove the semaphore and reboot; otherwise continue booting

There is already a reboot action at the end of run-puppet.sh. That is the entire purpose of the reboot semaphore.
https://hg.mozilla.org/build/puppet/file/tip/modules/puppet/templates/puppet-atboot-common.erb#l77
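
Paraphrased as shell (the real logic is in the template linked above; the semaphore path here is hypothetical):

  SEMAPHORE=/var/lib/puppet/reboot_semaphore   # hypothetical path

  # at boot: run puppet in a loop until it completes successfully
  until /usr/bin/puppet agent --onetime --no-daemonize; do
    sleep 60
  done

  # if that run raised the reboot semaphore, clear it and reboot now;
  # otherwise continue booting and let launchd start buildbot/generic-worker
  if [ -f "$SEMAPHORE" ]; then
    rm -f "$SEMAPHORE"
    /sbin/reboot
  fi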


So, assuming you are rebooting after the hiera change, does the host fail to set the reboot semaphore after the keychain is deleted, and/or fail to observe the reboot semaphore's request to reboot?
Flags: needinfo?(jwatkins)
The hiera change shouldn't be picked up until the machine is rebooted (it won't run puppet till then), so there has to be a reboot in there by definition for this to have broken, I think. This may not be true for Andrei's test cases, but it should be true in production. Andrei, in your test case, pin it to your env, then make the hiera change, then reboot the machine without running puppet manually. Does that work or fail?
I tested differently today: first re-imaged t-yosemite-r7-(0190,0160,0030,0191,0150,0020,0021), then pinned them to my environment while the environment had no changes yet. I monitored them and saw some jobs start; after that I changed my environment to use the new hiera secrets with all the changes, then continued monitoring without doing any manual reboot.
After finishing the existing job, the machines rebooted themselves, puppet ran, and after that generic-worker began running too (for the machines on taskcluster).
ex.
[root@t-yosemite-r7-0191.test.releng.scl3.mozilla.com ~]# ps -ef | grep generic
   0:00.00 /bin/bash /usr/local/bin/run-generic-worker.sh run --config /etc/generic-worker.config
 0:00.21 /usr/local/bin/generic-worker run --config /etc/generic-worker.config
  0:00.01 logger -t generic-worker -s

Also checked some jobs and they seem to work fine, and the process continued after each finished job:

ex. https://tools.taskcluster.net/groups/I2PwyVT-T4ePqreuAwkBig/tasks/LWrzSQJiQhGRsDaI-UGtqw/runs/0

All this suggests that we can make the changes in production; everything should work fine.
Amy, if there is nothing else that you think may go wrong, I can make the changes in production.
Flags: needinfo?(arich)
You should coordinate with the taskcluster team and make sure there's a clear understanding of what success/failure looks like and how to monitor for that. I suspect that the taskcluster team will need to be onhand to help monitor/troubleshoot, so the time that you choose to make this change will be partially dependent on availability of someone from that team (we need to make sure we have a designated person). There should be constant monitoring for failure after the change is made to identify any failure markers early on. You should also have a rollback plan (and understand when to enact that plan) in the event that it does not succeed.

You should then communicate the intention to make this change and the time at which we plan to do it to firefox-ci@mozilla.com.
Flags: needinfo?(arich)
We plan to do this change again tomorrow in the European morning, since we don't have too many jobs running then. I spoke with :wcosta and he will help me after the change to make sure that the machines are taking jobs and everything is working as expected.

We will keep an eye on the pending jobs for OS X tests ([1]) and on the pool ([2]) right after the change; a polling sketch follows the links.

[1] https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010
[2] https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-yosemite-r7
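
A simple way to watch [1] is to poll it (the endpoint just returns a small JSON document with the pending count):

  # re-fetch the pending count for the osx test pool every minute
  watch -n 60 'curl -s https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010'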

Mihai, if the timing is too close to the release and you think this change may affect production, please post here and we will discuss a later schedule.
Flags: needinfo?(mtabara)
Please verify with releaseduty to make sure there are no critical releases in flight before making changes tomorrow.
(In reply to Amy Rich [:arr] [:arich] from comment #32)
> Please verify with releaseduty to make sure there are no critical releases
> in flight before making changes tomorrow.

Spoke with Sylvestre and the rest of the releaseduty quorum, and we decided to postpone this until next week, when the waters will hopefully be calmer.
Really sorry for the short notice to everyone involved. Andrei will be on PTO next week, but buildduty can cover this, or we can wait until his return; we'll see how it goes.

Thank you and sorry again.
Flags: needinfo?(mtabara)
Status update on this bug: Andrei is on PTO this week. He left instructions on how to do this with both :spacurar and :aselagea, but I'd personally prefer to postpone to next week. We still have 56.0b3, 56.0b5 and 55.0.2 in the pipeline this week. Since this is internally driven, I think we can safely postpone and attempt this next Tuesday.
The change was rescheduled for tomorrow; wcosta and I will focus on the things mentioned in #c31.
Password was changed for:

root_pw_paddedsha1
root_pw_saltedsha512
root_pw_pbkdf2_iterations
root_pw_pbkdf2_salt
root_pw_pbkdf2!low-security
builder_pw_pbkdf2
builder_pw_pbkdf2_salt
builder_pw_pbkdf2_iterations
builder_pw_paddedsha1
builder_pw_saltedsha512
builder_pw_kcpassword_base64

The machines began taking jobs after being rebooted. We kept monitoring the jobs and so far all seems well.
Everything seems to be fine after yesterday's changes, so I'll mark this bug as resolved.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
No longer depends on: 1393007
See Also: → 1393007
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard