Closed Bug 1371780 Opened 8 years ago Closed 8 years ago

Task user is not logging in on Windows hardware following OCC being applied

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

References

Details

Current assumption that this is a new race condition. This was not happening during the prior week, and is not the case on hardware Windows 10.
Assignee: relops → mcornmesser
I missed a match when upgrading the generic worker version. That seem to have caused the issue.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID
Actually it appears to be a condition on hardware where after the OCC with generic worker 10.0.4 is applied the geerneric worker account is not there. And this seems to be the case for both 7 and 10.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Blocks: 1351366
In release 10.0.4, generic worker runs as LocalSystem as a windows service, so the generic worker account is no longer needed. The new mode of operation is that the worker installs as a service, and then the machine is rebooted. The worker starts up, creates a task user, and reboots *into the task user account*. Once the login completes, the worker is still running as a Windows service as LocalSystem, and when it claims a task, it will execute the commands under the task user logon session. Once the task completes, the worker will create a new task user, and then reboot into that user... So there is no generic worker user account any more. Let me know if you have questions! :) Pete
OK. The piece I am not seeing on hardware is the task user login. Any ideas?
Flags: needinfo?(pmoore)
Summary: Generic Worker account absent on hardware Windows 7 following OCC being applied → Task user is not logging in on Windows hardware following OCC being applied
One thing I spotted in the config file I see you are downloading "https://github.com/taskcluster/generic-worker/releases/download/v10.0.5/generic-worker-windows-386.exe" But declaring later "Match": "generic-worker 10.0.4" They should both be 10.0.4 or 10.0.5 (10.0.5 is better). The OCC file should install the worker under C:\generic-worker If the install is successful, there should be a windows service called "Generic Worker". If that doesn't exist, check the logs from the OCC run. If it installed ok, there needs to be a reboot, or the windows service needs to be started manually if you're not rebooting. The service can be started manually by running: C:\nssm-2.24-103-gdee49fc\win32\nssm.exe start "Generic Worker" After a short time, it should create a user, and reboot. If it doesn't reboot, check the logs: C:\generic-worker\generic-worker.log To look at the service, to see where it writes logs to etc you can run C:\nssm-2.24-103-gdee49fc\win32\nssm.exe edit "Generic Worker" Typcially it just executes C:\generic-worker\run-generic-worker.bat and that in turn writes logs to C:\generic-worker\generic-worker.log (but check the .bat script to see what it is doing, as it looks like from the OCC manifest it is a patched .bat script). If it doesn't run and reboot into the task user, but the C:\generic-worker\generic-worker.log file is generated, please let me know what it says in the log. BE CAREFUL NOT TO PASTE THE ENTIRE LOG IN THIS BUG as it contains the taskcluster credentials!!!!! YOU HAVE BEEN WARNED!!! xD However, if you are careful, the credentials are only mentioned near the top of the log - but obviously be very careful there is nothing else sensitive in there.... Good luck!
Flags: needinfo?(pmoore)
Also, we can schedule a live session if you like to go through it together. That might be more effective than async communication - maybe we can iron out any issues in a live session. Just let me know when is good for you.
For reference I am currently working on Win 10. OCC runs, and issues a reboots, and then it boots back up but does not auto login. The generic worker service is running, and the only account at that point is root. The generic worker log ends with: 2017/06/13 23:15:31 Making system call WTSQueryUserToken with args: [1 **********] 2017/06/13 23:15:31 Result: 0 0 An attempt was made to reference a token that does not exist. 2017/06/13 23:15:31 Making system call WTSGetActiveConsoleSessionId with args: [] 2017/06/13 23:15:31 Result: 1 0 The operation completed successfully. 2017/06/13 23:15:31 Making system call WTSQueryUserToken with args: [1 **********] 2017/06/13 23:15:31 Result: 0 0 An attempt was made to reference a token that does not exist. 2017/06/13 23:15:31 Saving file file-caches.json with content: {} 2017/06/13 23:15:31 Saving file directory-caches.json with content: {} 2017/06/13 23:15:31 goroutine 1 [running]: runtime/debug.Stack(0xc04217b758, 0x8ddb60, 0xc0422da280) /home/travis/go/src/runtime/debug/stack.go:24 +0x80 main.RunWorker.func1(0xc04217bd40) /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:489 +0x75 panic(0x8ddb60, 0xc0422da280) /home/travis/go/src/runtime/panic.go:458 +0x251 main.prepareTaskUser(0xc04223fbe0, 0x4, 0x2) /home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:109 +0x7bc main.PrepareTaskEnvironment(0xed0d268d7) /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1169 +0x167 main.RunWorker(0x0) /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:546 +0x5b0 main.main() /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:328 +0x7fe 2017/06/13 23:15:31 *********** PANIC occurred! *********** 2017/06/13 23:15:31 WTSQueryUserToken: An attempt was made to reference a token that does not exist. 2017/06/13 23:15:31 Exiting worker with exit code 69 I wasn't sure if "WTSQueryUserToken with args:" was referencing a sensitive value in relation to the token, so I redacted it.
Can you patch C:\generic-worker\run-generic-worker.bat on a machine to append to the log file, rather than replace it? i.e. C:\generic-worker\generic-worker.exe run --config C:\generic-worker\gen_worker.config > C:\generic-worker\generic-worker.log 2>&1 => C:\generic-worker\generic-worker.exe run --config C:\generic-worker\gen_worker.config >> C:\generic-worker\generic-worker.log 2>&1 Can you then start from a fresh image, and send me the entire log file? My public gpg key is https://keybase.io/pmoore/pgp_keys.asc and https://gpg.mozilla.org/pks/lookup?op=get&search=0x8295FE77B475FEBC The above extract suggests that a task user has been created, and that the autologin registry settings have been applied - but maybe there is a problem with the autologon. This particular system call is attempting to get the handle of the interactively logged in user, and it normally tries retries every 100ms for up to a minute (iirc). It looks like after a full minute, the autologon has not been successful, so I'm wondering if there might be anything in the logs to suggest what went wrong. Another thing that would be useful would be to check if a task user actually got created. Another thing to check would be what is in the winlogon registry: reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon" In any case, by appending to the log, it should not get flushed between reboots, and then we should see what happened. Alternatively, if you are already logging to papertrail, can you share a url with me? I have access to the releng account. To me, it looks like the registry settings are made to auto-logon, but the auto-logon is failing (perhaps local policy does not accept the generated random password?). However, I would expect it to fail earlier, rather than continuing, if this was the case. Thanks Mark!
The task user account is not present, and the winlogon registry values are unchanged from what MDT sets it as. I am going to kick off a new image shortly.
Hi Mark, It looks like the problem is in C:\generic-worker\generic-worker.config: "runTasksAsCurrentUser": true This needs to be false. It is trying to run tasks as the user running generic-worker, which in this case is the LocalSystem account. Since no user is logged into the machine, it can't get a handle to the interactive logon session. I'll check over the other config to see if there is anything else that might need adjusting.
"disableReboots": false This will mean that the generic-worker will trigger reboots. This is probably fine, but you might want to check with grenade if this will interfere with HaltOnIdle. On the gecko worker types in AWS this is disabled, and rebooting is managed by run-generic-worker.bat I believe, mostly because the Z: drive is formatted after the worker completes a task. However, if no Z: drive formatting is occurring on hardware machines, then the worker might as well reboot directly, I see no problem with this config. But I don't know HaltOnIdle too well, and what it might expect, and if that is something that runs on hardware or not (I guess not, as on hardware we probably don't want to shutdown in case of problem - do you know?).
Downloads dir and caches dir are usually siblings - no reason they need to be, but seemed unusual to have "cachesDir": "C:\\caches" "downloadsDir": "C:\\Users\\GenricWorker\\downloads" I'd recommend putting them both on C:\ so they are together, although it won't break anything as you have it here.
"idleTimeoutSecs": 0 On the Mac workers we set this to something like 96 hours, and then make sure machines are rebooted when the worker exits, in order that they get puppet updates. Not sure if you want to do the same on Windows? See the OS X puppet configs: https://hg.mozilla.org/build/puppet/file/eff4f2b582a0/modules/generic_worker/templates/generic-worker.config.erb Note - That is based on having this version of run-generic-worker.sh: https://hg.mozilla.org/build/puppet/file/eff4f2b582a0/modules/generic_worker/templates/run-generic-worker.sh.erb (which would be the equivalent of your https://github.com/mozilla-releng/OpenCloudConfig/blob/26010943c1def56d0c30ef24638ca1bb34205ba8/userdata/Configuration/GenericWorker/run-hw-generic-worker-and-reboot.bat )
"numberOfTasksToRun": 1, This should be 0 not 1. See the docs: https://github.com/taskcluster/generic-worker/#set-up-your-env
"runAfterUserCreation": "C:\\generic-worker\\task-user-init.cmd", I suspect this isn't needed.
"signingKeyLocation": "not-configured", This is ok on testers, but you would need to set this on builders.
"tasksDir": "C:\\", It is probably *safer* to keep the default here, C:\Users (i.e. "C:\\Users") so that the task directory is also the user home directory. This means you won't get adjusted APPDATA and LOCALAPPDATA env vars etc, which is just a lot safer and more predictable.
The rest looks good! After you've made these changes, can you send me a log file again? Also, can you hook the machine(s) up to papertrail? This would make remote troubleshooting easier (for me). Thanks Mark! I already have read access to the releng papertrail account.
I have updated the config file. Currently I am trying to get the machine to talk to papertrail.
t-w864-ix-011 and 012 are now hooked up to papertrail. We are still in the state where the task user is not logging in.
TL;DR: Make sure nothing is setting winlogon registry keys (just leave the generic-worker to do this) (it looks like something is setting/unsetting winlogon auto login for root account) Further details below. ------------------------------ From looking at the logs, at one point, there is an auto logon setup for the root user: https://papertrailapp.com/systems/949147771/events?focus=812147456891576328&selected=812147456891576328 This interferes with the generic worker, which expects to be able to configure the auto logon user itself. You will need to make sure that when the generic worker is started up for the first time, that no auto winlogon is configured, otherwise the generic worker will currently assume this is the user to be used for executing the next task. Another thing that stands out, is that there are two logs for this machine. First log ========= Papertrail url: https://papertrailapp.com/systems/949147771 System name: T-W864-IX-011.wintest.releng.scl3.mozilla.com Start time: 2017/06/15 23:10:39 UTC End time: 2017/06/15 23:11:04 UTC This log is clean, the worker starts up, sees it is logged in as root user, and assumes this will be the task user. It prepares the log directory for the next task (C:\Users\root\public\logs) which it assumes will run as root (for the reason mentioned above). And then we never get any more output. My guess is something reboots the machine. Is DSC/OCC stuff running at this point? Second log ========== Papertrail url: https://papertrailapp.com/systems/T-W864-IX-011 System name: T-W864-IX-011 Start time: 2017/06/16 00:00:42 UTC End time: 2017/06/16 00:01:55 UTC This second log does not start at the beginning. The first part of the log file is missing. Is it possible that the machine rebooted, the worker started up, then the papertrail logging started up afterwards, and the logging that took place before the papertrail connection is not flushed to papertrail? This log is harder to dianose, due to the front portion of it being missing. It also takes place 49m38s after the last log entry of the first log, which suggests that DSC/OCC was still running when the first generic-worker invocation occurred, and that the second invocation occurred after DSC/OCC rebooted the computer. This second log shows generic-worker failing 1m13s after the first log entry, after spending one minute trying to get a handle on the interactive user (which presumably doesn't exist). This suggests at this point, the winlogon registry settings were set, but the machine hadn't been rebooted, so the auto logon had not occurred. Or the username/password combination was invalid. In both cases, this must have been something outside of the worker setting the registry, since the worker would always reboot (unless NSSM is starting it up immediately after it exits). There are two places in generic worker code that try to gain an access token for the interactive user. Without the full log, it is not possible to know for sure which of the two is the cause of this attempt in the second log. However, we can have a guess. The first use case is when an auto winlogon is configured and the worker assumes this user should be used to run the next task it claims. Since it sees an automated winlogon has been configured, it assumes that within a minute, that logon session should have been created, and tries to get an auth token for it. It uses the auth token to access the user profile, in order that it can prepare the task directories etc for the next task that comes in. The second use case is when it has actually claimed a task, and wants to execute a command (which can happen hours later, if e.g. no task is claimed for several hours). Technically this is probably not needed, and we could store the token from the first use case, but anyway, that is how the code looks today. I'm not sure if this failure is due to the initial task directory preparation, or whether it is a failure happening when trying to execute a task. So my questions: 1) Is it possible that winlogon registry entries have been created by something outside generic-worker, and they are invalid? 2) Please can you provide me with the output of the reg query in comment 8 (pgp encrypted)? 3) Can you confirm generic-worker service isn't started up during machine setup (DSC/OCC), but only when DSC/OCC has completed and machine has been rebooted? 4) Is it possible to configure the papertrail logging to flush all of the logs (if papertrail logging starts up after generic worker for some reason)? 5) Can you provide details of what auto winlogon stuff is setup outside of generic worker? Please note, when generic worker runs, it uses this information to decide whether it needs to create a new user (because no autologon has been set up) or to use an existing user (if it finds there is an autologon configured). It is therefore important that the registry settings do not exist / are empty the first time the generic worker runs. 6) Do you know why there was a 49 minute gap between the two logs, and why they are logging under different system names? The most likely candidate I see as being the cause is that winlogon settings are applied by something (DSC/OCC?) and a reboot is not issued, and then the generic worker is started. When the worker reads those registry settings, it assumes a reboot occurred, and therefore it is finds values set, it will assume that user is already logged on. The other possibility is that for some reason, the generic-worker is getting restarted when it exits (e.g. by NSSM) - you can check the NSSM service configurations for that (see comment 5). Also notice in comment 13, on OS X, we intentionally sleep after issuing the reboot, to make sure the run script is still running, to prevent another invocation. It would make sense to do this too, if NSSM is restarting it, for whatever reason (although I suspect this is a red herring, since after the reboot it would self-recover anyway).
MDT is what setting the root auto logon. However, by the time OCC runs MDT is disconnected. I will add a clean up of the registry entries to the script that invokes OCC. Also I am going to have MDT do the initial install of nxlog and copy of the nxlog.conf file, so that the full OCC invocation is capture by papertrail.
I have made the changes from comment 22, and kicked off new installs. > 1) Is it possible that winlogon registry entries have been created by > something outside generic-worker, and they are invalid? MDT, which deploys the OS and and initial configuration, is setting the values of the Win autologin for root. Which is needed to for various pieces post OS installation. As is MDT has builtin clean up step that would remove these entries, but this was leaving these registry settings in a state where OCC was unable to update them. After that logic was removed from the clean up script OCC was able to set the values for the generic worker account login. > 2) Please can you provide me with the output of the reg query in comment 8 > (pgp encrypted)? That would be the values set by MDT for the root login > 3) Can you confirm generic-worker service isn't started up during machine > setup (DSC/OCC), but only when DSC/OCC has completed and machine has been > rebooted? I will have to address this in OCC.I am currently unsure when the service is starting. > 4) Is it possible to configure the papertrail logging to flush all of the > logs (if papertrail logging starts up after generic worker for some reason)? I am having MDT install and configure nxlog, so hopefully now we capture everything. Previously OCC was installing and configuring so there may had been an overlap between nxlog service starting and generic worker installation. > 5) Can you provide details of what auto winlogon stuff is setup outside of > generic worker? Please note, when generic worker runs, it uses this > information to decide whether it needs to create a new user (because no > autologon has been set up) or to use an existing user (if it finds there is > an autologon configured). It is therefore important that the registry > settings do not exist / are empty the first time the generic worker runs. The registry values should now be empty at the time OCC is invoked. > 6) Do you know why there was a 49 minute gap between the two logs, and why > they are logging under different system names? Currently, I am using two machines.
Task user is now logging in.
I think the generic worker maybe causing a reboot before OCC is finished: Jun 16 13:50:58 T-W864-IX-011.wintest.releng.scl3.mozilla.com generic-worker: {}#015 Jun 16 13:50:58 T-W864-IX-011.wintest.releng.scl3.mozilla.com generic-worker: 2017/06/16 20:50:57 Exiting worker with exit code 67#015 Jun 16 13:50:58 T-W864-IX-011.wintest.releng.scl3.mozilla.com generic-worker: 2017/06/16 20:50:57 Immediate reboot being issued...#015 In the event viewer there is a shutdown with the comment of "generic-worker requested reboot".
Sorry, you removed only the second part, but note, **you'll need to add that back in otherwise after reboot after one task, the worker won't start up again**. The OCC mechanism for signalling to generic-worker that it has finished and generic-worker can start up, is writing a file to C:\dsc\task-claim-state.valid If generic-worker is starting up before that file is written, it suggests that either: 1) the file C:\generic-worker\run-generic-worker.bat script is the wrong one (i.e. not a copy of https://github.com/mozilla-releng/OpenCloudConfig/blob/26010943c1def56d0c30ef24638ca1bb34205ba8/userdata/Configuration/GenericWorker/run-hw-generic-worker-and-reboot.bat), or 2) the file C:\dsc\task-claim-state.valid is getting created prematurely I recommend manually comparing the run-generic-worker.bat script in 1) to make sure the patched version is the one deployed (not the default version installed by generic worker) and also making sure that C:\dsc\task-claim-state.valid is only created once OCC finishes.
(In reply to Mark Cornmesser [:markco] from comment #23) > Currently, I am using two machines. System name: T-W864-IX-011 and System name: T-W864-IX-011.wintest.releng.scl3.mozilla.com are two different machines? Why do they have the same unqualified name?
I am going to close this bug, since the original issue has been resolved, and open up 2 new bugs: https://bugzilla.mozilla.org/show_bug.cgi?id=1374363 OCC configured hardware machines reporting to papertrail under two names https://bugzilla.mozilla.org/show_bug.cgi?id=1374366 OCC configured hardware is not picking up tests
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.