Closed Bug 1240031 Opened 8 years ago Closed 8 years ago

Generic Worker cannot run under a Windows Service after removing use of PsExec

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: pmoore, Unassigned)

References

Details

(Whiteboard: [generic-worker])

Generic Worker runs fine if you start it from a command prompt, but not when running as a Windows Service.

The problem was introduced when PsExec was removed from the toolchain in bug 1176072.

It turns out that in the former implementation, which used PsExec, the task processes were inadvertently running under the System account, see:

* https://github.com/taskcluster/generic-worker/blob/32688930db93bc56790a054ec2e0a7ae00fa2a84/plat_windows.go#L335
* https://technet.microsoft.com/en-us/sysinternals/psexec.aspx

The solution is therefore not simply to switch back to PsExec. The problem is not that we stopped using PsExec, but that we shouldn't execute tasks under the System account at all. In other words, the migration away from PsExec uncovered an issue we hadn't noticed before: the generic worker was not running tasks under the user it created for them, but under the System account instead.

There are some online resources that might help solve this problem, although it is expected to be non-trivial to solve:

* http://blogs.msdn.com/b/winsdk/archive/2013/05/01/how-to-launch-a-process-interactively-from-a-windows-service.aspx
* http://blogs.msdn.com/b/winsdk/archive/2009/07/14/launching-an-interactive-process-from-windows-service-in-windows-vista-and-later.aspx

The error we actually get in the generic worker log (C:\generic-worker\generic-worker.log) is:
  * Error executing command 0: "TASK NOT SUCCESSFUL: status Errored with reason: \"worker-shutdown\" due to exit status 3221225794"
which comes from:
  * https://github.com/taskcluster/generic-worker/blob/f88149d4d3a2d10a6749e05ba143e5e90f9db2da/main.go#L744
which comes from:
  * https://github.com/taskcluster/generic-worker/blob/f88149d4d3a2d10a6749e05ba143e5e90f9db2da/main.go#L736
which comes from:
  * https://github.com/taskcluster/generic-worker/blob/f88149d4d3a2d10a6749e05ba143e5e90f9db2da/main.go#L794
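
The exit status in that log line is easier to interpret in hexadecimal, because Windows NTSTATUS codes are normally documented that way. The following is a minimal sketch; the helper name ntStatusHex is hypothetical and not part of the generic worker. The value it prints, 0xC0000142, is listed by Microsoft as STATUS_DLL_INIT_FAILED, a code commonly seen when a process cannot initialise because it has no access to a window station or desktop, which would be consistent with the Session 0 problem discussed in this bug.

```go
package main

import "fmt"

// ntStatusHex renders a Windows process exit status as the 32-bit
// hexadecimal form in which NTSTATUS codes are usually documented.
// (Hypothetical helper, for illustration only.)
func ntStatusHex(exitStatus uint32) string {
	return fmt.Sprintf("0x%08X", exitStatus)
}

func main() {
	// Exit status copied from the generic worker log line above.
	fmt.Println(ntStatusHex(3221225794)) // prints 0xC0000142
}
```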

In the case of task run https://tools.taskcluster.net/task-inspector/#TddeVzXEQU6KQivGJ7ja3A/0, the worker is trying to run the script C:\Users\Task_1452863936\command_000000_wrapper.bat as the user "Task_1452863936" via the CreateProcessWithLogonW system call (https://msdn.microsoft.com/en-us/library/windows/desktop/ms682431%28v=vs.85%29.aspx, see https://github.com/taskcluster/generic-worker/blob/c20737cb3714c4e4525a2bdf4a3301622b0285b0/syscall/zsyscall_windows.go#L38). The script has the following content:

:: This script runs command 0 defined in TaskId TddeVzXEQU6KQivGJ7ja3A...
@echo off
set TOOLCHAIN=64-bit cross-compile
set TOOLTOOL_REPO=https://git.mozilla.org/build/tooltool.git
set Framework40Version=v4.0
set WindowsSDK_ExecutablePath_x64=C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1A\bin\NETFX 4.5.1 Tools\x64\
set MOZBUILDDIR=C:\mozilla-build
set SDKDIR=C:\Program Files (x86)\Windows Kits\8.1\
set GECKO_HEAD_REV=fc4b30cc56fb3a63fce819390712abf3ba8b0692
set MOZBUILD_STATE_PATH=C:\Users\Administrator\.mozbuild
set SDKROOTKEY=HKLM\SOFTWARE\Microsoft\Windows Kits\Installed Roots
set GECKO_HEAD_REPOSITORY=https://hg.mozilla.org/try/
set LIBPATH=C:\Windows\Microsoft.NET\Framework64\v4.0.30319;C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\LIB;C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\ATLMFC\LIB;C:\Program Files (x86)\Windows Kits\8.1\References\CommonConfiguration\Neutral;C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1\ExtensionSDKs\Microsoft.VCLibs\12.0\References\CommonConfiguration\neutral;
set WindowsSdkDir=C:\Program Files (x86)\Windows Kits\8.1\
set MOZ_BUILD_DATE=19770819000000
set MSVCKEY=HKLM\SOFTWARE\Wow6432Node\Microsoft\VisualStudio\12.0\Setup\VC
set FrameworkVersion=v4.0.30319
set MOZ_MSVCVERSION=12
set MOZ_CRASHREPORTER_NO_REPORT=1
set MACHTYPE=i686-pc-msys
set MOZILLABUILD=C:\mozilla-build
set WINCURVERKEY=HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion
set Path=C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\CommonExtensions\Microsoft\TestWindow;C:\Program Files (x86)\MSBuild\12.0\bin\amd64;C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\BIN\amd64_x86;C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\BIN\amd64;C:\Windows\Microsoft.NET\Framework64\v4.0.30319;C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\VCPackages;C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE;C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\Tools;C:\Program Files (x86)\HTML Help Workshop;C:\Program Files (x86)\Microsoft Visual Studio 12.0\Team Tools\Performance Tools\x64;C:\Program Files (x86)\Microsoft Visual Studio 12.0\Team Tools\Performance Tools;C:\Program Files (x86)\Windows Kits\8.1\bin\x64;C:\Program Files (x86)\Windows Kits\8.1\bin\x86;C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1A\bin\NETFX 4.5.1 Tools\x64\;C:\Windows\System32;C:\Windows;C:\Windows\System32\Wbem;C:\mozilla-build\moztools-x64\bin;C:\mozilla-build\7zip;C:\mozilla-build\info-zip;C:\mozilla-build\kdiff3;C:\mozilla-build\mozmake;C:\mozilla-build\nsis-3.0b1;C:\mozilla-build\nsis-2.46u;C:\mozilla-build\python;C:\mozilla-build\python\Scripts;C:\mozilla-build\upx391w;C:\mozilla-build\wget;C:\mozilla-build\yasm
set FSHARPINSTALLDIR=C:\Program Files (x86)\Microsoft SDKs\F#\3.1\Framework\v4.0\
set Platform=X86
set VSINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio 12.0\
set LIB=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\LIB;C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\ATLMFC\LIB;C:\Program Files (x86)\Windows Kits\8.1\lib\winv6.3\um\x86;
set VisualStudioVersion=12.0
set GECKO_HEAD_REF=fc4b30cc56fb3a63fce819390712abf3ba8b0692
set ExtensionSdkDir=C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1\ExtensionSDKs
set INCLUDE=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\INCLUDE;C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\ATLMFC\INCLUDE;C:\Program Files (x86)\Windows Kits\8.1\include\shared;C:\Program Files (x86)\Windows Kits\8.1\include\um;C:\Program Files (x86)\Windows Kits\8.1\include\winrt;
set VCINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\
set VCDIR=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\
set MOZ_TOOLS=C:\mozilla-build\moztools-x64
set MOZ_AUTOMATION=1
set SDKMINORVER=1
set MSYSTEM=MINGW32
set MOZ_MSVCBITS=32
set FrameworkDir=C:\Windows\Microsoft.NET\Framework64
set FrameworkDIR64=C:\Windows\Microsoft.NET\Framework64
set GECKO_BASE_REPOSITORY=https://hg.mozilla.org/mozilla-central
set WIN81SDKKEY={5247E16E-BCF8-95AB-1653-B3F8FBF8B3F1}
set WIN64=1
set TOOLTOOL_REV=master
set SDKPRODUCTKEY=HKLM\SOFTWARE\Microsoft\Windows Kits\Installed Products
set MOZ_MSVCYEAR=2013
set MAKE_MODE=unix
set CommandPromptType=Cross
set PreferredToolArchitecture=x64
set WindowsSDK_ExecutablePath_x86=C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1A\bin\NETFX 4.5.1 Tools\
set FrameworkVersion64=v4.0.30319
set SDKVER=8
cd "C:\Users\Task_1452863936"
set errorlevel=
call C:\Users\Task_1452863936\command_000000.bat 2>&1
set tcexitcode=%errorlevel%
set > C:\Users\Task_1452863936\env.txt
cd > C:\Users\Task_1452863936\dir.txt
exit /b %tcexitcode%

The problem is that tasks may need to be interactive (e.g. open a window on a desktop), and in modern versions of Windows, Services run in Session 0, which is non-interactive. We probably need to find some way to:

a) automatically start an interactive session (e.g. automatically RDP into the machine)
b) grant the temporary user (that the generic worker creates) access to that interactive session - perhaps we even need a daemon running in the interactive session that can do that
c) get a token via some system call, if we can work out which one(s), that the process can use in order to attach to the interactive session

It is not going to be easy.

Alternatives:

1) re-evaluate design of creating temporary users on the fly, e.g.
  1a) delegate cleanup to something outside of the generic worker
  1b) run inside a virtualisation engine
  1c) destroy workers when tasks complete
  ....
2) only use for non-interactive processes
3) only use for "trusted" tasks (and enforce with scopes)
...

I think it is worth taking a stab at solving it, especially if we can pull in help from Windows experts that know the Windows APIs well; otherwise we'll be left with one of the options above, or some other alternative.
Jonas also suggested the possibility we somehow auto-login, and run the generic worker as e.g. a startup item, rather than a Windows service. This could also be an option, if we can find a way to e.g. auto-rdp in.
Auto-login and a start up task are relatively easy to implement, but why would we need to auto-rdp in?
We have done a great deal of work to remove rdp from the testing and build process. For resolution issues in testing we have implemented a virtual display driver. 

You can have a service run and interact with the desktop, but the account the service runs under and the logged-in user have to be the same; otherwise you will run into problems.

The usual best bet is to create a triggered scheduled task that is set to trigger at any user login with "highest privileges". If you want to create users on the fly, the user would need to be created at spin up and a set of registry entries applied for auto login. However, dynamic usernames will complicate things like .ssh file placement or anything else dependent on the user's home path.
(In reply to Mark Cornmesser [:markco] from comment #2)
> Auto-login and a start up task are relatively easy to implement, but why
> would we need to auto-rdp in?

(In reply to Q from comment #3)
> We have done a great deal of work to remove rdp from the testing and build
> process. For resolution issues in testing we have implemented a virtual
> display driver. 
> 
> You can have a service run and interact with the desktop but the account the
> service runs under and the logged in user have to be the same otherwise you
> will run into problems. 
> 
> The usual best bet is to create a triggered schedule task that is set to
> trigger at any user login with "highest privileges". If you want to create
> users on the fly the user would need to be created at spin up and a set of
> registry entries applied for auto login. However dynamic usernames this will
> complicate things like .ssh file placement or anything else dependent on the
> users home path.

Thanks Q and Mark!

In order to have some kind of isolation between tasks, we decided to create a user on-the-fly for each task. This was intended to be something like a system "guest account" whereby the user would have limited privileges (e.g. permission to update entries under HKEY_CURRENT_USER, permission to affect the user's home directory, but ideally no permission to change system/global state that could affect the cleanliness of another user's environment). After the task completes, the user's home directory and registry entries would be destroyed, much like when a guest user logs out, and all state is lost. This approach seemed wise since we wanted to reuse the same AWS instance for multiple tasks, with each task starting from a "clean" environment and able to run arbitrary code, without the risk of affecting other jobs. This was a slightly different requirement to the Buildbot implementation, whereby known trusted code would run that was defined in buildbotcustom/buildbot-configs. In the case of taskcluster, steps can be defined in the task, so arbitrary code can run in a try push, for example. It was also considered not cost-effective to spawn a new instance for every task, due to AWS billing charging a minimum of one compute hour (e.g. spawning 3 instances for 20 mins each would cost three times more than spawning one instance for up to an hour).

Maybe we need to revisit this design... Not sure the best way to proceed.
Well, are we trying to run tasks in parallel (which I would not recommend)? If not, a post script that makes some registry changes and reboots the instances would achieve the same thing. I don't see many complications on builders. However, on testers there will be some potential permission issues using things like xperf etc.

Q
(In reply to Q from comment #5)
> Well are we trying to run tasks in parallel (which I would not recommend ) ?
> If not, a post script that makes some registry changes and reboots the
> instances would achieve the same thing. I don't see many complications on
> builders. However on testers there will be some potential permission issues
> using things like xperf etc.
> 
> Q

Yeah, maybe we'll have to do that, at least for tasks that need access to the desktop... Thanks for the suggestion.

Another complication is that the generic worker runs under its own user account - if we reboot into the temporary user account we created, will we have any problems running the generic worker under the privileged account, when we are logged in with the newly created user? Would we still run the generic worker as a Windows Service, and would we be able to get the impersonation tokens etc with suitable syscalls in order to spawn processes as the logged in user with access to the desktop?
Flags: needinfo?(q)
The way we conquered this in the render / render QC world on Windows was to log a user into a "desktop session". This was a "real" user-space session that would tap the hardware GPU. It could not be an RDP session, as the RDP driver was too fuzzy to test things at a pixel level. So we would have a worker Windows service that would run in a privileged mode, gather the task payload, run any preliminary setup tasks, then use CreateProcessAsUser, passing it the session ID of the logged-in user, so that the process would launch under that context and that display.
Flags: needinfo?(q)
No longer blocks: 1176072
See Also: → 1176072
We have a workaround, to run the generic worker by auto-logging in the GenericWorker user, and having a scheduled task that runs when the GenericWorker user logs in.

However, I'll leave this bug open, as :grenade suggests that running the worker as a Windows Service would be preferable. However it no longer blocks bug 1176072 which is now resolved.
Component: Generic-Worker → Worker
Whiteboard: [generic-worker]
At the moment it does not make sense to invest effort in the generic worker to get it running as a Windows service, as it is running well as a scheduled task.

We may wish to invest effort in the future, but then probably with taskcluster worker.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
See Also: → 1298801
Component: Worker → Workers