Closed
Bug 1303455
Opened 8 years ago
Closed 8 years ago
TC Windows tests run without JOB_OBJECT_LIMIT_BREAKAWAY_OK
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gbrown, Assigned: pmoore)
References
Details
Attachments
(4 files)
In bug 1280474, we are trying to run Windows tests in taskcluster with generic-worker. A lot of tests are failing in this environment; some of the problems appear to be related to process management.
The most obvious error is
20:29:11 INFO - ProcessManager UNABLE to use job objects to manage child processes
20:29:11 INFO - ProcessManager NOT managing child processes
That's generated by mozprocess [1] because winprocess.CanCreateJobObject() returns False. CanCreateJobObject() returns False on taskcluster because JOB_OBJECT_LIMIT_BREAKAWAY_OK is not set in the limit flags [2]; JOB_OBJECT_LIMIT_BREAKAWAY_OK is set during buildbot tests.
[1] https://dxr.mozilla.org/mozilla-central/rev/f1dbeb5dee22cabc56b2cf25d2eab1435abbf131/testing/mozbase/mozprocess/mozprocess/processhandler.py#263
[2] https://dxr.mozilla.org/mozilla-central/rev/f1dbeb5dee22cabc56b2cf25d2eab1435abbf131/testing/mozbase/mozprocess/mozprocess/winprocess.py#432
Reporter | ||
Comment 1•8 years ago
|
||
It's not clear to me what determines whether a process has JOB_OBJECT_LIMIT_BREAKAWAY_OK set.
I'm wondering if it is a security/privilege issue. I called "whoami /all" from the mozharness script and note that buildbot's cltbld user appears to be an Administrator, while the taskcluster user is not.
BUILDBOT
13:14:56 INFO - Running command: ['whoami', '/all']
13:14:56 INFO - Copy/paste: whoami /all
13:14:56 INFO - USER INFORMATION
13:14:56 INFO - ----------------
13:14:56 INFO - User Name SID
13:14:56 INFO - ====================== ==============================================
13:14:56 INFO - t-w732-spot-112\cltbld S-1-5-21-2192203079-3471170106-2880594189-1000
13:14:56 INFO - GROUP INFORMATION
13:14:56 INFO - -----------------
13:14:56 INFO - Group Name Type SID Attributes
13:14:56 INFO - ==================================== ================ ============ ===============================================================
13:14:56 INFO - Everyone Well-known group S-1-1-0 Mandatory group, Enabled by default, Enabled group
13:14:56 INFO - BUILTIN\Administrators Alias S-1-5-32-544 Mandatory group, Enabled by default, Enabled group, Group owner
13:14:56 INFO - BUILTIN\Remote Desktop Users Alias S-1-5-32-555 Mandatory group, Enabled by default, Enabled group
13:14:56 INFO - BUILTIN\Users Alias S-1-5-32-545 Mandatory group, Enabled by default, Enabled group
13:14:56 INFO - NT AUTHORITY\INTERACTIVE Well-known group S-1-5-4 Mandatory group, Enabled by default, Enabled group
13:14:56 INFO - CONSOLE LOGON Well-known group S-1-2-1 Mandatory group, Enabled by default, Enabled group
13:14:56 INFO - NT AUTHORITY\Authenticated Users Well-known group S-1-5-11 Mandatory group, Enabled by default, Enabled group
13:14:56 INFO - NT AUTHORITY\This Organization Well-known group S-1-5-15 Mandatory group, Enabled by default, Enabled group
13:14:56 INFO - LOCAL Well-known group S-1-2-0 Mandatory group, Enabled by default, Enabled group
13:14:56 INFO - NT AUTHORITY\NTLM Authentication Well-known group S-1-5-64-10 Mandatory group, Enabled by default, Enabled group
13:14:56 INFO - Mandatory Label\High Mandatory Level Label S-1-16-12288 Mandatory group, Enabled by default, Enabled group
13:14:56 INFO - PRIVILEGES INFORMATION
13:14:56 INFO - ----------------------
13:14:56 INFO - Privilege Name Description State
13:14:56 INFO - =============================== ========================================= ========
13:14:56 INFO - SeIncreaseQuotaPrivilege Adjust memory quotas for a process Disabled
13:14:56 INFO - SeSecurityPrivilege Manage auditing and security log Disabled
13:14:56 INFO - SeTakeOwnershipPrivilege Take ownership of files or other objects Disabled
13:14:56 INFO - SeLoadDriverPrivilege Load and unload device drivers Disabled
13:14:56 INFO - SeSystemProfilePrivilege Profile system performance Disabled
13:14:56 INFO - SeSystemtimePrivilege Change the system time Disabled
13:14:56 INFO - SeProfileSingleProcessPrivilege Profile single process Disabled
13:14:56 INFO - SeIncreaseBasePriorityPrivilege Increase scheduling priority Disabled
13:14:56 INFO - SeCreatePagefilePrivilege Create a pagefile Disabled
13:14:56 INFO - SeBackupPrivilege Back up files and directories Disabled
13:14:56 INFO - SeRestorePrivilege Restore files and directories Disabled
13:14:56 INFO - SeShutdownPrivilege Shut down the system Disabled
13:14:56 INFO - SeDebugPrivilege Debug programs Disabled
13:14:56 INFO - SeSystemEnvironmentPrivilege Modify firmware environment values Disabled
13:14:56 INFO - SeChangeNotifyPrivilege Bypass traverse checking Enabled
13:14:56 INFO - SeRemoteShutdownPrivilege Force shutdown from a remote system Disabled
13:14:56 INFO - SeUndockPrivilege Remove computer from docking station Disabled
13:14:56 INFO - SeManageVolumePrivilege Perform volume maintenance tasks Disabled
13:14:56 INFO - SeImpersonatePrivilege Impersonate a client after authentication Enabled
13:14:56 INFO - SeCreateGlobalPrivilege Create global objects Enabled
13:14:56 INFO - SeIncreaseWorkingSetPrivilege Increase a process working set Disabled
13:14:56 INFO - SeTimeZonePrivilege Change the time zone Disabled
13:14:56 INFO - SeCreateSymbolicLinkPrivilege Create symbolic links Disabled
13:14:56 INFO - Return code: 0
TASKCLUSTER
20:29:08 INFO - Running command: ['whoami', '/all']
20:29:08 INFO - Copy/paste: whoami /all
20:29:08 INFO - USER INFORMATION
20:29:08 INFO - ----------------
20:29:08 INFO - User Name SID
20:29:08 INFO - =============================== ==============================================
20:29:08 INFO - i-0d2eae1827188\task_1474055472 S-1-5-21-2192203079-3471170106-2880594189-1010
20:29:08 INFO - GROUP INFORMATION
20:29:08 INFO - -----------------
20:29:08 INFO - Group Name Type SID Attributes
20:29:08 INFO - ====================================== ================ ============ ==================================================
20:29:08 INFO - Everyone Well-known group S-1-1-0 Mandatory group, Enabled by default, Enabled group
20:29:08 INFO - BUILTIN\Users Alias S-1-5-32-545 Mandatory group, Enabled by default, Enabled group
20:29:08 INFO - BUILTIN\Remote Desktop Users Alias S-1-5-32-555 Mandatory group, Enabled by default, Enabled group
20:29:08 INFO - NT AUTHORITY\INTERACTIVE Well-known group S-1-5-4 Mandatory group, Enabled by default, Enabled group
20:29:08 INFO - CONSOLE LOGON Well-known group S-1-2-1 Mandatory group, Enabled by default, Enabled group
20:29:08 INFO - NT AUTHORITY\Authenticated Users Well-known group S-1-5-11 Mandatory group, Enabled by default, Enabled group
20:29:08 INFO - NT AUTHORITY\This Organization Well-known group S-1-5-15 Mandatory group, Enabled by default, Enabled group
20:29:08 INFO - LOCAL Well-known group S-1-2-0 Mandatory group, Enabled by default, Enabled group
20:29:08 INFO - NT AUTHORITY\NTLM Authentication Well-known group S-1-5-64-10 Mandatory group, Enabled by default, Enabled group
20:29:08 INFO - Mandatory Label\Medium Mandatory Level Label S-1-16-8192 Mandatory group, Enabled by default, Enabled group
20:29:08 INFO - PRIVILEGES INFORMATION
20:29:08 INFO - ----------------------
20:29:08 INFO - Privilege Name Description State
20:29:08 INFO - ============================= ==================================== ========
20:29:08 INFO - SeShutdownPrivilege Shut down the system Disabled
20:29:08 INFO - SeChangeNotifyPrivilege Bypass traverse checking Enabled
20:29:08 INFO - SeUndockPrivilege Remove computer from docking station Disabled
20:29:08 INFO - SeIncreaseWorkingSetPrivilege Increase a process working set Disabled
20:29:08 INFO - SeTimeZonePrivilege Change the time zone Disabled
20:29:08 INFO - Return code: 0
See Also: → 1280474
Comment 2•8 years ago
|
||
:pmoore, do you think that the task user groups could be made a configurable option in generic-worker so that we can test the effect of adding the user to the admin group?
https://github.com/taskcluster/generic-worker/blob/master/plat_windows.go#L145
i've also seen http://serverfault.com/questions/198197/what-restrictions-does-winrm-winrs-place-on-the-processes-that-it-runs the answers hint at restrictions on the cmd shell from where processes are run, so that's another thing i'm investigating.
Flags: needinfo?(pmoore)
Comment 3•8 years ago
|
||
for talos we run xperf which requires admin privileges. I think it is expected from our tests that we run with admin credentials, we have a lot of tools that manage a set of processes and tweak settings and simulate stuff.
Assignee | ||
Comment 4•8 years ago
|
||
I'm not going to lie, this isn't pretty.
What we stand to lose from running a task as administrator, is safe sandboxing. Currently, users are like "guest users" whereby when their session ends, all trace of their existence disappears. The user is deleted, the registry changes are unmounted, the home folder is deleted, and the next task starts in a pristine clean environment, or at least in theory. The user shouldn't be able to alter system state in order that the worker is reusable between tasks.
When we allow a user to be an administrator, it means it will have access to any secrets on the worker (including taskcluster credentials, private gpg keys - which would break chain of trust, the livelog secret - which could enable malicious streaming of fake logs, and potentially any other secrets that might be used for other integrations). In addition, it could mean that a task could leave the worker in an altered state for a subsequent task, which also would be a security risk.
Admittedly, the situation in buildbot is not better - however, this is one of the reasons we wanted to move to a safer setup.
I think it would make most sense if we could dig deeper into what specific privileges are required, and see if it might be possible to grant a limited set, but not full administrator control. I think for every required privilege, we should also audit whether it is really required, or the test(s) can be adapted to work without it.
If this is an unsatisfactory answer, let's set up a meeting to discuss how we manage these conflicting requirements.
Flags: needinfo?(pmoore)
Comment 5•8 years ago
|
||
I would vote for an alternative masure- getting things running now so we can migrate over and then work on investigating this guest mode usage. Possibly we can restrict admin to a few jobs. What I see is that we have issues spawning e10s child process with our current worker, that is a non starter. Possibly we can tweak the worker to make this work smoother.
I have always maintained since the beginning that we require admin access for the xperf talos tests to work- this has to hook into the lower levels of the OS to get raw fileIO and processIO counters. Possibly there is a middle ground somewhere- but talos will be on hardware.
Pete, what type of secrets do we risk getting out? we have credentials to upload/interact with s3, treeherder, taskcluster for each task- these tests are not run on build machines, so we don't have concerns with signing keys.
Comment 6•8 years ago
|
||
Pete, even if we determine that a subset of permissions need to be added, we'll still need to be able to adjust the task users groups. Eg: maybe they could be added to a dedicated testers group. In any case, having the ability to adjust group memberships looks like it's going to be necessary. The answer isn't unsatisfactory, it's a valid point, but our testing and progress is currently blocked by a limitation in the generic worker which could be patched in a way that would allow us to make progress. Can you consider supporting us in this?
Flags: needinfo?(pmoore)
Assignee | ||
Comment 7•8 years ago
|
||
I'll add the feature, and we'll create a separate security bug to track the security implications, which we'll link to this one.
Flags: needinfo?(pmoore)
Reporter | ||
Comment 8•8 years ago
|
||
I have collected some more info about root cause of our tc windows 7 process management failures. For all of our e10s jobs and at least one xpcshell test, failure occurs when CreateProcessAsUserW is called at [1]; it sets GetLastError() to 5 - access denied. In all cases I have observed, we are calling CreateProcessAsUserW with the CREATE_BREAKAWAY_FROM_JOB flag (I'm not sure if that is the cause of the access denied, but it seems problematic in light of https://bugzilla.mozilla.org/show_bug.cgi?id=1303455#c0).
[1] https://hg.mozilla.org/mozilla-central/annotate/62f79d676e0e11b3ad59a5425b3ebb3ec5bbefb5/security/sandbox/chromium/sandbox/win/src/target_process.cc#l149
Comment 9•8 years ago
|
||
The reasoning behind adding this feature is understanding that this worker will be run on machines not managed by the TaskCluster team. As such, we need to expand the flexibility to match the environment that one chooses to run the worker in. This might include elevating privileges of the user used for tasks.
Comment 10•8 years ago
|
||
Following on from Geoffs look at access denied issues, I'm looking at a specific test failure.
https://treeherder.allizom.org/#/jobs?repo=try&revision=544ec1d97d4b4f066113eca3f764f96f1812e1dc&filter-tier=2&group_state=expanded&selectedJob=28218592
https://public-artifacts.taskcluster.net/JX9yYSRrR2-PHb4kfA97TQ/0/public/logs/live_backing.log
https://papertrailapp.com/systems/win7-i-0fa18b2c1d79d7690/events?focus=714927988487282692
The raw log from the test contains a lot of errors around access to paths on the c: drive. eg:
16:49:13 INFO - Failed to remove directory: c:\users\task_1474389326\appdata\local\temp\xpc-plugins-7f8dao. Waiting.
or:
17:55:55 INFO - "CONSOLE_MESSAGE: (info) 1474394154967 Toolkit.Telemetry TRACE TelemetryStorage::loadPingfile - unreadable ping c:\\users\\task_1474389326\\appdata\\local\\temp\\xpc-profile-zpettu\\datareporting\\aborted-session-ping: Win error 2 during operation open on file c:\\users\\task_1474389326\\appdata\\local\\temp\\xpc-profile-zpettu\\datareporting\\aborted-session-ping (The system cannot find the file specified.
I'm not 100% on this, but I hope it's a safe guess that the tests are using the environment configuration to determine where to look for (or create/delete) these files. Papertrail logs show that the generic worker provides environment configuration to the task:
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: 2016/09/20 16:35:35 Generating script:
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: 2016/09/20 16:35:35 :: This script runs command 2 defined in TaskId JX9yYSRrR2-PHb4kfA97TQ...
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: @echo off
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set ALLUSERSPROFILE=C:\ProgramData
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set APPDATA=C:\Users\task_1474389326\AppData\Roaming
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set ChocolateyInstall=C:\ProgramData\chocolatey
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set CommonProgramFiles=C:\Program Files\Common Files
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set COMPUTERNAME=I-0FA18B2C1D79D
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set ComSpec=C:\windows\system32\cmd.exe
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set DCLOCATION=SCL3
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set DNSSUFFIX=wintest.releng.scl3.mozilla.com
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set FP_NO_HOST_CHECK=NO
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set HOMEDRIVE=Z:
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set HOMEPATH=\task_1474389326
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set KTS_HOME=C:\Program Files\KTS
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set KTS_VERSION=1.19c
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set LOCALAPPDATA=C:\Users\task_1474389326\AppData\Local
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set MOZILLABUILD=C:\mozilla-build
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set NUMBER_OF_PROCESSORS=8
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set OS=Windows_NT
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set Path=C:\mozilla-build\7zip;C:\mozilla-build\info-zip;C:\mozilla-build\kdiff3;C:\mozilla-build\moztools-x64\bin;C:\mozilla-build\mozmake;C:\mozilla-build\msys\bin;C:\mozilla-build\msys\local\bin;C:\mozilla-build\nsis-3.0b3;C:\mozilla-build\nsis-2.46u;C:\mozilla-build\python;C:\mozilla-build\python\Scripts;C:\mozilla-build\upx391w;C:\mozilla-build\wget;C:\windows\system32;C:\windows;C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0\;C:\mozilla-build\python27;C:\mozilla-build\python27\Scripts;C:\mozilla-build\vim\vim72;C:\CoreUtils\bin;C:\mozilla-build\buildbotve\scripts;c:\mozilla-build\hg\;c:\Program Files\Microsoft Windows Performance Toolkit\;C:\Users\cltbld\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0;C:\ProgramData\chocolatey\bin;C:\Program Files\Puppet Labs\Puppet\bin;C:\mozilla-build\hg
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set PROCESSOR_ARCHITECTURE=x86
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set PROCESSOR_IDENTIFIER=x86 Family 6 Model 62 Stepping 4, GenuineIntel
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set PROCESSOR_LEVEL=6
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set PROCESSOR_REVISION=3e04
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set ProgramData=C:\ProgramData
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set ProgramFiles=C:\Program Files
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set PROMPT=$P$G
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set PSModulePath=C:\windows\system32\WindowsPowerShell\v1.0\Modules\;C:\Program Files\AWS Tools\PowerShell\
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set PUBLIC=C:\Users\Public
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set SystemDrive=C:
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set SystemRoot=C:\windows
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set tcexitcode=0
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set TEMP=C:\Users\task_1474389326\AppData\Local\Temp
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set test1=testie
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set TMP=C:\Users\task_1474389326\AppData\Local\Temp
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set USERDOMAIN=I-0FA18B2C1D79D
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set USERNAME=task_1474389326
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set USERPROFILE=C:\Users\task_1474389326
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set windir=C:\windows
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set windows_tracing_flags=3
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set windows_tracing_logfile=C:\BVTBin\Tests\installpackage\csilogfile.log
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: cd Z:\task_1474389326
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set errorlevel=
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: call Z:\task_1474389326\command_000002.bat 2>&1
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set tcexitcode=%errorlevel%
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: set > Z:\task_1474389326\env.txt
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: cd > Z:\task_1474389326\dir.txt
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: exit /b %tcexitcode%
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: 2016/09/20 16:35:35 Script "Z:\\task_1474389326\\command_000002.bat":
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: 2016/09/20 16:35:35 Contents:
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: 2016/09/20 16:35:35 @echo on
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: c:\mozilla-build\python\python.exe -u mozharness\scripts\desktop_unittest.py --cfg mozharness\configs\unittests\win_taskcluster_unittest.py --xpcshell-suite=xpcshell --no-read-buildbot-config --installer-url https://queue.taskcluster.net/v1/task/TNYaBfxDTWGI0hB0PRjHGw/artifacts/public/build/firefox-51.0a1.en-US.win32.zip --test-packages-url https://queue.taskcluster.net/v1/task/TNYaBfxDTWGI0hB0PRjHGw/artifacts/public/build/firefox-51.0a1.en-US.win32.test_packages.json --download-symbols true
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: @echo off
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker: 2016/09/20 16:35:35 Running command: 'Z:\task_1474389326\command_000002_wrapper.bat'
Some of the environment is correct (HOMEDRIVE, HOMEPATH), but a lot isn't (APPDATA, LOCALAPPDATA, TEMP, TMP, USERPROFILE).
I think that the generic-worker should be explicitly setting the values of APPDATA, LOCALAPPDATA, TEMP, TMP and USERPROFILE to directories under the users home path on the z: drive (or wherever the config states the homedir should be) and creating those directories if they don't exist. When we start to correct permissions for tests, it won't be helpful to have to make allowances for appdata and temp folders that are on the c: drive when they should be under the users home directory.
Comment 11•8 years ago
|
||
I am curious why z: vs c:? I see that we are using c:\users\task_xxx\..., to me I interpret that as 100% in the scope of the generic worker and should be fully readable/writable/executable?
Comment 12•8 years ago
|
||
g-w takes a config parameter to tell it under what directory it should create task user home directories. gecko-* windows worker types use z:\ because it is mapped to the ephemeral drive which is free, rather than the c:\users directory which costs money. note that the task work directory is correctly set to Z:\task_1474389326, where the work is all taking place. it's just the temp and appdata folders that are referencing c:\Users\task_1474389326
Comment 13•8 years ago
|
||
thanks for the explanation. To confirm, outside of c:\mozilla-build we should see no c:\ access in the log file?
outside of the above mentioned, I see some env vars set:
'WINDOWS_TRACING_LOGFILE': 'C:\\BVTBin\\Tests\\installpackage\\csilogfile.log'
'KTS_HOME': 'C:\\Program Files\\KTS'
'PUBLIC': 'C:\\Users\\Public'
'ALLUSERSPROFILE': 'C:\\ProgramData'
'CHOCOLATEYINSTALL': 'C:\\ProgramData\\chocolatey' <- who names a progam chocolatey ???
and in path, I see:
C:\\CoreUtils\\bin
C:\\Users\\cltbld\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0
C:\\ProgramData\\chocolatey\\bin
C:\\Program Files\\Puppet Labs\\Puppet\\bin
Comment 14•8 years ago
|
||
I'm not sure. It's one of the things I want to understand better through this migration as well. If we have tests reading from or writing to locations on the C: drive, I want to understand what locations. Using Z: gives us a simple mechanism for locking down permissions (if we want to). There might be cases where we explicitly grant access to specific paths on C: but we/I have yet to establish what and why.
Assignee | ||
Comment 15•8 years ago
|
||
Thanks for this Rob.
I suspect the problem lies somewhere deep under the covers of https://github.com/taskcluster/generic-worker/blob/7edf51d0f80732396412db68fbfe7d2bc188c7f0/plat_windows.go#L153
There is probably an assumption somewhere in the code that the Users parent directory is C:\Users - and we'll need to dig into this to understand where that comes from in the calling library (a hardcoded reference, something configured in the registry, an environment variable, etc).
Assignee | ||
Comment 16•8 years ago
|
||
Probably when the local user is created, something is determining where the home folder should be created, which is incorrectly using C:\Users instead of Z:\Users... I'll try to trace this through the various syscalls when the user is created and when the process is created, to find how the USERPROFILE is set (and associated variables).
It might be that the location is frozen to this location by the OS, but I suspect even if that was the case, I can probably adjust the registry entry/entries after user creation to relocate to the preferred configured location for the worker type.
There are a chain of syscalls, so I'll need to dig through all of them.
Comment 17•8 years ago
|
||
The log in comment 10 shows tests putting things in the temp directory. Specifically, that looks to be from an xpcshell test. The xpcshell harness just calls mkdtemp to create these:
https://dxr.mozilla.org/mozilla-central/rev/29beaebdfaccbdaeb4c1ee5a43a9795ab015ef49/testing/xpcshell/runxpcshelltests.py#364
Looking at the Python 2.7 source for mkdtemp:
http://svn.python.org/view/python/branches/release27-maint/Lib/tempfile.py?revision=82503&view=markup#l296
It looks like it uses _candidate_tempdir_list to find the tempdir:
http://svn.python.org/view/python/branches/release27-maint/Lib/tempfile.py?revision=82503&view=markup#l140
Its first choice is whatever's in the first of the $TMPDIR / $TEMP / $TMP environment variables.
Assignee | ||
Comment 18•8 years ago
|
||
Is this now resolved with the resolution of bug 1306988 (the "osGroups" feature in generic worker) and the workaround described in https://bugzilla.mozilla.org/show_bug.cgi?id=1306989#c3 ?
Flags: needinfo?(rthijssen)
Comment 19•8 years ago
|
||
this is not resolved. we still see the error but it is not due to problems in g-w. the ability to test different permission groups is helping with debugging and eliminating possible causes.
Assignee: nobody → rthijssen
Component: Generic-Worker → General Automation
Flags: needinfo?(rthijssen)
Product: Taskcluster → Release Engineering
QA Contact: catlee
Reporter | ||
Comment 20•8 years ago
|
||
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=029efe60b61e6ac8b9897defe87f9dcf79ef4f19, I repeated my experiment from comment 1, now using "osGroups" to join Administrators. Now whoami reports privileges on taskcluster Windows 7 jobs very similar to corresponding jobs in buildbot.
19:28:25 INFO - Copy/paste: whoami /all
19:28:25 INFO - USER INFORMATION
19:28:25 INFO - ----------------
19:28:25 INFO - User Name SID
19:28:25 INFO - =============================== ==============================================
19:28:25 INFO - i-0d0fbe76c0089\task_1477507637 S-1-5-21-2192203079-3471170106-2880594189-1005
19:28:25 INFO - GROUP INFORMATION
19:28:25 INFO - -----------------
19:28:25 INFO - Group Name Type SID Attributes
19:28:25 INFO - ==================================== ================ ============ ===============================================================
19:28:25 INFO - Everyone Well-known group S-1-1-0 Mandatory group, Enabled by default, Enabled group
19:28:25 INFO - BUILTIN\Users Alias S-1-5-32-545 Mandatory group, Enabled by default, Enabled group
19:28:25 INFO - BUILTIN\Remote Desktop Users Alias S-1-5-32-555 Mandatory group, Enabled by default, Enabled group
19:28:25 INFO - BUILTIN\Administrators Alias S-1-5-32-544 Mandatory group, Enabled by default, Enabled group, Group owner
19:28:25 INFO - NT AUTHORITY\INTERACTIVE Well-known group S-1-5-4 Mandatory group, Enabled by default, Enabled group
19:28:25 INFO - CONSOLE LOGON Well-known group S-1-2-1 Mandatory group, Enabled by default, Enabled group
19:28:25 INFO - NT AUTHORITY\Authenticated Users Well-known group S-1-5-11 Mandatory group, Enabled by default, Enabled group
19:28:25 INFO - NT AUTHORITY\This Organization Well-known group S-1-5-15 Mandatory group, Enabled by default, Enabled group
19:28:25 INFO - LOCAL Well-known group S-1-2-0 Mandatory group, Enabled by default, Enabled group
19:28:25 INFO - NT AUTHORITY\NTLM Authentication Well-known group S-1-5-64-10 Mandatory group, Enabled by default, Enabled group
19:28:25 INFO - Mandatory Label\High Mandatory Level Label S-1-16-12288 Mandatory group, Enabled by default, Enabled group
19:28:25 INFO - PRIVILEGES INFORMATION
19:28:25 INFO - ----------------------
19:28:25 INFO - Privilege Name Description State
19:28:25 INFO - =============================== ========================================= ========
19:28:25 INFO - SeIncreaseQuotaPrivilege Adjust memory quotas for a process Disabled
19:28:25 INFO - SeSecurityPrivilege Manage auditing and security log Disabled
19:28:25 INFO - SeTakeOwnershipPrivilege Take ownership of files or other objects Disabled
19:28:25 INFO - SeLoadDriverPrivilege Load and unload device drivers Disabled
19:28:25 INFO - SeSystemProfilePrivilege Profile system performance Disabled
19:28:25 INFO - SeSystemtimePrivilege Change the system time Disabled
19:28:25 INFO - SeProfileSingleProcessPrivilege Profile single process Disabled
19:28:25 INFO - SeIncreaseBasePriorityPrivilege Increase scheduling priority Disabled
19:28:25 INFO - SeCreatePagefilePrivilege Create a pagefile Disabled
19:28:25 INFO - SeBackupPrivilege Back up files and directories Disabled
19:28:25 INFO - SeRestorePrivilege Restore files and directories Disabled
19:28:25 INFO - SeShutdownPrivilege Shut down the system Disabled
19:28:25 INFO - SeDebugPrivilege Debug programs Disabled
19:28:25 INFO - SeSystemEnvironmentPrivilege Modify firmware environment values Disabled
19:28:25 INFO - SeChangeNotifyPrivilege Bypass traverse checking Enabled
19:28:25 INFO - SeRemoteShutdownPrivilege Force shutdown from a remote system Disabled
19:28:25 INFO - SeUndockPrivilege Remove computer from docking station Disabled
19:28:25 INFO - SeManageVolumePrivilege Perform volume maintenance tasks Disabled
19:28:25 INFO - SeImpersonatePrivilege Impersonate a client after authentication Enabled
19:28:25 INFO - SeCreateGlobalPrivilege Create global objects Enabled
19:28:25 INFO - SeIncreaseWorkingSetPrivilege Increase a process working set Disabled
19:28:25 INFO - SeTimeZonePrivilege Change the time zone Disabled
19:28:25 INFO - SeCreateSymbolicLinkPrivilege Create symbolic links Disabled
19:28:25 INFO - Return code: 0
That's great to see!
But, JOB_OBJECT_LIMIT_BREAKAWAY_OK is still not set in the limit flags, so CanCreateJobObject() still returns False, and tests still cannot create processes, as in Comment 0.
Reporter | ||
Comment 21•8 years ago
|
||
It is worth noting that JOB_OBJECT_LIMIT_BREAKAWAY_OK is also not set on Windows 10 -- but that does not cause any test failures. On Windows 6.2 (note that our Windows 7 is 6.1.7601) and above, nested jobs can be used. In terms of our python code, see for instance:
https://dxr.mozilla.org/mozilla-central/rev/f9f3cc95d7282f1fd83f66dd74acbcdbfe821915/testing/mozbase/mozprocess/mozprocess/processhandler.py#267 and
https://dxr.mozilla.org/mozilla-central/rev/f9f3cc95d7282f1fd83f66dd74acbcdbfe821915/testing/mozbase/mozprocess/mozprocess/processhandler.py#366
On Windows 10, we do not have can_create_jobs (because JOB_OBJECT_LIMIT_BREAKAWAY_OK is missing) but we can_nest_jobs, so no warning is printed (and future operations succeed).
On Windows 7, we do not have either can_create_jobs nor can_nest_jobs (and there is no hope of getting can_nest_jobs on Windows 7).
Comment 22•8 years ago
|
||
so how do we successfully run tests on the buildbot AMIs?
Comment 23•8 years ago
|
||
Pete, I am not sure where this problem is defined at, but if you could log into a machine and run the command manually from a command prompt, that would help us determine if this is a generic-worker.exe issue or an AMI issue.
I did poke around in the generic worker code, and was not able to determine how we launch a process. I am not familiar with GO as a programming language and was not in a position to learn the differences of GO.
In reading Microsoft's CreateProcess() documentation it looks like we need to create a job group with the proper flags set, then associate the processes with that. Once flags are set properly for the job, then any CreateProcess calls from a process in the job group will inherit the proper flags. If that is the case, then it is safe to assume that we are launching the worker incorrectly, or doing something in process creation from the worker to remove/restrict the process/job flags.
As it stands this is blocking us from doing any work in windows 7. I would not trust any results we get on what few jobs might pass now until this is resolved as it will change the process scope running for all tasks.
Flags: needinfo?(pmoore)
Assignee | ||
Comment 24•8 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #23)
> Pete, I am not sure where this problem is defined at, but if you could log
> into a machine and run the command manually from a command prompt, that
> would help us determine if this is a generic-worker.exe issue or an AMI
> issue.
Sure - I can take a look at this on Monday. Unfortunately it's 9pm on Friday night now.
>
> I did poke around in the generic worker code, and was not able to determine
> how we launch a process. I am not familiar with GO as a programming
> language and was not in a position to learn the differences of GO.
See https://github.com/taskcluster/generic-worker/blob/c61c8950947e1f62f078599210ea828060389f44/syscall/exec_windows.go#L19
>
> In reading Microsoft's CreateProcess() documentation it looks like we need
> to create a job group with the proper flags set, then associate the
> processes with that. Once flags are set properly for the job, then any
> CreateProcess calls from a process in the job group will inherit the proper
> flags. If that is the case, then it is safe to assume that we are launching
> the worker incorrectly, or doing something in process creation from the
> worker to remove/restrict the process/job flags.
>
> As it stands this is blocking us from doing any work in windows 7. I would
> not trust any results we get on what few jobs might pass now until this is
> resolved as it will change the process scope running for all tasks.
Let's continue working together to solve this problem, as none of us knows the source of the problem yet. I'll let you know how I get on on Monday; please also update the bug if you make any progress.
Flags: needinfo?(pmoore)
Comment 25•8 years ago
|
||
this can easily wait until Monday- lets spend time on Monday working on this together- possibly we can narrow down the root cause- I have some time in your afternoon :)
Reporter | ||
Comment 26•8 years ago
|
||
MSDN's "Job Objects" page has some possible helpful info: https://msdn.microsoft.com/en-us/library/windows/desktop/ms684161(v=vs.85).aspx. I'm thinking we might be able to QueryInformationJobObject(NULL, ...) to get our current process' job handle, then use SetInformationJobObject() to set the limits we need on our job -- but that's very speculative and I'm no expert in this area.
Assignee | ||
Comment 27•8 years ago
|
||
Let's go through this together this evening.
I'll be running some job errands this afternoon, which will also mean we have more crossover time this evening.
Thanks Geoff for the msdn link - that page is really useful - indeed, it looks like I should be able to set some process options when I spawn the process, so I'm sure we can get this resolved swiftly.
Comment 28•8 years ago
|
||
logging into a machine, I was able to confirm a few things:
1) running the raw commands from a regular admin cmd.exe prompt, they run fast, either on c:\ or z:\
2) I can reproduce the problem from the generic-worker.exe (I hack a command prompt), but cannot reproduce from a cmd.exe
3) I created a new scheduled task to mirror the settings of the generic worker and instead ran cmd.exe- I was able run processes normally.
Basically we have the condition where when running jobs from the generic-worker.exe we are running inside a job object. When we run from a regular shell or a scheduled task shell, we are not in a job object.
This is pointing towards that of the generic-worker.exe process launching needs to run without a jobObject, or adjust flags. There is a good chance that the same fix will also remedy the performance issues we are seeing.
in going through this, it because evident that we could quickly iterate on a loaner box by getting access to the AMI, then relaunching the scheduled task with new workers.
One other inconsistency that I saw in the scheduled task definition [1] was that we have the user as |GenericWorker|, but for manually created tasks they are |i-<nodeid>\GenericWorker|. I don't think this is a problem since importing a similar task but with just cmd.exe works as expected.
[1] https://pastebin.mozilla.org/8923714
Comment 29•8 years ago
|
||
python script to validate if we are in a job object or not
Comment 30•8 years ago
|
||
my attempt at writing a go program so it will launch a cmd.exe shell to run |python jobobject.py| from either createprocess or createprocesswithlogon.
When I use createprocess, we are not in a jobobject
When I use createprocesswithlogon, we are IN a jobobject
I would expect us to ensure that we pass our tests (ideally NOT IN a job object) prior to calling this fixed.
:pmoore, let me know if you need more information.
Flags: needinfo?(pmoore)
Assignee | ||
Comment 31•8 years ago
|
||
It looks like CreateProcessAsUser[1] system call might avoid the job creation - I'm going to have a play swapping out CreateProcessWithLogonW[2] with CreateProcessAsUser.
[1] https://msdn.microsoft.com/en-us/library/windows/desktop/ms682429(v=vs.85).aspx
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/ms682431(v=vs.85).aspx
This is based on an optimistic interpretation of https://github.com/contester/runlib/blob/e1c5f73778c7081e9ce20b930999d03f51e78f4d/subprocess/subprocess_windows.go#L140 and https://github.com/contester/runlib/blob/e1c5f73778c7081e9ce20b930999d03f51e78f4d/subprocess/subprocess_windows.go#L166-L195 whereby I assume that sub.NoJob is set to true if a job already exists (and therefore no new job should be created).
Flags: needinfo?(pmoore)
Assignee | ||
Comment 32•8 years ago
|
||
I have a build currently running (7.0.0alpha1) using CreateProcessAsUser instead of CreateProcessWithLogonW.
I will test this when I've set up a win7 environment to use it.
----------------------
In addition, I discovered the following, which supports the hypothesis that moving to CreateProcessAsUser is the way to go for Windows 7:
http://stackoverflow.com/questions/27339341/job-control-with-createprocesswithlogonw
--------------
Interactions with Win32 Job Objects
CreateProcessWithLogonW executes the new process as a child of the Secondary Logon service, which has the outcome of making the process escape any Job Object membership/restrictions even if the Job Object did not allow breakaway.
Furthermore, the Secondary Logon service automatically creates its own new Job Object and assigns the new process into it. As such, it is not possible for the caller to explicitly assign the new process to any other Job Object (since a process may only be assigned to one Job Object, and can never be removed from a Job Object once it has been assigned to one).
Assignee | ||
Comment 33•8 years ago
|
||
Blast - will need to debug this....
2016/11/02 15:39:27 Running command: 'cmd' '/c' 'del' '/q' '/f' 'Z:\task_1478100703'
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x10 pc=0x75e0d2]
goroutine 1 [running]:
panic(0x7bb080, 0x11e92030)
/home/travis/go/src/runtime/panic.go:500 +0x331
github.com/taskcluster/generic-worker/syscall.StartProcess(0x12047e00, 0x1b, 0x11ff3a40, 0x6, 0x6, 0x11f59838, 0x11fe4cc0, 0xf, 0x11ff1a40, 0x1d, ...)
/home/travis/gopath/src/github.com/taskcluster/generic-worker/syscall/exec_windows.go:144 +0x6f2
github.com/taskcluster/generic-worker/os.startProcess(0x12047e00, 0x1b, 0x11ff3a40, 0x6, 0x6, 0x11f5998c, 0x11fe4cc0, 0xf, 0x11ff1a40, 0x1d, ...)
/home/travis/gopath/src/github.com/taskcluster/generic-worker/os/exec_windows.go:41 +0x2d3
github.com/taskcluster/generic-worker/os.StartProcess(0x12047e00, 0x1b, 0x11ff3a40, 0x6, 0x6, 0x11f5998c, 0x11fe4cc0, 0xf, 0x11ff1a40, 0x1d, ...)
/home/travis/gopath/src/github.com/taskcluster/generic-worker/os/doc_windows.go:17 +0x7f
github.com/taskcluster/generic-worker/os/exec.(*Cmd).Start(0x11eb7080, 0x0, 0x0)
/home/travis/gopath/src/github.com/taskcluster/generic-worker/os/exec/exec_windows.go:153 +0x620
github.com/taskcluster/generic-worker/os/exec.(*Cmd).Run(0x11eb7080, 0x0, 0x0)
/home/travis/gopath/src/github.com/taskcluster/generic-worker/os/exec/exec_windows.go:354 +0x2b
main.runCommands(0x0, 0x11fe4cc0, 0xf, 0x11ff1a40, 0x1d, 0x11f59b20, 0x1, 0x1, 0x0, 0x0)
/home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:547 +0x2cf
main.deleteHomeDir(0x11ff19a0, 0x12, 0x11fe4cc0, 0xf, 0x0, 0x0)
/home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:85 +0x4b7
main.deleteHomeDirs()
/home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:219 +0x584
main.deleteExistingOSUsers()
/home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:186 +0x1e
main.taskCleanup(0x0, 0x0)
/home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:383 +0xf8
main.startup(0x0, 0x0)
/home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:68 +0x9e
main.runWorker()
/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:393 +0x38
main.main()
/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:285 +0x648
Assignee | ||
Comment 34•8 years ago
|
||
The above issue was fixed relatively easily, but now we have a tricky problem....
-------------------------------------------------------
2016/11/03 12:33:48 Running command: 'cmd' '/c' 'md' 'Z:\task_1478176427\public\logs'
username: "task_1478176427"
password: "pWd0_KDzLh3SFtTIlI7pa6xjyKizo"
LOGON_WITH_PROFILE: '\x01'
argv0p: %!q(*uint16=0x1202ad00)
argvp: %!q(*uint16=0x120088a0)
flags: 'Ѐ'
attr.Env: ["=C:=C:\\gopath\\src\\github.com\\taskcluster\\generic-worker" "=ExitCode=00000000" "ALLUSERSPROFILE=C:\\ProgramData" "APPDATA=C:\\Users\\GenericWorker\\AppData\\Roaming" "CLIENTNAME=Petes-iMac.loca" "CommonProgramFiles=C:\\Program Files\\Common Files" "COMPUTERNAME=I-00107EBFEFA9C" "ComSpec=C:\\windows\\system32\\cmd.exe" "DCLOCATION=SCL3" "DNSSUFFIX=use1.mozilla.com" "FP_NO_HOST_CHECK=NO" "GOPATH=C:\\gopath" "GOROOT=C:\\go" "HOMEDRIVE=C:" "HOMEPATH=\\Users\\GenericWorker" "KTS_HOME=C:\\Program Files\\KTS" "KTS_VERSION=1.19c" "LOCALAPPDATA=C:\\Users\\GenericWorker\\AppData\\Local" "LOGONSERVER=\\\\I-00107EBFEFA9C" "MOZILLABUILD=C:\\mozilla-build" "NUMBER_OF_PROCESSORS=8" "OS=Windows_NT" "Path=C:\\Program Files\\Windows Resource Kits\\Tools\\;C:\\Program Files\\Mercurial;C:\\mozilla-build\\7zip;C:\\mozilla-build\\info-zip;C:\\mozilla-build\\kdiff3;C:\\mozilla-build\\moztools-x64\\bin;C:\\mozilla-build\\mozmake;C:\\mozilla-build\\msys\\bin;C:\\mozilla-build\\msys\\local\\bin;C:\\mozilla-build\\nsis-3.0b3;C:\\mozilla-build\\nsis-2.46u;C:\\mozilla-build\\python;C:\\mozilla-build\\python\\Scripts;C:\\mozilla-build\\upx391w;C:\\mozilla-build\\wget;C:\\mozilla-build\\yasm;C:\\windows\\system32;C:\\windows;C:\\windows\\System32\\Wbem;C:\\windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\mozilla-build\\python27;C:\\mozilla-build\\python27\\Scripts;C:\\mozilla-build\\vim\\vim72;C:\\CoreUtils\\bin;C:\\mozilla-build\\buildbotve\\scripts;c:\\Program Files\\Microsoft Windows Performance Toolkit\\;C:\\Users\\cltbld\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0;C:\\ProgramData\\chocolatey\\bin;C:\\Program Files\\Puppet Labs\\Puppet\\bin;C:\\mozilla-build\\hg" "PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC" "PIP_DOWNLOAD_CACHE=y:\\pip-cache" "PROCESSOR_ARCHITECTURE=x86" "PROCESSOR_IDENTIFIER=x86 Family 6 Model 62 Stepping 4, GenuineIntel" "PROCESSOR_LEVEL=6" "PROCESSOR_REVISION=3e04" "ProgramData=C:\\ProgramData" "ProgramFiles=C:\\Program Files" "PROMPT=$P$G" "PSModulePath=C:\\windows\\system32\\WindowsPowerShell\\v1.0\\Modules\\;C:\\Program Files\\AWS Tools\\PowerShell\\" "PUBLIC=C:\\Users\\Public" "SESSIONNAME=RDP-Tcp#0" "SystemDrive=C:" "SystemRoot=C:\\windows" "TEMP=C:\\Users\\GenericWorker\\AppData\\Local\\Temp" "TMP=C:\\Users\\GenericWorker\\AppData\\Local\\Temp" "TOOLTOOL_CACHE=y:\\tooltool-cache" "USERDOMAIN=I-00107EBFEFA9C" "USERNAME=GenericWorker" "USERPROFILE=C:\\Users\\GenericWorker" "windir=C:\\windows" "windows_tracing_flags=3" "windows_tracing_logfile=C:\\BVTBin\\Tests\\installpackage\\csilogfile.log"]
dirp: %!q(*uint16=<nil>)
si: &{'D' %!q(*uint16=<nil>) %!q(*uint16=<nil>) %!q(*uint16=<nil>) '\x00' '\x00' '\x00' '\x00' '\x00' '\x00' '\x00' 'Ā' '\x00' '\x00' %!q(*uint8=<nil>) 'Ɛ' '\x0f' '\x13'}
pi: &{'\x00' '\x00' '\x00' '\x00'}
2016/11/03 12:33:48 fork/exec C:\windows\system32\cmd.exe: A required privilege is not held by the client.
-------------------------------------------------------
I installed Windows Server 2003 Resource Kit Tools from https://www.microsoft.com/en-us/download/confirmation.aspx?id=17657 and ran:
C:\>ntrights -u GenericWorker +r SeIncreaseQuotaPrivilege
Granting SeIncreaseQuotaPrivilege to GenericWorker ... successful
C:\>ntrights -u GenericWorker +r SeAssignPrimaryTokenPrivilege
Granting SeAssignPrimaryTokenPrivilege to GenericWorker ... successful
C:\>
However, that also did not fix it. Some reasons I can think of that might be the cause of this:
1) There are other privileges needed
2) ntrights technically isn't supported on windows 7, so maybe it didn't work even though output suggested it did (unlikely)
3) The task_* user accounts need the privilege(s) rather than the GenericWorker account
4) The user accounts are ok, but the access token retrieved in the login system call requires the privileges
5) The generic-worker.exe process also requires the privileges, which were given to GenericWorker user that launched the process
The contester library https://github.com/contester/runlib doesn't seem to refer explicitly to these privileges, and presumably works, so this leads me to suspect this is more likely related to how the environment is set up, rather than requiring runtime manipulation of policy.
This is a classic case of an error message not describing the instance data it relates to (i.e. not mentioning which privilege is not held by which client). :(
Assignee | ||
Comment 35•8 years ago
|
||
I've managed to reproduce this issue when using the runexe library directly (see https://github.com/contester/runlib):
C:\>runexe -l mozilla -p qwe12345!@#$% C:\Windows\System32\notepad.exe
Invocation failed: Program
Comment: CreateFrozen/CreateProcessAsUser/syscall:CreateProcessAsUser: A required privilege is not held by the client.
The runlib/runexe tool uses the same runlib/subprocess library that we are using to manage process execution. I've therefore contacted the author of runlib to ask if he knows what the problem might be. The runlib library is used for testing code submissions on http://codeforces.com/ so is known to work - so hopefully the resolution is just granting some permissions somewhere, to some object in the environment, in order to get it working.
In parallel, I will see if I can grant the local security policy privileges by some other means than using Windows Server 2003 Resource Kit, in case the problem is running this on Windows 7 (remote test worker) / Windows Server 2008 (local development VM).
Assignee | ||
Comment 36•8 years ago
|
||
Note, this is what I've already tried, which hasn't fixed it so far. "peter james moore" is the admin account that is running runexe, "mozilla" is the test account that the notepad.exe process should be executed under (from the example in comment 35).
C:\Windows\system32>ntrights -u "peter james moore" +r SeIncreaseQuotaPrivilege
Granting SeIncreaseQuotaPrivilege to peter james moore ... successful
C:\Windows\system32>ntrights -u "peter james moore" +r SeAssignPrimaryTokenPrivilege
Granting SeAssignPrimaryTokenPrivilege to peter james moore ... successful
C:\Windows\system32>ntrights -u mozilla +r SeIncreaseQuotaPrivilege
Granting SeIncreaseQuotaPrivilege to mozilla ... successful
C:\Windows\system32>ntrights -u mozilla +r SeAssignPrimaryTokenPrivilege
Granting SeAssignPrimaryTokenPrivilege to mozilla … successful
C:\Windows\system32>wmic useraccount where name='mozilla' set passwordexpires=false
Updating property(s) of '\\WIN-7LN26LS103P\ROOT\CIMV2:Win32_UserAccount.Domain="WIN-7LN26LS103P",Name="mozilla"'
Property(s) update successful.
C:\Windows\system32>net localgroup "Remote Desktop Users" /add "peter james moore"
The command completed successfully.
Assignee | ||
Comment 37•8 years ago
|
||
Agggghhhhhhhhhhhh
A restart fixed the runexe problem!!!!
Let me see if it fixes the problem in generic worker too.............
Assignee | ||
Updated•8 years ago
|
Assignee: rthijssen → pmoore
Assignee | ||
Comment 38•8 years ago
|
||
Now I'm closer - getting
2016/11/03 19:17:18 ERROR encountered: Task not successful due to following exception(s):
Exception 1)
exit status -1073741502
I believe this is "STATUS_DLL_INIT_FAILED" but I'll need to look into this tomorrow.
Assignee | ||
Comment 39•8 years ago
|
||
Looks like this page might help me:
https://blogs.msdn.microsoft.com/winsdk/2015/06/03/what-is-up-with-the-application-failed-to-initialize-properly-0xc0000142-error/
Assignee | ||
Comment 40•8 years ago
|
||
So ........
It turns out that using the mechanics of the runlib library for also creating desktop, windows station etc (not just logging in and loading user profile) will resolve the problem.
I've spent considerable time working my way through the entire code base to understand all these additional parts, and have reached the conclusion that the best approach would be to use higher level functions provided by this library that take care of these concerns under the hood. Fortunately the library is very well written (if not very well documented) such that the code for running processes under different user accounts is cleanly separated from the parts which take care of testing software submissions for software contests. However, this still requires quite a bit of rework in the generic worker to abstract the concepts around execution of commands, resulting processes, and results from completed processes, since runlib is only designed for windows/linux and has no darwin support. This isn't too complicated, but requires a bit of work. In any case, this was in need of doing anyway, as it wasn't very well architected as it was.
Advantages that this approach gives us:
* we should now have separate desktops for each task - this provides a framework for building the "interactive" feature on top of, for windows loaners
* we can potentially run the generic worker as a service, which is cleaner on many counts (and has profound security benefits)
* it makes having a native engine on linux with separate user accounts a step closer
* we can differentiate between wall time / kernel time / user time when running processes, which means regardless of other load on a machine, we can get accurate data for performance of a task
* we have the possibility to add new features to generic worker to take advantage of the runlib features such as:
o limit memory usage
o restrict UI access
o manage process affinity
o apply limits for process user time / kernel time
o inject DLL
o pause processes
Regardless, solving this bug should also solve the performance problems we've been witnessing, as a result of running in a job group.
Comment 41•8 years ago
|
||
this sounds freaking awesome \o/
Assignee | ||
Comment 42•8 years ago
|
||
So I'm pretty close! I finished writing the code, now I am just debugging it / fixing it.
Have a problem with the stdout/stderr/stdin management which I haven't quite got to the bottom of yet. In runlib they are files, rather than io.Reader/io.Writer objects. This shouldn't be too much of a problem, but it isn't quite working yet - so I'll have to part this until Monday morning.
But code is implemented, so just a case of debugging. This is all going on the createprocessasuser git branch of generic worker...
Assignee | ||
Comment 43•8 years ago
|
||
The unit/integration tests are all passing now!
https://github.com/taskcluster/generic-worker/pull/28
I then manually tested by deploying on ami-test-win2012r2, but discovered there is still a permissions problem:
Task log:
[taskcluster 2016-11-06T12:50:01.162Z] goroutine 1 [running]:
[taskcluster 2016-11-06T12:50:01.162Z] runtime/debug.Stack(0xc04233b860, 0x7f23c0, 0xc0421221e0)
[taskcluster 2016-11-06T12:50:01.162Z] C:/Go/src/runtime/debug/stack.go:24 +0x80
[taskcluster 2016-11-06T12:50:01.162Z] main.(*TaskRun).run.func2(0xc042214c18, 0xc0422a5000, 0xa46460, 0xc042214ca8)
[taskcluster 2016-11-06T12:50:01.162Z] C:/gopath/src/github.com/taskcluster/generic-worker/main.go:1024 +0x25e
[taskcluster 2016-11-06T12:50:01.162Z] panic(0x7f23c0, 0xc0421221e0)
[taskcluster 2016-11-06T12:50:01.162Z] C:/Go/src/runtime/panic.go:458 +0x251
[taskcluster 2016-11-06T12:50:01.162Z] main.(*TaskRun).run(0xc0422a5000, 0x0)
[taskcluster 2016-11-06T12:50:01.162Z] C:/gopath/src/github.com/taskcluster/generic-worker/main.go:1045 +0x1738
[taskcluster 2016-11-06T12:50:01.162Z] main.FindAndRunTask(0x3b9aca00)
[taskcluster 2016-11-06T12:50:01.162Z] C:/gopath/src/github.com/taskcluster/generic-worker/main.go:517 +0x235
[taskcluster 2016-11-06T12:50:01.162Z] main.runWorker()
[taskcluster 2016-11-06T12:50:01.162Z] C:/gopath/src/github.com/taskcluster/generic-worker/main.go:447 +0x43b
[taskcluster 2016-11-06T12:50:01.162Z] main.main()
[taskcluster 2016-11-06T12:50:01.162Z] C:/gopath/src/github.com/taskcluster/generic-worker/main.go:284 +0x7fa
[taskcluster 2016-11-06T12:50:01.162Z]
[taskcluster 2016-11-06T12:50:01.162Z] &os.SyscallError{Syscall:"CreateDesktop", Err:0x5}
Worker log:
2016/11/06 13:53:12 Shutting down immediately - panic occurred!
2016/11/06 13:53:12 goroutine 1 [running]:
runtime/debug.Stack(0x87efd1, 0x2b, 0x0)
C:/Go/src/runtime/debug/stack.go:24 +0x80
main.runWorker.func1()
C:/gopath/src/github.com/taskcluster/generic-worker/main.go:407 +0x95
panic(0x7f23c0, 0xc042146920)
C:/Go/src/runtime/panic.go:458 +0x251
main.(*TaskRun).run.func1(0xc04201e0b8, 0xc042403000)
C:/gopath/src/github.com/taskcluster/generic-worker/main.go:1012 +0x202
panic(0x7f23c0, 0xc042146920)
C:/Go/src/runtime/panic.go:458 +0x251
main.(*TaskRun).run.func2(0xc04201e0b8, 0xc042403000, 0xa46460, 0xc04201e128)
C:/gopath/src/github.com/taskcluster/generic-worker/main.go:1030 +0x132
panic(0x7f23c0, 0xc042146920)
C:/Go/src/runtime/panic.go:458 +0x251
main.(*TaskRun).run(0xc042403000, 0x0)
C:/gopath/src/github.com/taskcluster/generic-worker/main.go:1045 +0x1738
main.FindAndRunTask(0x3b9aca00)
C:/gopath/src/github.com/taskcluster/generic-worker/main.go:517 +0x235
main.runWorker()
C:/gopath/src/github.com/taskcluster/generic-worker/main.go:447 +0x43b
main.main()
C:/gopath/src/github.com/taskcluster/generic-worker/main.go:284 +0x7fa
2016/11/06 13:53:12 Cause: CreateDesktop: Access is denied.
I'll have a look into this on Monday. In the morning I have an appointment with the city council, so will be starting a bit later.
Note - the unit tests run everything as current user (since unit tests run without admin privileges) which is why the unit tests/integration tests didn't pick this up. Anyway, hope to get to the bottom of this on Monday, then I can make a new release.
Assignee | ||
Comment 44•8 years ago
|
||
So, unbelievably, I have it working!!!
I can locally run the generic worker on a local VM, it is now using the runlib library to handle desktop/session creation and process execution, and ... works!
In order to have this running in production, tomorrow I just need to:
1) automate assigning required local security privileges to GenericWorker user during AMI creation for all the windows aws provisioner worker types (both relops-managed OpenCloudConfig worker types and taskcluster internally managed worker types)
2) minor task logging fixes (to better show what commands are being executed, and give useful feedback after commands complete)
3) make a new release, and roll it out to the same worker types that were tweaked in step 1 to have extra required permissions
At a point in the future (in other bugs) we can look at running generic worker as a service rather than a login item, and also implementing interactive feature for instant loaners, and even maybe adding more controls around processes (user time / kernel time timeouts, memory caps, etc)....
Assignee | ||
Comment 45•8 years ago
|
||
I made release 7.0.0alpha 4 and am currently rolling it out on ami-test-win2012r2 worker type for manual testing, before rolling out globally...
Assignee | ||
Comment 46•8 years ago
|
||
I'm hitting a problem in production I'm not hitting locally, and am not sure of the cause yet.
I've made a release 7.0.0alpha5 with some more debugging info.
Very strange.
Locally I'm running Windows Server 2008 R2 - the remote test worker is Windows Server 2012 R2 - I don't expect that to be the reason, but I'm not sure yet.
The worker is not able to take a task, I'm getting
2016/11/08 19:25:28 Running command: 'wmic useraccount where name='task_1478633128' set passwordexpires=false'
panic: CreateFrozen/CreateProcess: Access is denied.
goroutine 1 [running]:
panic(0x7edfa0, 0xc0420064e0)
C:/go/src/runtime/panic.go:500 +0x1af
main.runCommands(0x0, 0x0, 0xc0421419a8, 0x2, 0x2, 0x86ad73, 0x1)
C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:548 +0x473
main.(*OSUser).createOSUserAccountForce(0xc042321c80, 0x836400, 0xc04234ba01, 0xc042321c80)
C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:148 +0x3ae
main.(*OSUser).createNewOSUser(0xc042321c80, 0xc042321c80, 0x2)
C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:130 +0x37
main.createNewTaskUser(0x1d, 0x31)
C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:106 +0x220
main.taskCleanup(0xc042141c28, 0x1)
C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:371 +0x113
main.startup(0x0, 0x8ef)
C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:70 +0xab
main.runWorker()
C:/gopath/src/github.com/taskcluster/generic-worker/main.go:392 +0x46
main.main()
C:/gopath/src/github.com/taskcluster/generic-worker/main.go:284 +0x7fa
It seems some more permissions are needed, which is a little baffling, but I'm sure I'll be able to get to the bottom of it. A little frustrating, but since it is working locally, we can't be far from success now.
Assignee | ||
Comment 47•8 years ago
|
||
I've fixed the problem, I believe - but making a new release to be absolutely sure.
My hypothesis is that, during AMI setup, there are some process / start up flag(s) in the runlib library[1] that assume either a graphical environment or impose some other requirement, which are not required/assumed by the go standard library call to CreateProcess[2]. Therefore, during *installation* of generic worker, I have reverted to using the standard go library for executing processes. I now only use the runlib library for executing task processes while the worker is running. This seems to have resolved the problem.
Whether it is related to the execution context of the bootstrapping process having a graphical context or not, the problem is most likely that the AMI bootstrapping process is executed under a system/headless context, which seems not to play nicely when invoking the subprocess functions of the runlib library directly - therefore switching seems the most convenient and simplest solution. It is quite likely the runlib library would also be capable of executing the installation processes, given permission tweaks, or adjusted library calls, but the investment of investigating the root cause seems not to be worth the effort considering we have a workable solution using the standard library for these calls.
To reiterate, the problem was about executing processes during the *installation* of generic worker on the "golden image" (generic worker is a self-installing executable), *not* the execution or processes of the generic worker when running tasks on a worker.
[1] https://github.com/contester/runlib/blob/e1c5f73778c7081e9ce20b930999d03f51e78f4d/subprocess/subprocess_windows.go#L198-L210
[2] https://github.com/golang/go/blob/410974497365b2e6e23811dbc58475b00c69d0db/src/syscall/exec_windows.go#L324
Assignee | ||
Comment 48•8 years ago
|
||
Here is a production job that ran with the new worker (in this case for NSS).
The only strange thing I see is that the reported User Time / Kernel Time seems to be a bit off:
https://tools.taskcluster.net/task-inspector/#RsFuem4OTNipJ1H1TMeg2A/0
I'm considering pulling out the output of User Time / Kernel Time until it is understood why it is reporting strange results.
We explicitly avoid creating a job group - i'm wondering if this has a consequence that user time / kernel time is only tracked in the initiating process, rather than any sub processes created. This might explain why the numbers are so low.
Another option we have is to make the creation of a job group an attribute of the task payload, such that you choose either to opt in or opt out (not sure if "in" or "out" is best for default). If the wonky times are a consequence of explicitly not being in a job group, it might make sense to create a job group by default, and only opt out when it is known a job group is not allowed (such as the tests that caused this bug in the first place).
Maybe for now I'll roll out as is, and we can discuss / investigate in the coming days/weeks and decide how to proceed.
Assignee | ||
Comment 49•8 years ago
|
||
While I'm waiting for the AMI generation process for ami-test-win2012r2 worker type, I'll see if I can upgrade a win7 machine to use the new worker, and rerun an old task, such as this one:
https://queue.taskcluster.net/v1/task/DvCA9fJdQSWNuZemCVh_HQ
Assignee | ||
Comment 50•8 years ago
|
||
So I hijacked a running machine, killed the running worker, upgraded the binary to v7.0.0alpha7, changed the provisionerId (so another machine wouldn't take the required job), upgraded the credentials (since with the new provisionerId, the old ones weren't enough), changed the run script to not use --configure-for-aws (and thereby undo my config change), and manually ran the adjusted run-generic-worker.bat script (rather than rebooting, otherwise DSC would reset the generic worker binary) and then manually touched the dsc flag file C:\dsc\task-claim-state.valid to trigger the updated worker to start.
Then I modified the task from comment 49 to have updated timestamps and the altered provisionerId, and triggered this task run:
https://tools.taskcluster.net/task-inspector/#bdpo2BhkQZOt6G6ja6eucQ/0
Assignee | ||
Comment 51•8 years ago
|
||
The test failed due to screen resolution or mouse pointer location issues - no doubt a consequence of me being logged in.
However, it didn't fail because of being in a job group, yippee!!
In any case, I will submit a PR to OpenCloudConfig to pick this up and build new win7/win10 amis.
If that all goes well, I'll submit a second one to update win2012 amis (builders).
Assignee | ||
Comment 52•8 years ago
|
||
NSS job running against latest ami-test-win2012r2 image: https://tools.taskcluster.net/task-inspector/#fJflHoN-TMOUR1gdYC63pg/0
Assignee | ||
Comment 53•8 years ago
|
||
This is a huge change, but has been tested in production using a test worker type and making an alpha release or two (the most recent of which was 7.0.0alpha7).
This is currently blocking testing - I'd propose we roll out only on win7/win10 machines initially, as testing is currently blocked, so we can't lose anything.
Thanks!
Pete
Attachment #8808987 -
Flags: review?(wcosta)
Comment 54•8 years ago
|
||
Commits pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/509528aff8fa9da56e4fc3a0b333a5149f4c87cb
Bug 1303455 - Use runlib library for process management on windows
https://github.com/taskcluster/generic-worker/commit/7f6a71f0b6f05d3de37396e3b0fc77dc646a26ae
Bug 1303455 - Use runlib library for process management on windows
Assignee | ||
Comment 55•8 years ago
|
||
:grenade currently on PTO so assigning to rail.
Thanks!
Attachment #8809007 -
Flags: review?(rail)
Comment 56•8 years ago
|
||
Comment on attachment 8809007 [details]
Github Pull Request for OpenCloudConfig
stamp
Attachment #8809007 -
Flags: review?(rail) → review+
Updated•8 years ago
|
Attachment #8808987 -
Flags: review?(wcosta) → review+
Assignee | ||
Comment 57•8 years ago
|
||
I'm heading off shortly, but please be aware, when https://tools.taskcluster.net/push-inspector/#/b8YWY5zaT52XeBmMUog85g?_k=6hjtgc completes, the win 7 / win 10 worker types will be available.
It can take up to 30 mins for the older workers to terminate, so after 30 mins of this task graph completing, you should be able to resubmit a task and it will pick up the new worker.
You will know it worked, if at the top of the log file, it says the generic worker version is 7.0.0.
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
Reporter | ||
Comment 58•8 years ago
|
||
Wonderful! I will follow-up with some tests today.
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
Reporter | ||
Comment 59•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d46c58f0d8f7995438feae255fc749e479c08fa4 appears to use generic worker version 7.0.0 for most (maybe all) tests, but the same process launching problems are evident.
https://public-artifacts.taskcluster.net/O5Lu--5DTjS3yNF3fugtpQ/0/public/logs/live_backing.log
[taskcluster 2016-11-09T18:34:15.725Z] "generic-worker": {
[taskcluster 2016-11-09T18:34:15.725Z] "go-arch": "386",
[taskcluster 2016-11-09T18:34:15.725Z] "go-os": "windows",
[taskcluster 2016-11-09T18:34:15.725Z] "go-version": "go1.7.3",
[taskcluster 2016-11-09T18:34:15.725Z] "release": "https://github.com/taskcluster/generic-worker/releases/tag/v7.0.0",
[taskcluster 2016-11-09T18:34:15.725Z] "version": "7.0.0"
ProcessManager UNABLE to use job objects to manage child processes
ProcessManager NOT managing child processes
19:16:53 INFO - [Parent 3416] WARNING: Failed to launch tab subprocess: file z:/task_1478711748/build/src/ipc/glue/GeckoChildProcessHost.cpp, line 576
19:16:53 INFO - [Parent 3416] WARNING: Failed to launch tab subprocess: file z:/task_1478711748/build/src/ipc/glue/GeckoChildProcessHost.cpp, line 576
19:16:53 INFO - [Parent 3416] WARNING: Failed to launch tab subprocess: file z:/task_1478711748/build/src/ipc/glue/GeckoChildProcessHost.cpp, line 576
19:16:53 INFO - JavaScript error: chrome://global/content/bindings/remote-browser.xml, line 376: NS_ERROR_UNEXPECTED: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIScriptSecurityManager.getLoadContextCodebasePrincipal]
Reporter | ||
Comment 60•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c937f1a5580136ae32bdc1c5b63cfeb4806e0925 adds my traditional process debug logging.
Comparing bb to tc, I note that bb runs as an Administrator and tc does not; I cannot seem to get the os-groups to work now -- scope problems -- but maybe privileges are not really an issue?
Anyway, the mozharness script is still having trouble with process management because at https://hg.mozilla.org/try/rev/f752479203c3a0999d8abeed4f8c4436b95889b4#l3.12, on bb we get limitflags == 2048 == JOB_OBJECT_LIMIT_BREAKAWAY_OK while on tc we get limitflags == 0.
Assignee | ||
Comment 61•8 years ago
|
||
Hey guys - I'm really sorry this didn't fix it first time - however, I'm sure we can get to the bottom of it, as all of the machinery is in place to make this possible now. I'll deep dive and when I come up for air, I'll let you know what I found!
Assignee | ||
Comment 62•8 years ago
|
||
d'oh - so it looks like the attribute "NoSub" is the inverse of what we might expect, based on
https://github.com/contester/runlib/blob/e1c5f73778c7081e9ce20b930999d03f51e78f4d/subprocess/subprocess_windows.go#L140
In this statement, it would use CreateProcessWithLogonW if the NoSub boolean is true - and we want to avoid using CreateProcessWithLogonW. The check for being Windows 8 or higher is also ok, because then it is fine to use CreateProcessWithLogonW (such as for Windows 10). So basically, I'm guessing I just need to invert the boolean here (so it is false, not true):
https://github.com/taskcluster/generic-worker/blob/7f6a71f0b6f05d3de37396e3b0fc77dc646a26ae/process/process_windows.go#L163
and then we are done! Hopefully!
So maybe "NoJob" doesn't mean "don't create a job" but means "calling process not already in a job"? Who knows. But the logic appears correct, if we invert it.
I'll hijack a worker, update the worker, rerun the test, and see!
Watch this space...
Assignee | ||
Comment 63•8 years ago
|
||
Good news! With the switch, I now see we are using CreateProcessAsUser instead of CreateProcessWithLogonW:
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 2016/11/10 11:34:35 Shutting down immediately - panic occurred!
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 2016/11/10 11:34:35 goroutine 1 [running]:
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: runtime/debug.Stack(0x0, 0x0, 0x0)
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/go/src/runtime/debug/stack.go:24 +0x80
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.runWorker.func1()
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:407 +0x66
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: panic(0x7b5cc0, 0x11e30e80)
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/go/src/runtime/panic.go:458 +0x40b
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.(*TaskRun).run.func1(0x12016028, 0x11cecc00)
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1012 +0x162
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: panic(0x7b5cc0, 0x11e30e80)
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/go/src/runtime/panic.go:458 +0x40b
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.(*TaskRun).run.func2(0x12016028, 0x11cecc00, 0x9a14c0, 0x120171f0)
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1030 +0x1c7
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: panic(0x7b5cc0, 0x11e30e80)
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/go/src/runtime/panic.go:458 +0x40b
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.(*TaskRun).ExecuteCommand(0x11cecc00, 0x0, 0xce2f340)
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:893 +0x43b
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.(*TaskRun).run(0x11cecc00, 0x0)
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1097 +0xba0
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.FindAndRunTask(0x3b9aca00)
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:517 +0x2a9
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.runWorker()
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:447 +0x625
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.main()
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:284 +0x648
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 2016/11/10 11:34:35 Cause: CreateFrozen/CreateProcessAsUser/syscall:CreateProcessAsUser: A required privilege is not held by the client.
This is great, and also explains why I did not need to set the privileges in comment 34, as it was previously calling CreateProcessWithLogonW instead of CreateProcessAsUser.
Will hijack a new machine, add these privileges, and try again...
Assignee | ||
Comment 64•8 years ago
|
||
Hmmm, this is a little bit tricky - I need to reboot after adding permissions, but then DSC must be running at boot up, because after rebooting, the privilege is revoked. :/
Assignee | ||
Comment 65•8 years ago
|
||
I've managed to set the first privilege[1], trying again:
https://tools.taskcluster.net/task-inspector/#IJsfJe6pTo6QkO4BpVWHCQ/0
[1] "C:\Program Files\Windows Resource Kits\Tools\ntrights.exe" -u GenericWorker -r SeIncreaseQuotaPrivilege
Assignee | ||
Comment 66•8 years ago
|
||
Now trying again without logging in (to avoid screen resolution problems etc).
Assignee | ||
Comment 67•8 years ago
|
||
Assignee | ||
Comment 68•8 years ago
|
||
Stupidly forgot to upgrade the worker!
Trying again:
https://tools.taskcluster.net/task-inspector/#CA6cSzTMSkK7wWaTt4tm5Q/0
Comment 69•8 years ago
|
||
Commit pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/d602f354df03d9561aa15e06bb2e483281800733
Bug 1303455 - Switched NoJob from true to false (see comment 62 of bug)
Assignee | ||
Comment 70•8 years ago
|
||
So I've spent MANY hours looking into this today, and basically, rolling out the new worker is pretty simple - and indeed the boolean flag change was required.
*** HOWEVER ***
Amazingly, assigning the security policy privilege to the generic worker user is non-trivial.
Locally I've been using ntrights.exe from Windows Server 2003 - but this isn't supported on other versions of windows.
In theory, a tool called secedit can be used - I haven't managed to get this to work comfortably - see https://github.com/taskcluster/generic-worker/blob/b8d2251c25979f011bcca65f4519c54464248a83/plat_windows.go#L438-L466
However, I have just discovered a system call I can use:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms721786(v=vs.85).aspx
I'm going to try to use this system call directly, rather than call out to secedit which isn't working for me at this moment in time.
I had some troubles testing on a live worker, as I can't get updates to stop rolling my local changes back (no doubt I'm doing something stupid). I spent quite a while trying to do this, but was unsuccessful, so I'm trying to write the patch so that I don't need to manually alter a machine, but we can roll with some functioning AMIs.
I've been testing locally on a VM and can confirm that granting this permission with ntrights.exe is sufficient to resolve the issue.
So I'm hoping tomorrow this issue will be gone...
Assignee | ||
Comment 71•8 years ago
|
||
:grenade kindly applied some updates to OpenCloudConfig to get the permissions working and upgraded the worker on win7 - now running again .....
https://tools.taskcluster.net/task-inspector/#Mc0T0zYnQTSxn-6RM2CUTQ/0
Assignee | ||
Comment 72•8 years ago
|
||
We are now failing with:
11:41:33 INFO - Running pre test command run mouse & screen adjustment script with 'c:\mozilla-build\python\python.exe Z:\task_1478862106\mozharness\external_tools\mouse_and_screen_resolution.py --configuration-file Z:\task_1478862106\mozharness\external_tools\machine-configuration.json'
11:41:33 INFO - Running command: ('c:\\mozilla-build\\python\\python.exe', 'Z:\\task_1478862106\\mozharness\\external_tools\\mouse_and_screen_resolution.py', '--configuration-file', 'Z:\\task_1478862106\\mozharness\\external_tools\\machine-configuration.json') in Z:\task_1478862106\build
11:41:33 INFO - Copy/paste: c:\mozilla-build\python\python.exe Z:\task_1478862106\mozharness\external_tools\mouse_and_screen_resolution.py --configuration-file Z:\task_1478862106\mozharness\external_tools\machine-configuration.json
11:41:36 INFO - Screen resolution (current): (1600, 1200)
11:41:36 INFO - Changing the screen resolution...
11:41:36 INFO - Screen resolution (new): (1600, 1200)
11:41:36 INFO - Mouse position (current): (0, 0)
11:41:36 INFO - Mouse position (new): (0, 0)
11:41:36 INFO - INFRA-ERROR: The new screen resolution or mouse positions are not what we expected
It might be that the test needs to be adjusted, as maybe it makes an assumption that the system display is attached to the process user's desktop - which is no longer the case. Perhaps the python code that changes the screen resolution needs to first get a handle on the current process's desktop handle, and adjust that desktop size.
This is just a hunch.
Of course, assuming this is the problem, this will be backwardly compatible with buildbot, so is more about fixing a test assumption that previously didn't cause a problem because it was running in a one-shared-desktop environment, whereas now we have secure the system by using sandboxed tasks running as non-privileged users with their own desktops.
Assignee | ||
Comment 73•8 years ago
|
||
Indeed, this looks to be the cause.
I'm not sure if it is possible to resize a desktop that isn't attached to a display, but I did see it is possible to switch desktops using
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686347(v=vs.85).aspx
Therefore the python code could probably make use of this syscall, to switch the active desktop to the desktop associated to the current running process, before it adjusts the desktop size.
I think we can probably close this bug now, as this is a new/different problem. Everyone ok with that?
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
Assignee | ||
Comment 74•8 years ago
|
||
See https://msdn.microsoft.com/en-us/library/windows/desktop/ms682573(v=vs.85).aspx for an overview of handling of desktops in Windows.
Note, we create a dedicated desktop for the given task using CreateDesktop call (https://msdn.microsoft.com/en-us/library/windows/desktop/ms682124(v=vs.85).aspx).
Assignee | ||
Comment 75•8 years ago
|
||
Assignee | ||
Comment 76•8 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #73)
> Indeed, this looks to be the cause.
>
> I'm not sure if it is possible to resize a desktop that isn't attached to a
> display, but I did see it is possible to switch desktops using
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms686347(v=vs.85).
> aspx
On closer reading, SwitchDesktop creates and opens a desktop. Since the desktop is created, we just want to open it.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms684303(v=vs.85).aspx
This looks to be the better call.
Comment 77•8 years ago
|
||
running this on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f325de71041c07f65f155975ad78ff7cf46a390&selectedJob=30908170
I see win7 vm running (very slow compared to win10):
https://tools.taskcluster.net/task-inspector/#MLUpDxU0TTm3hnZ6n15-tA/0
this has worker:
[taskcluster 2016-11-11T15:03:11.838Z] "generic-worker": {
[taskcluster 2016-11-11T15:03:11.838Z] "go-arch": "386",
[taskcluster 2016-11-11T15:03:11.838Z] "go-os": "windows",
[taskcluster 2016-11-11T15:03:11.838Z] "go-version": "go1.7.3",
[taskcluster 2016-11-11T15:03:11.838Z] "release": "https://github.com/taskcluster/generic-worker/releases/tag/v7.0.1alpha1",
[taskcluster 2016-11-11T15:03:11.838Z] "version": "7.0.1alpha1"
[taskcluster 2016-11-11T15:03:11.838Z] },
with my try push, I disable the mouse and screen resolution preflight tool to run, so now we try to run firefox and get the same error:
ProcessManager UNABLE to use job objects to manage child processes
Updated•8 years ago
|
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
Assignee | ||
Comment 78•8 years ago
|
||
I've taken another look.
In release 7.0.0 we set `NoJob` to true, to avoid creating a job. In release 7.0.1, we set `NoJob` to false. It looks like neither work, and digging a bit deeper, I found out why:
If we set `NoJob` to true (as we did in 7.0.0), CreateProcessWithLogonW syscall gets used - which creates a job group.
If we set `NoJob` to false, we use the correct syscall, but a job gets created manually:
https://github.com/contester/runlib/blob/master/subprocess/subprocess_windows.go#L249-L272
In other words, switching this from true to false got us out of the frying pan, and into the fire.
Now taking a close look: https://github.com/contester/runlib/blob/master/subprocess/subprocess_windows.go#L190 suggests that we will still be ok, even if we manually create a job, since CREATE_BREAKAWAY_FROM_JOB is set here in the CreateProcess syscall: https://github.com/contester/runlib/blob/master/subprocess/subprocess_windows.go#L190
In other words - being tied to a job is ok, if we can breakaway from it, because aiui the mozprocess library only throws its hands up in the air if the process is already associated to a job, *and* it can't break away from it.
So on first glance, this seems ok, but then https://msdn.microsoft.com/en-us/library/windows/desktop/ms684147(v=vs.85).aspx says:
"JOB_OBJECT_LIMIT_BREAKAWAY_OK: If any process associated with the job creates a child process using the CREATE_BREAKAWAY_FROM_JOB flag while this limit is in effect, the child process is not associated with the job."
In other words, it is not sufficient to just set this flag on the process, the JOBOBJECT_BASIC_LIMIT_INFORMATION structure associated to the job group needs to have JOB_OBJECT_LIMIT_BREAKAWAY_OK set too.
And we discover on line 303, that this flag is *NOT* being set:
https://github.com/contester/runlib/blob/master/subprocess/subprocess_windows.go#L303
Therefore, the flag CREATE_BREAKAWAY_FROM_JOB on line 190 has little effect.
I believe this is the root cause.
Since the subprocess library is a third party library, I think the swiftest solution is to fork it, fix the issue, and then submit a PR back to the original author, to see if he is interested in adopting the change too.
Comment 79•8 years ago
|
||
Commit pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/839e96fe61a38496c9e51b84a8baf885879df52e
Bug 1303455 - Use taskcluster fork of runlib, and see NoJob back to true
Comment 80•8 years ago
|
||
Commit pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/4106ae398c2a2f89d292cafd73f1188c87a0aaa1
Bug 1303455 - Use taskcluster fork of runlib, and see NoJob back to true
Assignee | ||
Comment 81•8 years ago
|
||
I've forked the runlib library and adapted as follows:
https://github.com/taskcluster/runlib/commit/89093431f1d669a007d99ef3fa7362f2fc5faa69
And in generic worker, I now refer to our fork, rather than the original library:
https://github.com/taskcluster/generic-worker/commit/4106ae398c2a2f89d292cafd73f1188c87a0aaa1
There were several options for implementing, including setting the JOB_OBJECT_LIMIT_BREAKAWAY_OK limit flag explicitly in the job - but I decided to opt to not to create a job at all, for simplicity, so set NoJob back to true, and made sure CreateProcessAsUser was used instead of CreateProcessWithLogonW when this flag is set on Windows 7.
I do want to check that without a job, the task still terminates when it times out, as this is no longer managed by the job.
Assignee | ||
Comment 82•8 years ago
|
||
Thanks :grenade for mering OCC update.
New AMIs being generated here: https://tools.taskcluster.net/task-group-inspector/#/b_xzcV7OR0ugm9Z8PVDdtw
When this task is complete, we should be able to post new jobs - HOWEVER ....
Note - *THE WORKER TYPE HAS CHANGED* so tasks should use worker type "gecko-t-win7-32-beta" (note the "-beta" suffix).
This worker type is something grenade has set up so we can test AMI changes without impacting other work. Thanks grenade! ++
Assignee | ||
Comment 83•8 years ago
|
||
Retrying Joel's last job....
https://tools.taskcluster.net/task-inspector/#LuqRLcOmSQ-X7xNNqRLE6Q/0
Assignee | ||
Comment 84•8 years ago
|
||
So logging onto a live machine, and playing around with Process Explorer and Process Monitor sadly did not reveal a lot.
Processes were run with low priority, but bumping the priority higher did not improve the situation (this can be done interactively in Process Explorer).
Process Monitor revealed a lot of file access to mozharness log files, but that seems reasonable, as there are writes across multiple files for every line logged.
From Process Explorer, it was possible to see that no job group is getting created, which is what we wanted.
I am now not quite sure which avenues to explore to understand why the processes are running so slowly.
I checked MSDN to see if I could find extra parameters / settings in StartupInfo or ProcessInfo that might be relevant, but couldn't see anything suspicious.
Maybe a good night's sleep will clear my head and in the morning I'll have a bright fancy idea.
Feel free to play around with workers and let me know if you find anything / have any ideas about how we might proceed.
Maybe a dump from process monitor might be useful...
Assignee | ||
Comment 85•8 years ago
|
||
Maybe we should have another face-to-face tomorrow to brainstorm / live debug.
Assignee | ||
Comment 87•8 years ago
|
||
\o/
Yes, I believe bug 1312383 solved this for us. Thanks Rob!
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(pmoore)
Resolution: --- → FIXED
Updated•7 years ago
|
Component: General Automation → General
Assignee | ||
Comment 88•6 years ago
|
||
You need to log in
before you can comment on or make changes to this bug.
Description
•