Closed Bug 1319453 Opened 8 years ago Closed 7 years ago

marionette test fails on taskcluster windows 7 tests

Categories

(Release Engineering :: General, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: grenade, Assigned: pmoore)

References

()

Details

when marionette tests are run, they fail with this exception:

10:43:05     INFO -  {"thread": "MainThread", "level": "ERROR", "pid": 3512, "source": "Marionette test runner", "time": 1479811385581, "action": "log", "message": "Failure during harness execution", "stack": "Traceback (most recent call last):\n\n  File \"Z:\\task_1479811084\\build\\tests\\marionette\\marionette\\runtests.py\", line 87, in cli\n    failed = harness_instance.run()\n\n  File \"Z:\\task_1479811084\\build\\tests\\marionette\\marionette\\runtests.py\", line 69, in run\n    runner.run_tests(tests)\n\n  File \"Z:\\task_1479811084\\build\\venv\\lib\\site-packages\\marionette\\runner\\base.py\", line 824, in run_tests\n    need_external_ip = self._start_marionette()\n\n  File \"Z:\\task_1479811084\\build\\venv\\lib\\site-packages\\marionette\\runner\\base.py\", line 775, in _start_marionette\n    self.marionette = self.driverclass(**self._build_kwargs())\n\n  File \"Z:\\task_1479811084\\build\\venv\\lib\\site-packages\\marionette_driver\\marionette.py\", line 581, in __init__\n    self.raise_for_port(timeout=startup_timeout)\n\n  File \"Z:\\task_1479811084\\build\\venv\\lib\\site-packages\\marionette_driver\\decorators.py\", line 28, in _\n    m.handle_socket_failure()\n\n  File \"Z:\\task_1479811084\\build\\venv\\lib\\site-packages\\marionette_driver\\decorators.py\", line 23, in _\n    return func(*args, **kwargs)\n\n  File \"Z:\\task_1479811084\\build\\venv\\lib\\site-packages\\marionette_driver\\marionette.py\", line 692, in raise_for_port\n    self.host, self.port))\n\nIOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)\n"}
10:43:05    ERROR - Return code: 1
10:43:05     INFO - TinderboxPrint: marionette<br/><em class="testfail">T-FAIL</em>
10:43:05     INFO - gecko.log not found
10:43:05     INFO - TinderboxPrint: marionette<br/><em class="testfail">T-FAIL</em>
10:43:05     INFO - Marionette exited with return code 1: FAILURE

the gecko.log referenced by the main log is empty.
the test succeeds when run as administrator.
as a note, this is causing all unittests to fail, specifically because we use marionette to install testing addons in the browser.

we fail to connect  to port 2828, looking further in gecko.log, I see error messages about failing to get preferences, this seems like we are failing to either read or write to a profile, I suspect a permissions issue.
running the offending command manually as an unprivileged user on win 7, i was able to get some more detail on the error.

Z:\task_1479823051>runas /user:grenade "Z:\\task_1479823051\\build\\venv\\Scripts\\python -u Z:\\task_1479823051\\build\\tests\\marionette\\marionette\\runtests.py --disable-e10s --gecko-log=Z:\\task_1479823051\\build\\blobber_upload_dir\\gecko.log --log-raw=Z:\\task_1479823051\\build\\blobber_upload_dir\\marionette_raw.log --log-errorsummary=Z:\\task_1479823051\\build\\blobber_upload_dir\\marionette_errorsummary.log --binary=Z:\\task_1479823051\\build\\application\\firefox\\firefox.exe --address=localhost:2828 --symbols-path=https://queue.taskcluster.net/v1/task/MFNZKG1KQeOvAzvhLyTcFw/artifacts/public/build/firefox-53.0a1.en-US.win32.crashreporter-symbols.zip Z:\\task_1479823051\\build\\tests\\marionette\\tests\\testing\\marionette\\harness\\marionette\\tests\\unit-tests.ini"
...
TEST-START | test_accessibility.py TestAccessibility.test_element_is_visible_to_accessibility
TEST-UNEXPECTED-ERROR | test_accessibility.py TestAccessibility.test_element_is_visible_to_accessibility | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: [Errno 10061] No connection could be made because the target machine actively refused it)
Traceback (most recent call last):
  File "Z:\task_1479823051\build\venv\lib\site-packages\marionette\marionette_test\testcases.py", line 143, in run
    self.setUp()
  File "Z:\task_1479823051\build\tests\marionette\tests\testing\marionette\harness\marionette\tests\unit\test_accessibility.py", line 14, in setUp
    super(TestAccessibility, self).setUp()
  File "Z:\task_1479823051\build\venv\lib\site-packages\marionette\marionette_test\testcases.py", line 469, in setUp
    super(MarionetteTestCase, self).setUp()
  File "Z:\task_1479823051\build\venv\lib\site-packages\marionette\marionette_test\testcases.py", line 257, in setUp
    self.marionette.start_session()
  File "Z:\task_1479823051\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
    m.handle_socket_failure()
  File "Z:\task_1479823051\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
    return func(*args, **kwargs)
  File "Z:\task_1479823051\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1296, in start_session
    self.protocol, _ = self.client.connect()
  File "Z:\task_1479823051\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
    self.sock.connect((self.addr, self.port))
  File "c:\mozilla-build\python\Lib\socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
TEST-INFO took 61573ms

so the exception occurs in code that is not in-tree (c:\mozilla-build\python\Lib\socket.py)

unfortunately socket.py also contains these lines:

if os.name == "nt":
    _socketmethods = _socketmethods + ('ioctl',)

and from https://msdn.microsoft.com/en-us/library/windows/desktop/ee309610.aspx:

"On Windows 7 and Windows Server 2008 R2 , this was changed so that local packets sent out of a network interface are also captured. This includes packets received on another interface and then forwarded out the network interface bound to the socket with SIO_RCVALL IOCTL.
Setting this IOCTL requires Administrator privilege on the local computer."

so it seams that using these ioctl sockets requires administrator rights. i am unable to find any documentation on the specific permission required. if i knew what it was, i could attempt to grant just that privilege to the task users group.
great find- how can we make this use admin?
It looks like this is just the equivalent of ECONNREFUSED from connect.  I don't see the ioctl method being related to this issue.

Based on comment 1 and irc messages, I think the issue was that nothing was listening on port 2828, because Apache couldn't start, right?
we don't use apache, marionette runs inside of firefox and tries to listen on 2828- so it seems that firefox is not able to start the server or accept connections.
dustin: if we run the same test as administrator, it succeeds. in my digging/testing, it looked very much like the socket call requires admin rights.
Depends on: 1319525
As discussed on IRC there seem to be issues accessing the profile and that Firefox bails out early, and maybe even don't start when using a non-admin user. Did anyone actually start Firefox manually as non-admin user? Did that work? If yes, I would do the next step and prepare a venv with mozrunner installed. Run "mozrunner --binary  %path_to_firefox_exe%" and see if that works. 

I assume you have VNC access to watch what's going on?
Per the IRC discussion, looking at a debug mochitest log:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=05fa24b28321ed677961d7a873a2facbba754cf7&filter-searchStr=tc%20win%20debug&selectedJob=31648480

It looks like Firefox is failing to initialize nsToolkitProfileService:
10:51:33     INFO -  [4084] ###!!! ASSERTION: nsToolkitProfileService::Init failed!: 'Error', file z:/task_1479808614/build/src/toolkit/profile/nsToolkitProfileService.cpp, line 1059

...which is likely causing the browser to not start, which means we never start the Marionette server and so the Python client in the harness can't connect.

Following the trail of NS_WARNINGs back, it appears that the root cause is that calling SHGetSpecialFolderLocation (to get the AppData/Roaming folder) is failing:
https://dxr.mozilla.org/mozilla-central/rev/1a3194836cb4c3da6ba3a9742a2d25cf26669b55/toolkit/xre/nsXREDirProvider.cpp#1270

To sanity check this theory I wrote a simple Rust program that calls that function (and the other function necessary to print the result):
https://github.com/luser/test-getspecialfolderlocation/blob/master/src/main.rs

I built a statically-linked binary:
https://people-mozilla.org/~tmielczarek/test-getspecialfolderlocation.exe

and ran it in a clone of that mochitest task:
https://tools.taskcluster.net/task-inspector/#GzhmCZt4SqCyUGjNqmZGfg/

where it failed with an "access denied" error:
Z:\task_1479863032>test-getspecialfolderlocation.exe
SHGetSpecialFolderLocation error: 80070005
It looks like, specifically on Windows 7, when using SHGetSpecialFolderLocation (a system call that is not only no longer supported, but its replacement system call is also deprecated, replaced by yet another system call) *in combination with* CreateProcessAsUser (which is how we launch processes on windows 7 from generic worker) that the location of %APPDATA% is the location for GenericWorker account, rather than the task user.

In the wild, firefox would be launched with CreateProcess rather than CreateProcessAsUser, which would explain why this problem does not occur in the wild.

It seems like it might not be a bad idea to update the syscall being used to a more modern, supported syscall, that is still backwardly compatible with the oldest versions of windows we provide support for.

Windows 7 task that shows %APPDATA% default value being wrong:
https://taskcluster.github.io/unified-logviewer/?url=https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FAwvt0T40TXqh_U6Q7YYa0A%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log&highlightStart=27&highlightEnd=27&lineNumber=&wrapLines=false&showLineNumbers=true&jumpToHighlight=true&followLog=false&asText=false

Windows 10 task that shows %APPDATA% with correct value (ignore the fact it can't create existing directory - that is expected):
https://taskcluster.github.io/unified-logviewer/?url=https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FNO5RLuS-TH6HqIN8E4M9-Q%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log&highlightStart=27&highlightEnd=27&lineNumber=&wrapLines=false&showLineNumbers=true&jumpToHighlight=true&followLog=false&asText=false
Note, these tests merely illustrate what the windows subsystem considers APPDATA location to be, of course if we adapt the value of the environment variable, that won't affect the windows subsystem's view of what the directory is, which will be returned by SHGetSpecialFolderLocation - so just setting APPDATA env var isn't enough to work around the problem.
No longer depends on: 1319525
i put a really nasty hack in place in an attempt to unblock our efforts to green this test suite.
i'll revert it when the g-w process which creates the task user is patched to apply the correct APPDATA path to the task user profile.
https://github.com/mozilla-releng/OpenCloudConfig/commit/f40b7a4341600d87eab00b6c70544fb9e0b161dd

the marionette server now starts and appears to run for many of the tests but eventually stops working again and we get timeouts:
https://treeherder.mozilla.org/#/jobs?repo=try&author=rthijssen@mozilla.com&filter-tier=2&group_state=expanded&selectedJob=31744277
note that teds test program no longer returns access denied, with the ugly hack: https://public-artifacts.taskcluster.net/CkSotDFsT8-gHDJIOsyYKg/0/public/logs/live_backing.log
So, I tried switching syscall from SHGetSpecialFolderLocation[1] to SHGetKnownFolderPath[2], but we get the same problem:

https://tools.taskcluster.net/task-inspector/#Fg1gm5PhR1Cb8yKXz99gpg/0

APPDATA Folder: C:\Users\GenericWorker\AppData\Roaming

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/bb762203(v=vs.85).aspx
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/bb762188(v=vs.85).aspx
Oooooohhhhh exciting!!!!

https://msdn.microsoft.com/en-us/library/windows/desktop/bb762249(v=vs.85).aspx

Maybe we can manually fix by adding this syscall for a bunch of folders in the generic worker....
Assignee: nobody → pmoore
I mentioned this on IRC, but not here. I think what you really need to do is use `CreateEnvironmentBlock`, which lets you get the proper set of environment variables *for a particular user*:
https://msdn.microsoft.com/en-us/library/windows/desktop/bb762270(v=vs.85).aspx

Specifically, here, where the code currently creates the environment using `ListToEnvironmentBlock`:
https://github.com/taskcluster/runlib/blob/c1617795ba0f4fa700b936432f78b587101734d4/subprocess/subprocess_windows.go#L157

It should first call `CreateEnvironmentBlock`, then split that block apart into some more usable data structure (like a hashtable or map or whatever go calls it), add the additional environment variables from `sub.Environment`, then generate an environment block from the result, and pass *that* to `CreateProcessWithLogonW` / `CreateProcessAsUser`.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> I mentioned this on IRC, but not here. I think what you really need to do is
> use `CreateEnvironmentBlock`, which lets you get the proper set of
> environment variables *for a particular user*:
> https://msdn.microsoft.com/en-us/library/windows/desktop/bb762270(v=vs.85).
> aspx
> 
> Specifically, here, where the code currently creates the environment using
> `ListToEnvironmentBlock`:
> https://github.com/taskcluster/runlib/blob/
> c1617795ba0f4fa700b936432f78b587101734d4/subprocess/subprocess_windows.
> go#L157
> 
> It should first call `CreateEnvironmentBlock`, then split that block apart
> into some more usable data structure (like a hashtable or map or whatever go
> calls it), add the additional environment variables from `sub.Environment`,
> then generate an environment block from the result, and pass *that* to
> `CreateProcessWithLogonW` / `CreateProcessAsUser`.

Implemented in: https://github.com/taskcluster/runlib/pull/1
Tested on a local win7 VM, and it seems to have fixed it!

https://tools.taskcluster.net/task-inspector/#U5yBZJNLSgu1wC6Ev9Oegw/0

In case that expires, here is the payload:

{
  "artifacts": [],
  "maxRunTime": 3600,
  "command": [
    "set GOROOT=%cd%\\go",
    "set GOPATH=%cd%\\gopath",
    "set PATH=%GOROOT%\\bin;%GOPATH%\\bin;%PATH%",
    "go install github.com/petemoore/appdata",
    "appdata"
  ],
  "mounts": [
    {
      "content": {
        "url": "https://storage.googleapis.com/golang/go1.7.3.windows-386.zip"
      },
      "directory": "",
      "format": "zip"
    },
    {
      "content": {
        "url": "https://raw.githubusercontent.com/petemoore/appdata/master/main.go"
      },
      "file": "gopath\\src\\github.com\\petemoore\\appdata\\main.go"
    }
  ]
}

Here is the output:

C:\Users\task_1479990869>appdata

APPDATA Folder: C:\Users\task_1479990869\AppData\Roaming


I've released a new generic worker with the fixes:
https://github.com/taskcluster/generic-worker/releases/tag/v7.1.1
Ted's test, rerun:

https://public-artifacts.taskcluster.net/ZZdnHBzdRZaucnk7T93FcA/0/public/logs/live_backing.log

Z:\task_1479996062>test-getspecialfolderlocation.exe
Got 'C:\Users\task_1479996062\AppData\Roaming'
this works great in the most recent worker/ami configuration
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.