Closed Bug 1319564 Opened 8 years ago Closed 8 years ago

TypeError: 'NoneType' object has no attribute '__getitem__' from Marionette.start_session()

Categories

(Remote Protocol :: Marionette, defect)

Version 3
defect
Not set
normal

Tracking

(firefox52 fixed, firefox53 fixed)

RESOLVED FIXED
mozilla53
Tracking Status
firefox52 --- fixed
firefox53 --- fixed

People

(Reporter: gps, Assigned: whimboo)

References

Details

Attachments

(1 file)

When running various Stylo reftests, we're seeing a failure in Marionette during connection establishment. From https://public-artifacts.taskcluster.net/JVOpoNv2SkOKKa-7z9MGRg/0/public/logs/live_backing.log:

[task 2016-11-22T21:00:10.803571Z] 21:00:10     INFO - Calling ['/home/worker/workspace/build/venv/bin/python', '-u', '/home/worker/workspace/build/tests/reftest/runreftest.py', '--disable-e10s', '--total-chunks', '16', '--this-chunk', '2', '--appname=/home/worker/workspace/build/application/firefox/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/FJvH8hbXSr6dogLlVZH1aw/artifacts/public/build/target.crashreporter-symbols.zip', '--log-raw=/home/worker/workspace/build/blobber_upload_dir/reftest-stylo_raw.log', '--log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/reftest-stylo_errorsummary.log', '--cleanup-crashes', '--log-raw=-', '--suite=reftest', '--disable-e10s', '--', 'tests/reftest/tests/layout/reftests/reftest-stylo.list'] with output_timeout 1000
[task 2016-11-22T21:00:11.041411Z] 21:00:11     INFO - REFTEST INFO | Checking for orphan ssltunnel processes...
[task 2016-11-22T21:00:11.071343Z] 21:00:11     INFO - REFTEST INFO | Checking for orphan xpcshell processes...
[task 2016-11-22T21:00:11.119936Z] 21:00:11     INFO - REFTEST INFO | Running with e10s: False
[task 2016-11-22T21:00:11.120610Z] 21:00:11     INFO - REFTEST INFO | Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -profile /tmp/tmpzNGi0f.mozrunner
[task 2016-11-22T21:00:11.713480Z] 21:00:11     INFO - ATTENTION: default value of option force_s3tc_enable overridden by environment.
[task 2016-11-22T21:00:12.178316Z] 21:00:12     INFO - 1479848412176	Marionette	INFO	Listening on port 2828
[task 2016-11-22T21:00:13.007346Z] 21:00:13     INFO - ExceptionHandler::GenerateDump cloned child 1199
[task 2016-11-22T21:00:13.008203Z] 21:00:13     INFO - ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2016-11-22T21:00:13.008280Z] 21:00:13     INFO - ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2016-11-22T21:00:13.110759Z] 21:00:13     INFO -  Traceback (most recent call last):
[task 2016-11-22T21:00:13.110955Z] 21:00:13     INFO -    File "/home/worker/workspace/build/tests/reftest/runreftest.py", line 740, in <module>
[task 2016-11-22T21:00:13.111030Z] 21:00:13     INFO -      sys.exit(run_test_harness(parser, options))
[task 2016-11-22T21:00:13.111252Z] 21:00:13     INFO -    File "/home/worker/workspace/build/tests/reftest/runreftest.py", line 734, in run_test_harness
[task 2016-11-22T21:00:13.112353Z] 21:00:13     INFO -      return reftest.runTests(options.tests, options)
[task 2016-11-22T21:00:13.112462Z] 21:00:13     INFO -    File "/home/worker/workspace/build/tests/reftest/runreftest.py", line 426, in runTests
[task 2016-11-22T21:00:13.112530Z] 21:00:13     INFO -      return self.runSerialTests(manifests, options, cmdargs)
[task 2016-11-22T21:00:13.113160Z] 21:00:13     INFO -    File "/home/worker/workspace/build/tests/reftest/runreftest.py", line 681, in runSerialTests
[task 2016-11-22T21:00:13.113220Z] 21:00:13     INFO -      debuggerInfo=debuggerInfo)
[task 2016-11-22T21:00:13.113841Z] 21:00:13     INFO -    File "/home/worker/workspace/build/tests/reftest/runreftest.py", line 622, in runApp
[task 2016-11-22T21:00:13.113916Z] 21:00:13     INFO -      marionette.start_session(timeout=options.marionette_port_timeout)
[task 2016-11-22T21:00:13.114887Z] 21:00:13     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2016-11-22T21:00:13.114951Z] 21:00:13     INFO -      return func(*args, **kwargs)
[task 2016-11-22T21:00:13.116281Z] 21:00:13     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1301, in start_session
[task 2016-11-22T21:00:13.116390Z] 21:00:13     INFO -      self.session_id = resp["sessionId"]
[task 2016-11-22T21:00:13.116465Z] 21:00:13     INFO -  TypeError: 'NoneType' object has no attribute '__getitem__'
[task 2016-11-22T21:00:13.167790Z] 21:00:13    ERROR - Return code: 1

The code in question from marionette_driver/marionette.py:

        resp = self._send_message("newSession", body)

        self.session_id = resp["sessionId"]

"resp" is None, thus triggering the TypeError.

Obviously something is going wrong. Perhaps the Firefox process crashed or something. I'm not sure. But the TypeError isn't exactly intuitive. It would be much more user friendly to help debug this failure if it gave me a helpful error message (e.g. "response was empty") and then e.g. looked to see if the process had crashed or something.

You can find a large batch of these errors at https://treeherder.mozilla.org/#/jobs?repo=stylo&revision=592e380a27d3eceac70778ecf276e29a620742ff. Strangely, some linux64-stylo debug reftests are working. Although all opt reftests and some debug reftests are getting the same TypeError. Most weird.
firefox.exe is definitely crashing. I can reproduce locally with `firefox -marionette`. The crash us due to uniqueness of the Stylo build and isn't a fault of Marionette's. But since there is a crash here, I'd expect the test harness to recover from that gracefully. Ideally it would capture the crash info to aid debugging.
If we crash too early Breakpad might not be able to detect this crash. As such our check with mozcrash will not reveal a minidump file. When you run into this crash, do you see Breakpad coming up? Or does Stylo not support Breakpad (yet)?
This appears to be crashing but still allowing the python code to connect to the server but, at which point I guess it is dead, we can't then do any communications down that socket.
Well, the communication worked at this point. At least we see a return package but this is bogus. If there would have been a problem with the communication the `process_check` decorator would have been called. And then an appropriate IOError message would be visible.

It would be great if we can get those builds run with tracing output of Marionette. Usually you can do this with the `-vv` option of the marionette command.

How important is it to get this working?
Getting the Stylo reftests running is pretty important, as it is tied to Quantum efforts. There was an attempt to fix this shortly-after-startup crash in bug 1319568, but that appears to have failed :( So identifying the crash in automation is pretty important.

I would happily get marionette printing more detailed info. However, I'm not sure how. These automation jobs are running layout/tools/runreftest.py, which launches firefox with `-marionette` and consumes the marionette Python APIs directly. When you suggest to add `-vv` to the "marionette command," I have no clue what command you could be referring to, as I don't see anything obvious in runreftest.py related to verbose logging other than MOZ_REFTEST_VERBOSE, which seems to be unrelated.

FWIW, TC does have log artifacts from the reftest invocation which aren't displayed in the main mozharness logs. From, https://treeherder.mozilla.org/#/jobs?repo=stylo&selectedJob=31586 the "reftest-stylo_raw.log" file contains some interesting info: https://queue.taskcluster.net/v1/task/Xqpm4PwjQUK5K9VFf5eb7g/runs/0/artifacts/public/test_info//reftest-stylo_raw.log

{"source": "reftest harness", "thread": "MainThread", "time": 1480446278499, "action": "log", "message": "Checking for orphan ssltunnel processes...", "level": "INFO", "pid": 1127}
{"source": "reftest harness", "thread": "MainThread", "time": 1480446278529, "action": "log", "message": "Checking for orphan xpcshell processes...", "level": "INFO", "pid": 1127}
{"source": "reftest harness", "thread": "MainThread", "time": 1480446278607, "action": "log", "message": "Running with e10s: False", "level": "INFO", "pid": 1127}
{"source": "reftest harness", "thread": "MainThread", "time": 1480446278607, "action": "log", "message": "Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -profile /tmp/tmpaxaobn.mozrunner", "level": "INFO", "pid": 1127}
{"source": "reftest harness", "process": "ProcessReader", "thread": "ProcessReader", "time": 1480446279235, "action": "process_output", "data": "ATTENTION: default value of option force_s3tc_enable overridden by environment.", "pid": 1127}
{"source": "reftest harness", "process": "ProcessReader", "thread": "ProcessReader", "time": 1480446279776, "action": "process_output", "data": "1480446279768\tMarionette\tINFO\tListening on port 2828", "pid": 1127}
{"source": "reftest harness", "process": "ProcessReader", "thread": "ProcessReader", "time": 1480446280567, "action": "process_output", "data": "thread '<unnamed>' panicked at 'already borrowed', ../src/libcore/option.rs:705", "pid": 1127}
{"source": "reftest harness", "process": "ProcessReader", "thread": "ProcessReader", "time": 1480446280567, "action": "process_output", "data": "note: Run with `RUST_BACKTRACE=1` for a backtrace.", "pid": 1127}
{"source": "reftest harness", "process": "ProcessReader", "thread": "ProcessReader", "time": 1480446280567, "action": "process_output", "data": "Redirecting call to abort() to mozalloc_abort", "pid": 1127}
{"source": "reftest harness", "process": "ProcessReader", "thread": "ProcessReader", "time": 1480446280571, "action": "process_output", "data": "ExceptionHandler::GenerateDump cloned child ExceptionHandler::WaitForContinueSignal waiting for continue signal...", "pid": 1127}
{"source": "reftest harness", "process": "ProcessReader", "thread": "ProcessReader", "time": 1480446280572, "action": "process_output", "data": "1202", "pid": 1127}
{"source": "reftest harness", "process": "ProcessReader", "thread": "ProcessReader", "time": 1480446280573, "action": "process_output", "data": "ExceptionHandler::SendContinueSignalToChild sent continue signal to child", "pid": 1127}

This definitely points to a Rust crash. Sadly, that info is not in the main task log :/
Ah, I see! So if Marionette is only enabled because of the -marionette command line option to Firefox, the -vv option will indeed not work. To get detailed logging including the tracing output you should do something similar like:

https://dxr.mozilla.org/mozilla-central/rev/8d8846f63b74eb930e48b410730ae088e9bdbee8/testing/marionette/client/marionette_driver/geckoinstance.py#163

With the level set to TRACE you will get everything logged to the console (like any dump() call). Currently I cannot find any artifact attached to the job which contains that output. So you have to get that added maybe as a new file like Marionette is doing it for gecko.log:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=mn&filter-tier=1&filter-tier=2&filter-tier=3&bugfiler&selectedJob=5734154

http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-central/sha512/ddd371dd689cd7f32457934ec714c2ce0411c5369217e211b834a26229d4c47b011a3d669285a233e65223e631bad6c84db601fc30baef6361abd8525bcb5d41

Otherwise if you could tell how to run all that locally and I can replicate it, I might be able to figure out the problem.
I like teaching others how to fish!

You can pull down and build https://hg.mozilla.org/incubator/stylo just like any other mozilla-central based repository. Be warned: this repo adds thousands of files that aren't yet in mozilla-central!

If you push to try, you can specify the "linux64-stylo" platform to trigger Firefox builds with Stylo enabled. e.g. "try: -b do -p linux64-stylo -t none -u all" (the task graph only schedules minimal tests on the linux64-stylo platform, so it is safe to run all of them via try syntax).

I imagine you can hack up the in-tree mozharness configs to enable logging, push that to try, and get decent results. I would recommend this, as I had difficulty reproducing the crash on my local machine (it only reproduced with a specific profile but it seems to be repro consistently in automation).
Ok, so if it is that hard to reproduce locally I don't want to get started in doing it at all. Now that I see that those jobs are running completely in TC, we could use a one click loaner to run the command manually, and make modifications in how we run the tests. Adding the additional settings should work. 

I won't be able to check that today but I might have a bit of time tomorrow.
I actually had a quick look and here the steps to reproduce:

1. Request a one click loaner
2. Connect to the shell and run option 1
3. cd workspace/build && . venv/bin/activate
4. /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/reftest/runreftest.py --disable-e10s --total-chunks 16 --this-chunk 6 --appname=/home/worker/workspace/build/application/firefox/firefox --utility-path=tests/bin --extra-profile-file=tests/bin/plugins --log-raw=/home/worker/workspace/build/blobber_upload_dir/reftest-stylo_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/reftest-stylo_errorsummary.log --cleanup-crashes --log-raw=- --suite=reftest --disable-e10s -- tests/reftest/tests/layout/reftests/reftest-stylo.list

So the command we fail in here is start_session() which correlates to the newSession server command. It first ends-up in the dispatcher and gets executed there. The result of the function call has to be undefined, because that's what send back:

https://dxr.mozilla.org/mozilla-central/rev/8d8846f63b74eb930e48b410730ae088e9bdbee8/testing/marionette/dispatcher.js#123

So afait somewhere in this method we return undefined:

https://dxr.mozilla.org/mozilla-central/rev/8d8846f63b74eb930e48b410730ae088e9bdbee8/testing/marionette/driver.js#490

Andreas might have a better idea here.

But what I find strange is that I do not see any Firefox window via VNC. Given that newSession waits for the initial browser window, I wonder if something is different with the window creation.
Flags: needinfo?(ato)
Oh starting Firefox manually immediately triggers a crash with the following assertion:

###!!! ABORT: Aborting on channel error.: file /home/worker/workspace/build/src
/ipc/glue/MessageChannel.cpp, line 2155
[Child 1799] ###!!! ABORT: Aborting on channel error.: file /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp, line 2155

So this should be your problem then. Looks like we do not correctly catch such a situation with a very early crash during the startup phase.
Attached file Backtrace
Make sure you are starting with e10s disabled - stylo apparently has known issues with e10s. Not sure how that manifests on debug builds. I'm pretty sure on opt builds it causes the content process to crash.
CCing some stylo people who may be able to provide insight for the crash in comment #10 and 11.
I submitted a crash report. I hope it is helpful for you: bp-3b3c8a69-d944-4308-be64-465d62161129

You better should get a new bug filed to get this startup crash fixed.

On our side we should check how we can better detect those crashes.
When having the venv activated I run marionette directly with:

> marionette --binary application/firefox/firefox --gecko-log - -vv tests/marionette/tests/testing/m
arionette/harness/marionette/tests/unit/unit-tests.ini --disable-e10s

As result I also see constant crashes with:

> IOError: Process crashed (Exit code: 11) (Reason: No data received over socket)
Ok, so the reason why we do not get the same failure message for the reftests is that under this condition Marionette doesn't handle the instance and we fail in re-raising the exception!

My patch on bug 1320380 will fix that and we will get a proper failure message from Marionette.

I think that we should move the bug into another component where it better fits into. It's clearly not a Marionette issue here.
Depends on: 1320380
Flags: needinfo?(ato)
The crash is supposedly tracked/fixed in bug 1319568. I filed this bug explicitly to track the Marionette issue. So it sounds like this bug should be duped on bug 1320380 then?

Will the fix in bug 1320380 result in the crash/stack being properly recorded in automation? We don't want developers having to jump through hoops to diagnose crashes in automation.
(In reply to Gregory Szorc [:gps] from comment #17)
> The crash is supposedly tracked/fixed in bug 1319568. I filed this bug
> explicitly to track the Marionette issue. So it sounds like this bug should
> be duped on bug 1320380 then?

Yes, in that case this is correct.

> Will the fix in bug 1320380 result in the crash/stack being properly
> recorded in automation? We don't want developers having to jump through
> hoops to diagnose crashes in automation.

Good point. It feels that I missed another part in the fix. I will have to recheck it. Given that the crash reporter came up for me during startup we should then be able to find the crash report and print the stack.
Regarding crash reports please see bug 1320380 comment 13. I don't think that Marionette is the place to do that when it doesn't handle the application process itself. However reftests are doing that, they should be responsible for it including the usage of mozcrash to find and analyze minidump files.
Anything we can do right now (bug 1320380) has been landed. So you should see a proper exception now, which then can be handled by the appropriate harness eg. check for crashes.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
I guess this isn't something that we want to uplift to 52.
Well, the patch on bug 1320380 got uplifted to 52, so this is also fixed for beta.
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: