Closed Bug 831010 Opened 9 years ago Closed 7 years ago

Marionette start_session fails if called too early after gecko launch

Categories

(Testing :: Marionette, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgriffin, Assigned: jlal)

References

Details

Attachments

(3 files)

A recent patch of mine on bug 826111 was backed out because it caused desktop Mn tests to go orange.  The cause was a change of time.sleep(5) to time.sleep(1) in wait_for_port.  The error this produced was:

https://tbpl.mozilla.org/php/getParsedLog.php?id=18799310&tree=Mozilla-Inbound

Apparently, we are depending on something in newSession that isn't available right away at browser launch, and we're not detecting or waiting on this properly.
Summary: Marionette start_session fails if called too early after browser launch → Marionette start_session fails if called too early after gecko launch
This happens in b2g as well. For b2g, we need to figure out when gaia is ready to be used, and only then should we respond to newSession
Assignee: nobody → jlal
So even with the measures we take in the gaia-ui-tests I believe we hit similar problems (or the same problem) https://gist.github.com/lightsofapollo/38c989d8dfb4c7c1a8a5 though that example seems to go one step further
Update-

 - The loadFrameScript call almost always happens before the system app loads (the iframe is not even generated yet) 

 - Some portions of these bugs are "crashes" and don't actually fall into this category of bug I am trying to fix (hangs basically without a crash)

 - Waiting for the system app iframe to become available seems to help (I have no idea why yet)
A little early to tell but another symtom here is the systemapp frame is never created in (some) cases of failure! https://queue.taskcluster.net/v1/task/_MGHRJ__Q2Op3Oz-mAiFzA/runs/0/artifacts/public/marionette_js_tests/debug.log
So shell.js is never "started" (see the start method) probably a settings db bug or something further up.
Hrm so maybe this is totally unrelated to the title of this bug but the problem I see is actually the settingsdb is not opened yet. Log: https://gist.github.com/lightsofapollo/f073da3e9952be91a691 you can see  "Try to open database:settings 5" but never actually successfully opening. I also see indications that the settings db is upgrading.
The settingsdb is a red herring... Increasing the timeout does eventually show that shell.js starts but this does not seem to effect marionette actually by increasing the timeout I think I have a better hint at what is going on. It appears that the marionette server is unresponsive (I have literally no idea why) and the listener is never getting created.
Looks like loadFrameScript is misbehaving or something (not clear yet) from this log https://gist.github.com/lightsofapollo/dca9979606fe9cd57d25 its clear that we do in fact call loadFrameScript but its never running any of the meaningful bits of loading the scripts (a few not just the listener) with the exception of preload.js
See Also: → 1037924
This is likely the same problem that we are seeing in bug 1037924 and the thing that was preventing us from un-hiding Gij.
My latest try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=6f8b072d1be6

To debug this I have been been doing something like this to debug:

-> hardcode a gecko build (in make b2g) to the gecko try build

(Example)

b2g: node_modules/.bin/mozilla-download
	curl http://ftp.mozilla.org/pub/mozilla.org/b2g/try-builds/jlal@mozilla.com-ef9d5dbad39c/try-linux64_gecko-debug/b2g-35.0a1.en-US.linux-x86_64.tar.bz2 | tar xj

-> Create a pull request in gaia and observe results in https://treeherder.allizom.org/ui/#/jobs?repo=gaia for your PR.

-> Analyze failures which have "before suite" related issues ( I usually trace the logs from newSession down)

-> Add more printf/debug/statements repeat
Attached file normal_settings.log
Normal settings db startup log
Attached file abnormal_settings.log
Settings db startup logs inn the case of session startup failures
It's hard to tell if this is a symptom or the problem but clearly startup is broken here. Trace through "language.current" in both logs and you will see in the error case we never communicate back the settings data for language.current (which in turn prevents shell.js from starting). I thought this was a race (it may still be) but I am allowing a 6 minute! timeout here for shell.js to start.
Ah forgot to turn on SettingsRequestManager DEBUG=true (/me feeling more motivated to make these "loggers" friendly to runtime)
Attached file abnormal2.log
Attached log with "Cannot find lock ID" errors... I think this is the bug (or at least a bug) ! When the lock ID cannot be found we essentially just freeze without a response to the lock requester... Now to figure out why we cannot find the lock id...
This is interesting... In the constructor of the "SettingsLock" from SettingsManager.js we send "CreateLock" I clearly see a bunch of these objects being created (more then 3) but only 3 actually get processed by SettingsRequestManager.jsm. So these messages are getting lost somehow...
Depends on: 1089405
Note the updated context is here 1089405 once that bug lands this should go away... going to leave this open until we are sure this gone (give it a week of TBPL).
have not seen this since landing 1089405 marking as fixed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Blocks: 1091685
You need to log in before you can comment on or make changes to this bug.