Marionette takes down (child/parent) process if it can't bind to port
Categories
(Remote Protocol :: Marionette, defect, P3)
Tracking
(firefox-esr60 unaffected, firefox66 unaffected, firefox67 fixed, firefox68 fixed)
Tracking | Status | |
---|---|---|
firefox-esr60 | --- | unaffected |
firefox66 | --- | unaffected |
firefox67 | --- | fixed |
firefox68 | --- | fixed |
People
(Reporter: kats, Assigned: nalexander)
References
(Regression)
Details
(Keywords: regression, Whiteboard: [geckoview:p2])
Attachments
(2 files)
281.48 KB,
text/plain
|
Details | |
47 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
So while trying to investigate why I was seeing bug 1533498, I discovered that using a prebuilt GVE app from treeherder totally doesn't work on the Kindle Fire HD8. It starts up, but instead of about:blank showing up in the URL bar, it remains empty. If I type in a URL and hit enter, it has no effect - the URL bar goes back to empty and nothing loads.
I used mozregression to bisect the problem (with a detour to workaround another problem, see bug 1533517), and it looks like it's a regression from bug 1524673.
$ python mozregression/main.py -n gve --good 2019-02-01 --repo mozilla-central --persist ~/Downloads/mozregression/ --adb-profile-dir=/sdcard/tests
<output omitted>
17:24.74 INFO: Last good revision: 887cc76ba1899544f912bf2614a843b03766d017
17:24.74 INFO: First bad revision: a971c53ea2c08f806d5ea1bca5a3c48d68151749
17:24.74 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=887cc76ba1899544f912bf2614a843b03766d017&tochange=a971c53ea2c08f806d5ea1bca5a3c48d68151749
Updated•5 years ago
|
Reporter | ||
Comment 1•5 years ago
|
||
Just to be clear, GVE works fine on other devices I've tried, like the Nexus 4 and a Sony Xperia XZ1C. It's just the Kindle Fire HD8 that seems to be affected. I didn't see anything particularly useful in logcat either.
Reporter | ||
Comment 2•5 years ago
|
||
In case it helps, here's the logcat. I started collecting logcat, then ran python mozregression/main.py -n gve --launch 2019-03-08 --repo mozilla-central --persist ~/Downloads/mozregression/ --adb-profile-dir=/sdcard/tests
and waited for the app to start. At around 07:22:20.153 in the log I tap the (empty) URL bar to bring up the keyboard, and type in "google.com" and then hit enter around 07:22:25.722, which has zero effect.
Comment 3•5 years ago
|
||
I'm not sure if that's only a red herring or relevant, but I see
03-08 07:22:15.817 I/Gecko ( 8910): 1552047735816 Marionette FATAL Remote protocol server failed to start: Error: > Could not bind to port 2828 (NS_ERROR_SOCKET_ADDRESS_IN_USE)(chrome://marionette/content/server.js:87:17) JS Stack trace: set acceptConnections@server.js:87:17
03-08 07:22:15.817 I/Gecko ( 8910): start@server.js:117:5
03-08 07:22:15.817 I/Gecko ( 8910): init/<@marionette.js:471:21
and then a little later the content process dies:
03-08 07:22:16.149 I/GeckoProcessManager( 8910): Binder died for tab
03-08 07:22:16.149 I/ActivityManager( 581): Process org.mozilla.geckoview_example:tab (pid 8949) has died
Reporter | ||
Comment 4•5 years ago
|
||
Ah, I totally missed that! Based on that I took a look at all the other Mozilla things I had installed:
$ pm list packages | grep mozilla
package:org.mozilla.fenix.debug
package:org.mozilla.reference.browser.debug
package:org.mozilla.fennec_kats
package:org.mozilla.geckoview_example
package:org.mozilla.geckoview_example.test
and after uninstalling the reference browser, GVE started working again. So presumably the reference browser also listens on port 2828, and was interfering with GVE.
Updated•5 years ago
|
Marionette should not take down the entire content process if it can't bind to its port.
Assignee | ||
Comment 6•5 years ago
|
||
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #5)
Marionette should not take down the entire content process if it can't bind to its port.
So I agree we shouldn't be taking down the process, but I have two questions:
- is Marionette even supposed to be running in a content process? I expect it to be all main process...
- what should Marionette do if it can't bind to its port? If you're actually using it, that's a critical failure. But most people will only care about Remote Debugging, and then Marionette failing isn't critical at all.
(In reply to Nick Alexander :nalexander [he/him] from comment #6)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #5)
Marionette should not take down the entire content process if it can't bind to its port.
So I agree we shouldn't be taking down the process, but I have two questions:
- is Marionette even supposed to be running in a content process? I expect it to be all main process...
Good point, I agree that seems wrong. Maybe :dburns can clarify that.
- what should Marionette do if it can't bind to its port? If you're actually using it, that's a critical failure. But most people will only care about Remote Debugging, and then Marionette failing isn't critical at all.
IMHO it should be enough to log an error to logcat. Our harnesses should generally find those and report them.
Comment 8•5 years ago
|
||
Under which conditions does Marionette bind to the port inside a content process? The code which is doing that in marionette.js is always running in chrome scope (parent). So I'm a bit confused by that.
We currently shutdown the application because under automation there is no need to further wait until a timeout when it is clear that a connection will never happen. Maybe we have add an option here which specifies were it is allowed to shutdown the process?
Ah, sorry, it's killing the parent process. Still, it shouldn't do that when running in GV at least because we don't own that process.
Comment 10•5 years ago
|
||
Moving to Marionette component
Comment 11•5 years ago
|
||
init(quit=false)
could be used when the marionette-startup-requested
notification is handled and the current application is GeckoView based. The question is how to best know that here.
Please note that we don't have any man power to work on that bug at the moment.
Updated•5 years ago
|
Updated•5 years ago
|
You can use Services.androidBridge.isFennec
to detect Fennec. The presence of Services.androidBridge
and !isFennec
means GeckoView.
Comment 13•5 years ago
|
||
This is probably the cause of https://bugzilla.mozilla.org/show_bug.cgi?id=1528279
Assignee | ||
Comment 14•5 years ago
|
||
(In reply to Agi | :agi | ⏰ EST | he/him from comment #13)
This is probably the cause of https://bugzilla.mozilla.org/show_bug.cgi?id=1528279
I will take this. I will probably randomize the Marionette port in the case it's not explicitly defined (if we can deduce that is the case), or I will follow Whimboo's suggestion in #c11.
I would really prefer to use a Unix local socket, like the Remote Debugger does, 'cuz these are implicitly per Application. I wonder how hard that is...
Updated•5 years ago
|
(In reply to Nick Alexander :nalexander [he/him] from comment #14)
I would really prefer to use a Unix local socket, like the Remote Debugger does, 'cuz these are implicitly per Application. I wonder how hard that is...
That would indeed be nice. I think the most difficulty there probably lies in the client side since it now has to use adb to forward the connection, etc.
Assignee | ||
Comment 16•5 years ago
|
||
Just a status update here: it's quite difficult to square the circle of:
- keeping existing Marionette TCP/IP ports, which default to 2828 (and don't appear to be randomized in automation)
- not binding too aggressively to TCP/IP ports, which are a global resource
- wanting to tie Marionette directly to the Remote Debugger flag in GeckoView. (Really, we don't want Marionette at all -- we want one protocol for remote control, and it's not Marionette long-term.)
I'm leaning toward backing out the pref-management parts of the patch I landed in Bug 1524673, but it actually looks like it might not be hard to support UNIX sockets and/or abstract sockets. (Just like the Remote Debugger and the new Remote Control protocols do.) So if that's true, we might:
- make GV-based Apps default to a UNIX socket in the wild
- make automation know how to get those Apps started with TCP/IP sockets
- make sure that Fennec is unchanged
I see that the port in Marionette is an integer (i.e., doesn't easily accommodate a UNIX socket name), which is unfortunate. But it might still be possible... I shall investigate further.
Note to self: this will require uplift, 'cuz the bad behaviour has escaped the lab.
Assignee | ||
Comment 17•5 years ago
|
||
We really want GeckoView's single remote debugigng setting to
determine whether the engine can be remote controlled, but we're not
quite there yet. The devtools use an abstract UNIX socket for this
purpose, but Marionette uses a TCP socket that defaults to port 2828,
and that means we see cross-App clashes for that port.
Functionally this means that enabling Marionette reverts to the "old
method": either pass the "--marionette" command line argument or set
the MOZ_MARIONETTE=1
environment variable to enable. Callers remain
responsible for ensuring that the Marionette port is available.
Comment 18•5 years ago
|
||
Pushed by nalexander@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3840d093c11b Don't make GeckoView's remote debugging setting control Marionette. r=snorp
Comment 19•5 years ago
|
||
bugherder |
Assignee | ||
Comment 20•5 years ago
|
||
Comment on attachment 9054041 [details]
Bug 1533704 - Don't make GeckoView's remote debugging setting control Marionette. r?snorp
Beta/Release Uplift Approval Request
- Feature/Bug causing the regression: Bug 1524673
- User impact if declined: GeckoView consuming Apps will fight for port 2828, meaning that if two GV-consuming Apps are run on a device, one (the loser of the fight) will crash.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): It's not particularly risky: our suites run in automation; I've tested locally; and we'd see issues in our own developer base if it was a problem.
- String changes made/needed:
Comment 21•5 years ago
|
||
Comment on attachment 9054041 [details]
Bug 1533704 - Don't make GeckoView's remote debugging setting control Marionette. r?snorp
Marionnette fix for geckoview, uplift accepted for our next beta, thanks.
Comment 22•5 years ago
|
||
bugherder uplift |
Updated•5 years ago
|
Assignee | ||
Updated•5 years ago
|
Comment 23•5 years ago
|
||
【 Digression 】
The following digresses from the original topic of the bug, but may
provide some useful insight into Marionette’s future.
First, I’m surprised anyone would find it surprising behaviour for
a program to exit early if it fails to bind to a TCP port essential
to its performance. If you can’t bind to the port when you request
the server to start and ignore the error, you have no way of knowing
what went wrong other than by parsing stdout which is not a reliable
way to tell in this case, because Log.jsm logging can be turned off
by configuration.
The correct way to use parallel Firefox instances in automation is
to set marionette.port
to 0 to have the system atomically allocate
the port, then read back the same preference once the server has
started. We have many times considered making use of domain sockets
for Marionette and you can see a bit of the discussion around that
and other alternatives in
https://bugzilla.mozilla.org/show_bug.cgi?id=1240830.
There is some movement towards unifying on a single CDP-based remote
protocol for Firefox happening with the https://wiki.mozilla.org/Remote
project. We’re likely to be porting Marionette to in the long-term,
which means the --marionette
flag and the 2828 TCP port will go
away in favour of a WebSocket based interface. This change should
be totally transparent to any users of geckodriver.
Updated•2 years ago
|
Updated•1 year ago
|
Description
•