Marionette takes down (child/parent) process if it can't bind to port

RESOLVED FIXED in Firefox 67

Status

defect
P3
normal
RESOLVED FIXED
6 months ago
2 months ago

People

(Reporter: kats, Assigned: nalexander)

Tracking

(Regression, {regression})

unspecified
mozilla68
Unspecified
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox66 unaffected, firefox67 fixed, firefox68 fixed)

Details

(Whiteboard: [geckoview:p2])

Attachments

(2 attachments)

So while trying to investigate why I was seeing bug 1533498, I discovered that using a prebuilt GVE app from treeherder totally doesn't work on the Kindle Fire HD8. It starts up, but instead of about:blank showing up in the URL bar, it remains empty. If I type in a URL and hit enter, it has no effect - the URL bar goes back to empty and nothing loads.

I used mozregression to bisect the problem (with a detour to workaround another problem, see bug 1533517), and it looks like it's a regression from bug 1524673.

$ python mozregression/main.py -n gve --good 2019-02-01 --repo mozilla-central --persist ~/Downloads/mozregression/ --adb-profile-dir=/sdcard/tests
<output omitted>
17:24.74 INFO: Last good revision: 887cc76ba1899544f912bf2614a843b03766d017
17:24.74 INFO: First bad revision: a971c53ea2c08f806d5ea1bca5a3c48d68151749
17:24.74 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=887cc76ba1899544f912bf2614a843b03766d017&tochange=a971c53ea2c08f806d5ea1bca5a3c48d68151749
Flags: needinfo?(nalexander)
Keywords: regression

Just to be clear, GVE works fine on other devices I've tried, like the Nexus 4 and a Sony Xperia XZ1C. It's just the Kindle Fire HD8 that seems to be affected. I didn't see anything particularly useful in logcat either.

Posted file Logcat

In case it helps, here's the logcat. I started collecting logcat, then ran python mozregression/main.py -n gve --launch 2019-03-08 --repo mozilla-central --persist ~/Downloads/mozregression/ --adb-profile-dir=/sdcard/tests and waited for the app to start. At around 07:22:20.153 in the log I tap the (empty) URL bar to bring up the keyboard, and type in "google.com" and then hit enter around 07:22:25.722, which has zero effect.

I'm not sure if that's only a red herring or relevant, but I see

03-08 07:22:15.817 I/Gecko ( 8910): 1552047735816 Marionette FATAL Remote protocol server failed to start: Error: > Could not bind to port 2828 (NS_ERROR_SOCKET_ADDRESS_IN_USE)(chrome://marionette/content/server.js:87:17) JS Stack trace: set acceptConnections@server.js:87:17
03-08 07:22:15.817 I/Gecko ( 8910): start@server.js:117:5
03-08 07:22:15.817 I/Gecko ( 8910): init/<@marionette.js:471:21

and then a little later the content process dies:

03-08 07:22:16.149 I/GeckoProcessManager( 8910): Binder died for tab
03-08 07:22:16.149 I/ActivityManager( 581): Process org.mozilla.geckoview_example:tab (pid 8949) has died

Ah, I totally missed that! Based on that I took a look at all the other Mozilla things I had installed:

$ pm list packages | grep mozilla                                                                                                                                                                 
package:org.mozilla.fenix.debug
package:org.mozilla.reference.browser.debug
package:org.mozilla.fennec_kats
package:org.mozilla.geckoview_example
package:org.mozilla.geckoview_example.test

and after uninstalling the reference browser, GVE started working again. So presumably the reference browser also listens on port 2828, and was interfering with GVE.

Whiteboard: [geckoview]

Marionette should not take down the entire content process if it can't bind to its port.

Summary: GVE app from TreeHerder doesn't load pages on HD8 → Marionette takes down content process if it can't bind to port

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #5)

Marionette should not take down the entire content process if it can't bind to its port.

So I agree we shouldn't be taking down the process, but I have two questions:

  • is Marionette even supposed to be running in a content process? I expect it to be all main process...
  • what should Marionette do if it can't bind to its port? If you're actually using it, that's a critical failure. But most people will only care about Remote Debugging, and then Marionette failing isn't critical at all.
Flags: needinfo?(nalexander)

(In reply to Nick Alexander :nalexander [he/him] from comment #6)

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #5)

Marionette should not take down the entire content process if it can't bind to its port.

So I agree we shouldn't be taking down the process, but I have two questions:

  • is Marionette even supposed to be running in a content process? I expect it to be all main process...

Good point, I agree that seems wrong. Maybe :dburns can clarify that.

  • what should Marionette do if it can't bind to its port? If you're actually using it, that's a critical failure. But most people will only care about Remote Debugging, and then Marionette failing isn't critical at all.

IMHO it should be enough to log an error to logcat. Our harnesses should generally find those and report them.

Under which conditions does Marionette bind to the port inside a content process? The code which is doing that in marionette.js is always running in chrome scope (parent). So I'm a bit confused by that.

We currently shutdown the application because under automation there is no need to further wait until a timeout when it is clear that a connection will never happen. Maybe we have add an option here which specifies were it is allowed to shutdown the process?

Ah, sorry, it's killing the parent process. Still, it shouldn't do that when running in GV at least because we don't own that process.

Summary: Marionette takes down content process if it can't bind to port → Marionette takes down process if it can't bind to port

Moving to Marionette component

Component: General → Marionette
OS: All → Android
Product: GeckoView → Testing
Whiteboard: [geckoview] → [geckoview:p2]

init(quit=false) could be used when the marionette-startup-requested notification is handled and the current application is GeckoView based. The question is how to best know that here.

https://searchfox.org/mozilla-central/rev/f1c7ba91fad60bfea184006f3728dd6ac48c8e56/testing/marionette/components/marionette.js#417

Please note that we don't have any man power to work on that bug at the moment.

Priority: -- → P3

You can use Services.androidBridge.isFennec to detect Fennec. The presence of Services.androidBridge and !isFennec means GeckoView.

See Also: → 1528279

(In reply to Agi | :agi | ⏰ EST | he/him from comment #13)

This is probably the cause of https://bugzilla.mozilla.org/show_bug.cgi?id=1528279

I will take this. I will probably randomize the Marionette port in the case it's not explicitly defined (if we can deduce that is the case), or I will follow Whimboo's suggestion in #c11.

I would really prefer to use a Unix local socket, like the Remote Debugger does, 'cuz these are implicitly per Application. I wonder how hard that is...

Assignee: nobody → nalexander
Status: NEW → ASSIGNED

(In reply to Nick Alexander :nalexander [he/him] from comment #14)

I would really prefer to use a Unix local socket, like the Remote Debugger does, 'cuz these are implicitly per Application. I wonder how hard that is...

That would indeed be nice. I think the most difficulty there probably lies in the client side since it now has to use adb to forward the connection, etc.

Just a status update here: it's quite difficult to square the circle of:

  1. keeping existing Marionette TCP/IP ports, which default to 2828 (and don't appear to be randomized in automation)
  2. not binding too aggressively to TCP/IP ports, which are a global resource
  3. wanting to tie Marionette directly to the Remote Debugger flag in GeckoView. (Really, we don't want Marionette at all -- we want one protocol for remote control, and it's not Marionette long-term.)

I'm leaning toward backing out the pref-management parts of the patch I landed in Bug 1524673, but it actually looks like it might not be hard to support UNIX sockets and/or abstract sockets. (Just like the Remote Debugger and the new Remote Control protocols do.) So if that's true, we might:

  1. make GV-based Apps default to a UNIX socket in the wild
  2. make automation know how to get those Apps started with TCP/IP sockets
  3. make sure that Fennec is unchanged

I see that the port in Marionette is an integer (i.e., doesn't easily accommodate a UNIX socket name), which is unfortunate. But it might still be possible... I shall investigate further.

Note to self: this will require uplift, 'cuz the bad behaviour has escaped the lab.

We really want GeckoView's single remote debugigng setting to
determine whether the engine can be remote controlled, but we're not
quite there yet. The devtools use an abstract UNIX socket for this
purpose, but Marionette uses a TCP socket that defaults to port 2828,
and that means we see cross-App clashes for that port.

Functionally this means that enabling Marionette reverts to the "old
method": either pass the "--marionette" command line argument or set
the MOZ_MARIONETTE=1 environment variable to enable. Callers remain
responsible for ensuring that the Marionette port is available.

Pushed by nalexander@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3840d093c11b
Don't make GeckoView's remote debugging setting control Marionette. r=snorp
Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

Comment on attachment 9054041 [details]
Bug 1533704 - Don't make GeckoView's remote debugging setting control Marionette. r?snorp

Beta/Release Uplift Approval Request

  • Feature/Bug causing the regression: Bug 1524673
  • User impact if declined: GeckoView consuming Apps will fight for port 2828, meaning that if two GV-consuming Apps are run on a device, one (the loser of the fight) will crash.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It's not particularly risky: our suites run in automation; I've tested locally; and we'd see issues in our own developer base if it was a problem.
  • String changes made/needed:
Attachment #9054041 - Flags: approval-mozilla-beta?

Comment on attachment 9054041 [details]
Bug 1533704 - Don't make GeckoView's remote debugging setting control Marionette. r?snorp

Marionnette fix for geckoview, uplift accepted for our next beta, thanks.

Attachment #9054041 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
No longer blocks: 1524673
Regressed by: 1524673
Summary: Marionette takes down process if it can't bind to port → Marionette takes down (child/parent) process if it can't bind to port

【 Digression 】

The following digresses from the original topic of the bug, but may
provide some useful insight into Marionette’s future.

First, I’m surprised anyone would find it surprising behaviour for
a program to exit early if it fails to bind to a TCP port essential
to its performance. If you can’t bind to the port when you request
the server to start and ignore the error, you have no way of knowing
what went wrong other than by parsing stdout which is not a reliable
way to tell in this case, because Log.jsm logging can be turned off
by configuration.

The correct way to use parallel Firefox instances in automation is
to set marionette.port to 0 to have the system atomically allocate
the port, then read back the same preference once the server has
started. We have many times considered making use of domain sockets
for Marionette and you can see a bit of the discussion around that
and other alternatives in
https://bugzilla.mozilla.org/show_bug.cgi?id=1240830.

There is some movement towards unifying on a single CDP-based remote
protocol for Firefox happening with the https://wiki.mozilla.org/Remote
project. We’re likely to be porting Marionette to in the long-term,
which means the --marionette flag and the 2828 TCP port will go
away in favour of a WebSocket based interface. This change should
be totally transparent to any users of geckodriver.

You need to log in before you can comment on or make changes to this bug.