Closed Bug 800138 Opened 7 years ago Closed 7 years ago

setting "devtools.debugger.force-local" to false breaks Marionette in b2g

Categories

(DevTools :: Debugger, defect, P1, blocker)

13 Branch
x86
All
defect

Tracking

(blocking-basecamp:+, firefox18 fixed, firefox19 fixed, firefox20 fixed, b2g18 fixed)

RESOLVED FIXED
Firefox 20
blocking-basecamp +
Tracking Status
firefox18 --- fixed
firefox19 --- fixed
firefox20 --- fixed
b2g18 --- fixed

People

(Reporter: mdas, Assigned: jgriffin)

References

Details

Attachments

(3 files, 5 obsolete files)

So we've run into a strange issue where adding:
user_pref("devtools.debugger.force-local", false);

to our user.js in B2G breaks marionette. The marionette component still loads, but if we try to make a connection (even from localhost), it doesn't accept it. We're not sure of the cause, especially since this same preference works fine when running in a Firefox environment. We were able to use this before, so there's been a regression. This problem showed up sometime last week at least.

To reproduce, first ensure that marionette is running on your b2g device:'adb forward tcp:2828 tcp:2828' and test that marionette is running with 'telnet localhost 2828.

Then add the preference to /data/local/user.js and in adb shell, run 'stop b2g' then 'start b2g'. Now try to telnet, and the connection won't be made.

Panos, do you have any hypotheses on what's going on?
I verified this on my Nexus S, but I can't get Marionette to run on my desktop b2g build (with or without the force-local pref). Did anything change in regards to network permissions in b2g lately? Can you see if this worked before bug 770778 landed?
Marionette is temporarily broken on B2G due to bug 800154, which is separate from this issue, but probably blocks investigation unless you back out the patch from bug 799536.
I still can't get marionette to listen to port 2828 on desktop even after bug 800154. I used to get the following error on desktop B2G:

Chrome file doesn't exist: /Users/past/src/mozilla-central/testing/marionette/../mochitest/specialpowers/components/SpecialPowersObserver.js
Error loading: chrome://marionette/content/marionette-actors.js: Error opening input stream (invalid filename?) - undefined

...but it went away after clobbering. I can't see any relevant error messages, but nothing listens on that port (the debugger listens on port 6000 just fine).
I don't have a desktop b2g build locally, so I pulled down the one here ftp://ftp.mozilla.org/pub/b2g/nightly/latest-mozilla-central/, and out of the box, I can communicate with marionette here with no problem. The problem is I can't reproduce the force-local problem in this environment. 

This is currently reproducible on the otoro device (I just tested it today) and on the emulator, so you can test it on the emulator. I will email you information on how to do this.
Blocks: 807126
Assignee: nobody → past
Status: NEW → ASSIGNED
Priority: -- → P2
This is needed for panda automation.
blocking-basecamp: --- → ?
This is blocking B2G automation, raising priority.
Severity: normal → blocker
Priority: P2 → P1
Hi guys,

I worked on this some over the b2g workweek.

I verified that if we turn the devtools.debugger.force-local pref to false then we lose both the marionette socket on the loopback interface and the ability to connect to that port on the main IP address. I verified this using a B2G emulator build.
If I leave that setting alone, then I can use adb to forward port 2828 to localhost:
adb forward tcp:2828 tcp:2828
And then I can connect marionette to that socket as much as as I want.

When I set devtools.debugger.force-local to false I can't connect to it.  I'm not sure what is up with that. One thing I noticed is that this code: http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#569 which is using the same mechanism that we are in dbg-server.js works fine even with this pref flipped.

I can't seem to find any errors in the C++ code or get any information out about what is going wrong.
Actually I'm wrong. They set that localonly bit to true when they create the server socket. So no wonder that works.  I wonder if I can repro this on a desktop build (the team assured me I couldn't, but I'm running out of ideas).
I spent most of the day today trying to debug this and things get weirder and weirder. First of all, the quickest way to observe the bug during my build/flash/test iterations was to keep a shell connected to the device and run 'netstat' on it. When the bug occurs, nothing listens on port 2828. I was initially concerned about tcp wrappers-style or netfilter-like blocking of random ports on non-loopback addresses, but that proved not to be the case.

My other theory was that this has something to do with opening server sockets on child processes or b2g in general, but it turns out that I can open a server socket on port 2828 just fine in the same place in mc_init, if I just add the relevant code from DS_openListener. And we know that the same code in DS_openListener works fine, because I can connect to the debugger without issues when marionette is disabled, even with force-local disabled.

So it appears that some weird interaction is happening when a non-local socket is opened from the marionette (or any?) component, on any port, by running the DS_openListener method, even though it returns true, which indicates success (no error thrown, no problems encountered).

To top it off, I can't see any dump() output from dbg-server.js, even though I get dump() output from shell.js and from marionettecomponent.js. I had to modify the return value of DS_openListener to make sure I was actually changing the correct file.
(In reply to Panos Astithas [:past] from comment #9)
> To top it off, I can't see any dump() output from dbg-server.js, even though
> I get dump() output from shell.js and from marionettecomponent.js. I had to
> modify the return value of DS_openListener to make sure I was actually
> changing the correct file.

There's some missing dump implementations for B2G.  To work around this, locally, apply the patch from bug 752116.
Thanks for the reminder, that's cleaner than what I ended up doing. I had already managed to get dumps from dbg-server.js by sticking dump to DebuggerServer.dump and using that. That led me to discover that after the listener is started something later comes up and shuts it down, causing nsIServerSocketListener.onStopListening to get called with status NS_ERROR_ABORT.
A couple of places that seem related are the following, but I need to verify this with GDB:

http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsServerSocket.cpp#199
http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransport2.cpp#1617

I see that when marionette is disabled, the debugger starts the listener on ContentStart, which is much later in the bootstrap process than when the marionette component is initialized. I wonder if there are some limits on server sockets on component initialization that are lifted further down the startup path. Can we get some platform people to look into this?

My observations so far lead me to believe that fixing bug 764913 would solve this issue, by avoiding marionettecomponent.js and loading marionette actors the same way as we do for, say profiler actors. But that seems like a much bigger task, than figuring out what causes this bug.
Blocks: 802317
Hi past,
Gaia smoketests on b2g pandas (Bug 802317 - which I'm working on) depends on this bug.
The bug I'm working on has a hard deadline for Dec. 10th and I believe I cannot make much progress without this bug being solved first.

IIUC this prevents us from running Gaia smoketests on the b2g pandas that we have racked at the colo.

If this is a correct assessment, do you believe that we could have this fixed for next week?
Who from platform could give you a hand with this bug?
(In reply to Panos Astithas [:past] from comment #12)
> 
> My observations so far lead me to believe that fixing bug 764913 would solve
> this issue, by avoiding marionettecomponent.js and loading marionette actors
> the same way as we do for, say profiler actors. But that seems like a much
> bigger task, than figuring out what causes this bug.

Just to follow-up on this, past did try to resolve bug 764913, but it breaks Marionette's ability to test WebAPI's, due to permissions issues.  In order to avoid those, Marionette needs to be loaded from a context outside of shell.js.

Next Monday, I'll try to see if loading Marionette later, but still from marionettecomponent.js, helps this issue.
I tried out jgriffin's idea from comment 14.  Trying to change the observer in marionettecomponent.js that calls init from waiting for final-ui-startup to server-message-listener-ready winds up never being called.  It could be that my panda board has something wrong in gaia, which is preventing that notice from being called.  Given that, I created a timer and had it fire init after 30s. This should have successfully caused us to wait long enough.

With the timer solution in place, marionette does get instantiated, but it is still only available from localhost. I do have devtools.debugger.force-local set to false in my modules/libpref/src/init/all.js, so that should cause the debugger code to not force loopback-only.
Ok, I figured out what was up.  Forcing the marionettecomponent.js::init to wait *does* solve the problem.  Forcing that to wait (via the timer, or better, via server-message-listener-ready if we can make that work) will ensure that the marionette socket is bound to 0.0.0.0 which will then make it remotely accessible.  The reason I didn't see this in my last comment is that I had to blow away my existing profile on the panda (adb shell rm -r /data/b2g/mozilla) so that the new default pref settings would take effect thus setting devtools.debugger.force-local to false.

Once I get my happy panda with marionette bound to the remote address:
ctalbert@ubuntu:~/projects/temp/b2gcache$ adb shell netstat
Proto Recv-Q Send-Q Local Address          Foreign Address        State
 tcp       0      0 0.0.0.0:2828           0.0.0.0:*              LISTEN
 tcp       0      0 127.0.0.1:5037         0.0.0.0:*              LISTEN
 tcp       0      0 127.0.0.1:9999         0.0.0.0:*              LISTEN
 tcp       0      0 0.0.0.0:20700          0.0.0.0:*              LISTEN
 tcp       0      0 0.0.0.0:20701          0.0.0.0:*              LISTEN

I can then try to telnet to the 2828 port.  However when I do that, the connection is refused.  Looking in the logcat shows that when I try to connect, Marionette makes the connection, queues up some output and then the dbg-transport.js function explodes before the output makes it back to me, thus closing the connection. You can see this in the logcat:
I/Gecko   (  100): 150965	Marionette	INFO	sendToClient: {"from":"conn0.marionette1","ok":true}, undefined, null

E/GeckoConsole(  100): [JavaScript Error: "[Exception... "Component returned failure code: 0x804b0010 (NS_ERROR_OFFLINE) [nsIOutputStream.write]"  nsresult: "0x804b0010 (NS_ERROR_OFFLINE)"  location: "JS frame :: chrome://global/content/devtools/dbg-transport.js :: DT_onOutputStreamReady :: line 94"  data: no]" {file: "chrome://global/content/devtools/dbg-transport.js" line: 94}]

So, past ^ want to take a look at that?  I'll attach my hacked timer patch so you can have a place to start.
Should the milestone of this bug be "B2G C2 (20nov-10dec)"?
I've been fighting with this all day yesterday, trying to reproduce the findings from comment 16, without success. Marionette never binds to 0.0.0.0, because for some reason the force-local pref is always true. I have a more invasive patch that changes all definitions and checks in the code to assume a default value of 'false' and have modified both prefs.js and user.js in the device, to no avail. Somehow a value of 'true' is set in this file:

out/target/product/crespo/system/b2g/defaults/settings.json

Clearing 'objdir-gecko' and 'out' directories didn't help, that file still gets the same value.
Apparently force-local is being defined in gaia/build/settings.py as well and due to the reflecting code in gecko/b2g/chrome/content/settings.js, I couldn't get the pref change to stick. With that fixed, I can now get a wifi connection to port 2828 with the timer hack. Looking into the server-message-listener-ready notification next.
Attached patch Working patchSplinter Review
So the problem apparently was twofold: marionette was opening the listener too early in the startup process, and gaia forced a default value of 'true' to the force-local pref. This patch in association with the following gaia pull request fixes the issue for me:

https://github.com/mozilla-b2g/gaia/pull/6603 

I got the tests to run over wifi, although most didn't succeed due to marionette-related issues (this.currBrowser is null and others). I also had wifi attachment problems in my Nexus S that required me to manually connect to my AP before running the tests, but that's a separate issue.

Let me know if this works for you, too.
Attachment #684249 - Attachment is obsolete: true
Attachment #684648 - Flags: review?(jgriffin)
Attachment #684648 - Flags: review?(21)
Thanks so much for working on this past.  I tried this patch (along with the Gaia PR) and unfortunately found the same result as in https://bugzilla.mozilla.org/show_bug.cgi?id=764913#c6, namely it breaks all the WebAPI tests.  It's looking like we'll need to make some fix or workaround in Marionette for this, so I've posted a question about this to dev.b2g.

We'll need to figure out this issue before we can land this patch in order to avoid breaking all the WebAPI tests that are currently running on TBPL.

Depending on the input from the dev.b2g thread, I'll file follow-up bugs.
Depends on: 815807
FYI, I removed the settings in the fix for bug 815522.
Comment on attachment 684648 [details] [diff] [review]
Working patch

Review of attachment 684648 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks past.

This shouldn't be landed until part 3 of bug 815807 lands, at which time this patch will be a little bitrotted.  Please push this to try before landing with the syntax:

-b o -p ics_armv7a_gecko -t none -u all

That will catch any b2g breakage this could potentially cause (although hopefully it won't).
Attachment #684648 - Flags: review?(jgriffin) → review+
I believe this patch is mostly subsumed by bug 815522 and bug 815807. The only thing remaining were a couple of cleanups I did (the URI for dbg-server.js and removing openListsner's second argument). Do you want to land them in bug 815807 as well?
(In reply to Panos Astithas [:past] from comment #25)
> I believe this patch is mostly subsumed by bug 815522 and bug 815807. The
> only thing remaining were a couple of cleanups I did (the URI for
> dbg-server.js and removing openListsner's second argument). Do you want to
> land them in bug 815807 as well?

Yes, will do!
Marking as a blocker; we can't have Marionette broken.
blocking-basecamp: ? → +
Attached file Pointer to pull request (obsolete) —
(In reply to Jonathan Griffin (:jgriffin) from comment #28)
> Created attachment 687182 [details]
> Pointer to pull request

This is for the Gaia bits which haven't been approved yet.
This bug is bb+ so you're clear to land with r+.
(In reply to Jonathan Griffin (:jgriffin) from comment #29)
> (In reply to Jonathan Griffin (:jgriffin) from comment #28)
> > Created attachment 687182 [details]
> > Pointer to pull request
> 
> This is for the Gaia bits which haven't been approved yet.

:past, can you unbitrot this so it can be cleanly merged?  In order for it to be merged, it should be squashed to 1 commit with the commit message in the format "Bug xyz - Commit message, r=vivien"
Attachment #687182 - Attachment is obsolete: true
Attachment #688047 - Flags: review?(21)
(In reply to Jonathan Griffin (:jgriffin) from comment #31)
> (In reply to Jonathan Griffin (:jgriffin) from comment #29)
> > (In reply to Jonathan Griffin (:jgriffin) from comment #28)
> > > Created attachment 687182 [details]
> > > Pointer to pull request
> > 
> > This is for the Gaia bits which haven't been approved yet.
> 
> :past, can you unbitrot this so it can be cleanly merged?  In order for it
> to be merged, it should be squashed to 1 commit with the commit message in
> the format "Bug xyz - Commit message, r=vivien"

I went ahead and submitted a new PR for this that can be automatically merged, https://github.com/mozilla-b2g/gaia/pull/6795
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Blocks: 818210
Unfortunately, I have to reopen this.  Even with all these patches landed (here and from bug 815807), setting force-local to false still kills Marionette.

We are now loading Marionette later in the cycle (at the 'system-message-listener-ready' event), but maybe this isn't late enough?  We're still loading before the UI comes up.  We never understood why loading early was breaking this, so we don't know how long we have to wait to get around it.

I'll play with it some more.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This appears to be the same thing as reported in comment #11.
I'm going to try to load Marionette even later, and see if that fixes this.
So currently, the Marionette socket never stays open if force-local is false, no matter how late we start it, even if we wait 5 minutes after boot.

This seems to indicate that something has changed in the last few weeks, since this didn't used to be the case.
Fabrice says he may be able to help debug this.
This may be a side effect of bug 815522.
I talked with Fabrice on IRC.  The plan now is to change Gaia's build so that it enables the remote debugger only for eng builds, but leaves things as-is for user builds.  This will mean that we don't have to change that pref at runtime when Marionette wants to load, which will avoid having the code landed as part of bug 815522 reset the debugger.

I'll work on this tomorrow.
Assignee: past → jgriffin
Depends on: 815522
The plan in comment #41 didn't entirely pan out, due bug 764913; the remote debugger ends up stomping on Marionette.  I discussed with Fabrice again; I'm going to add a Marionette setting to setting.py, which will enable Marionette but not the debugger, only for eng builds.  If someone wants to use the debugger on an eng build, they'll have to tweak this setting.

There will be no impact on user builds.
Wouldn't it help to just revert the 4-line change in bug 81522 and add back the ContentStart listener that starts the debugger?
(In reply to Panos Astithas [:past] from comment #43)
> Wouldn't it help to just revert the 4-line change in bug 81522 and add back
> the ContentStart listener that starts the debugger?

Which 4 lines?
This is the part that used to exist in shell.js:

window.addEventListener('ContentStart', function(evt) {
  if (Services.prefs.getBoolPref('devtools.debugger.remote-enabled')) {
    startDebugger();
  }
});

If I understand correctly what part of bug 815522 caused you problems, putting this back may fix it.
That in itself doesn't fix this.  The problem is just another manifestation of bug 764913; in this case, the fact that Marionette can't enable itself without setting devtools.debugger.remote-enabled to true, but that pref usually has another meaning.

Since bug 815522, there's a race between Marionette and the remote debugger; whichever starts first, wins.

The real fix to this is to either fix bug 764913, or to have Marionette use a custom wrapper around dbg-transport.js (so we no longer depend on dbg-server.js).  Neither of these will be a quick fix, so I'm going to go ahead with a lame pref/settings hack for now.
So, the plot thickens.  I really think there's some strange B2G bug involved here.  It turns out that when force-local=false, the first debugger connection that is opened will be closed immediately by *something*.  It doesn't matter whether this is for Marionette or the remote debugger.  The only way I'm able to get Marionette to successfully load here is to do the following:

1 - set devtools.debugger.remote-enabled to True in gaia's settings.py, so that it will try to open the debugger's socket during startup.
2 - set force-local=false very early in startup process.
3 - go ahead and let the debugger attempt to open a listener because of the Gaia settings and bug 815522.  It's socket will be closed immediately.
4 - wait long enough for 2 to happen, then open Marionette's listener; it will succeed, and will listen on 0.0.0.0:2828, just like we want.

If I omit step 3, Marionette's socket gets closed immediately upon open.  I don't know what's closing the sockets; it isn't http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#816 since I've completely disabled that code for testing.  If I reverse the order of 3 and 4, Marionette's socket gets closed immediately but the debugger's doesn't.

I can make some very lame hack to implement 1 - 4 above, but this is obviously very fragile and likely to break.  We really need to understand what is closing that socket, and why.  Fabrice, is this something you can help figure out, or do you know who could?

I'm going to attach my patches for 1 - 4 for debugging's state; they include a lot of debugging dumps etc.  If we decide to go with this terrible workaround, I'll clean them up before asking for review.
Attached patch Gecko debugging patch (obsolete) — Splinter Review
Attached patch gaia debugging patch (obsolete) — Splinter Review
So now that I understand the problem properly, the workaround is pretty simple:  open a dummy socket using nsIServerSocket before Marionette starts listening; this dummy socket will be the one that gets stomped on with NS_ERROR_ABORT, and Marionette will load correctly.  This doesn't depend on the order in which Marionette and the debugger are loaded, and in fact does't rely on starting late.  After this lands, I can remove all the early/late loading code I landed in bug 815807, since it turns out to be unnecessary.  This works quite reliably in all my testing.

I'll push to try to make sure I'm not accidentally breaking anything, however.
Attachment #689039 - Attachment is obsolete: true
I neglected to mention that I will file a separate bug about the root cause of this and try to get it addressed correctly, so that we can remove this workaround.
Try run for 07a4f16b3db2 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=07a4f16b3db2
Results (out of 6 total builds):
    success: 6
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jgriffin@mozilla.com-07a4f16b3db2
Comment on attachment 689481 [details] [diff] [review]
Open a dummy socket before Marionette starts listening,

Review of attachment 689481 [details] [diff] [review]:
-----------------------------------------------------------------

I don't know what amazes me the most, the insanity of the problem or the ingenuity of your solution.

::: testing/marionette/components/marionettecomponent.js
@@ +78,5 @@
>        case "system-message-listener-ready":
>          this.logger.info("marionette initializing at system-message-listener-ready");
>          observerService.removeObserver(this, "system-message-listener-ready");
> +
> +        Services.prefs.setBoolPref(DEBUGGER_FORCELOCAL_PREF, false);

Shouldn't you store the previous value and restore it afterwards?

@@ +85,5 @@
> +        // force-local=false fails, we open a dummy socket that will fail.
> +        // This allows the following attempt by Marionette to open a socket
> +        // to succeed.
> +        let insaneSacrificialGoat = new ServerSocket(666, false, 4);
> +        insaneSacrificialGoat.asyncListen(this);

You should close the socket after you're done with it. Probably in onStopListening?

::: toolkit/devtools/debugger/server/dbg-server.js
@@ +289,5 @@
>      }
>    },
>  
> +  onStopListening: function DS_onStopListening(aSocket, status) {
> +    dumpn("onStopListening, status: " + status);

I wish I could see these dumps in b2g. Any progress with that bug?
Addresses review comments.  No, I haven't done anything about the dump issue, I usually just use the workaround of adding gGlobal.dump = dump in dbg-server.jsm.
Attachment #689481 - Attachment is obsolete: true
Duplicate of this bug: 818210
Attachment #689863 - Flags: review?(past)
Attachment #689040 - Attachment is obsolete: true
Attachment #689863 - Flags: review?(past) → review+
green on aurora, so: https://hg.mozilla.org/releases/mozilla-beta/rev/73e86fb35a78
Whiteboard: [automation-needed-in-beta]
https://hg.mozilla.org/mozilla-central/rev/278dc672e058
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 20
Blocks: 821084
Whiteboard: [status-b2g18:fixed]
Whiteboard: [status-b2g18:fixed]
Does this issue affects Desktop Firefox too?
(In reply to MarioMi from comment #61)
> Does this issue affects Desktop Firefox too?

No, it doesn't.
Try run for 07a4f16b3db2 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=07a4f16b3db2
Results (out of 7 total builds):
    success: 6
    failure: 1
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jgriffin@mozilla.com-07a4f16b3db2
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.