Closed Bug 921850 Opened 11 years ago Closed 11 years ago

[app manager] Can't connect to Simulator on Mac

Categories

(DevTools Graveyard :: WebIDE, defect, P1)

x86
macOS
defect

Tracking

(firefox26 verified, firefox27 verified, b2g-v1.2 fixed)

RESOLVED FIXED
Firefox 27
Tracking Status
firefox26 --- verified
firefox27 --- verified
b2g-v1.2 --- fixed

People

(Reporter: paul, Assigned: ochameau)

Details

(Keywords: verifyme)

Attachments

(2 files, 3 obsolete files)

> Firefox OS 1.2 Simulator 6.0pre4dev.20130926

The Simulator starts, but the connection timeouts.
Assignee: nobody → poirot.alex
Priority: -- → P1
I think this issue might be Mac specific.  I've had this issue with all builds of Simulator 6.0 & App Manager on Mac, but it was intermittent for me and I hadn't yet heard of anyone else experiencing it, so I thought it might be profile-specific at the time.  Also, I believe Alex had never heard of the issue when I mentioned it to him.

In any event, I am able to workaround the issue and connect to the simulator by triggering other activity to take place.  It feels as if the connection to the Simulator is queue in a buffer somewhere and never actually gets sent.  Here are steps I've used in the past:

1. Start Simulator in App Manager
2. It will say "Connecting..." and the Simulator is now running
3. Open a new tab and you are shown the grid of recent sites
4. Mouse over one of your recent sites
5. Go back to the App Manager and see that you are now connected

Since it now seems like I am not the only one with this issue, we should definitely do something about, as it is quite confusing!
OS: All → Mac OS X
+1, the xpis I just built for windows and linux don't seem to have such issue.
I'm still downloading stuff on the mac I pulled from my attic to start building and testing on Mac...

We had some mac specific issues when working with simulator addon. When we dropped the pingback server and switch to "auto retry connecting until it succeed", we had some mac-specific breakage when using a setTimeout 0 between each try.

If you have everything set up in your hands and are able to reproduce it easily, could you give it a try by increasing the timeout over here?
http://mxr.mozilla.org/mozilla-central/source/toolkit/devtools/client/connection-manager.js#247

Otherwise, I will take a look at that, as soon as I get a mac with fully configured.
I tried several different values between 0 - 1000ms for the timeout you pointed out: 0, 100, 500, 1000ms.

As the timeout value gets larger, I had a higher chance of success connecting.  At 500ms, I could connect about 80% of the time.  At 1000ms, I was always able to connect successfully.

I guess I don't yet see *why* this is needed, or why it is Mac-specific, but at least we have some kind of fix.
Ok then if setting a significant delay fix this issue, it may be exactly the same issue we faced with the addon. You can get some more info from this pull request, I described what I've seen in this particular comment:
  https://github.com/mozilla/r2d2b2g/pull/488#issuecomment-18642230

Increasing the timeout is something we should do in any case, as it may loop too agressively and introduce intense CPU usage whereas we don't need to. A delay between 50 and 200 sounds like a reasonable one. But given my experience in this pull request, and your observations, that won't be enough. I workaround that issue by adding a explicit timeout to the connection attempt. So that, if the network layer does not timeout by itself, nor accept/reject the attempt, after xxxx ms, we automatically cancel the connection. The good thing with this failure is that's an easy wait to produce some wrong behavior of debugger client and/or server that is introducing failures in debugger server tests (see bug 775924).
Summary: [app manager] Can't connect to Simulator → [app manager] Can't connect to Simulator on Mac
How important is this issue? On my new mac, I don't see the problem anymore (version 20130929).
Arf... I spent time on setting a mac development env for this... and just figured out like you that it has been fixed.
Unfortunately, I can't say what fixed this, but we are very lucky, as it is also fixed on Aurora.
So the underlying fix has already been uplifted \o/

Please reopen if you still see it on a aurora >= 2013-10-09 or nightly >= 2013-10-09.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Sorry, I meant nightly >= 2013-10-08.
Resolution: FIXED → WORKSFORME
Yep, looks good to me as well!
Resolution: WORKSFORME → FIXED
Resolution: FIXED → WORKSFORME
meh. What happened?
Either a fix on network layer, or debugger server has landed.
The underlying issue described in detail in this comment:
  https://github.com/mozilla/r2d2b2g/pull/488#issuecomment-18642230
appears to be fixed.
Just tested on Firefox Nightly. Can't connect.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Whiteboard: [simulator]
Still unable to reproduce it on Firefox nightly 2013-10-14, Mac 10.7.5 with the simulator xpi being on the ftp.
I'd really like to fix the underlying issue happening around stream pump component, but I can't do it without being able to reproduce it.
Can you try increasing the timeout here to 500:
  http://mxr.mozilla.org/mozilla-central/source/toolkit/devtools/client/connection-manager.js#247
And see how much it fixes this on your mac?
We can increase this delay until I manage to reproduce it.
For some reason, it works for me today (wtf?), but I see "n/a" strings all over the app manager.
Yep, currently appears to connect most of the time for me.  I only saw it fail once in about ~10 or so attempts.
With this new patch applied, I am now able to connect every time.  I went through different scenarios of starting, stopping, restarting, etc. ~10 times.  Seems like a good improvement to me.
I've identified at least two particular issues, especially happening on Mac.
The first one was nsITransport.openOutputStream throws if the remote server is still initializing.
We must be racing between the call to createTransport and openOutputStream, where the socket starts connecting,
but ends up being closed in the main thread before we get a chance to call openOutputStream.
I'm not catching the exception in debuggerSocketConnect as we can't create a DebuggerTransport without valid output stream.
so the caller should eventually handle this exception. The cool thing is that connection-manager does it,
and behave especially well when you use its keepConnecting option!

The second race we have, that was challenging to figure out is that, at the network layer,
the client socket ended up being stuck for a 18hours timeout in STATE_CONNECTING state...
That happens in nsSocketTransport2.cpp and nsSocketTransportService2.cpp:
The socket timeout is set to a high value here:
 http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransport2.cpp#1260
And we end up waiting with this high value over here:
 http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#347
For some reason, on other OSes, we never wait on the CONNECTING state, it either connect or get rejected immediatly.
But when we race and try to connect while the server is still initializing, we are stuck in this state for 18hours!
Attachment #817200 - Attachment is obsolete: true
Attachment #817805 - Flags: review?(paul)
Attachment #817805 - Flags: review?(jryans)
New try run:
https://tbpl.mozilla.org/?tree=Try&rev=4aada61817d0
(previous one broken by another patch in its queue)
Comment on attachment 817805 [details] [diff] [review]
Fix races in debugger client when connecting agressively

Review of attachment 817805 [details] [diff] [review]:
-----------------------------------------------------------------

Overall, I think this change works well to get the Simulator connecting.  But, I'd like to understand a few more things first before approving.

::: toolkit/devtools/client/dbg-client.jsm
@@ +2177,5 @@
>    let s = socketTransportService.createTransport(null, 0, aHost, aPort, null);
> +  // By default the CONNECT socket timeout is very long, 65535 seconds,
> +  // so that if we race to be in CONNECT state while the server socket is still
> +  // initializing, the connection is stuck in connecting state for 18.20 hours!
> +  s.setTimeout(Ci.nsISocketTransport.TIMEOUT_CONNECT, 2);

Is this 2 seconds instead of 2 milliseconds (like it would be with window.setTimeout)?  It seems that way from looking at the code...

Even if it is 2 seconds, that might be fine for the simulator, but it feels like it would be too short if someone tried to connect to a remote host.  Maybe 10 seconds is better?

@@ +2179,5 @@
> +  // so that if we race to be in CONNECT state while the server socket is still
> +  // initializing, the connection is stuck in connecting state for 18.20 hours!
> +  s.setTimeout(Ci.nsISocketTransport.TIMEOUT_CONNECT, 2);
> +
> +  // openOutputStream may throw NS_ERROR_NOT_INITIALIZED if we hit some race

Who throws NS_ERROR_NOT_INITIALIZED?  Is it the socket during createTransport, or the stream during openOutputStream?  I only see NS_ERROR_NOT_INITIALIZED being used in nsSocketTransportService2.cpp, so that makes me think it is from socket creation.

@@ +2180,5 @@
> +  // initializing, the connection is stuck in connecting state for 18.20 hours!
> +  s.setTimeout(Ci.nsISocketTransport.TIMEOUT_CONNECT, 2);
> +
> +  // openOutputStream may throw NS_ERROR_NOT_INITIALIZED if we hit some race
> +  // where the nsISocketTranport get shutdown in between its instanciation and

nsISocketTranport -> nsISocketTransport
get -> gets
instanciation -> instantiation

@@ +2186,2 @@
>    let transport = new DebuggerTransport(s.openInputStream(0, 0, 0),
>                                          s.openOutputStream(0, 0, 0));

Is there a way we could refactor this in a follow up to check for some kind of state or event on the socket to be sure it is safe to call openOutputStream before doing it?
Attachment #817805 - Flags: review?(jryans)
(In reply to J. Ryan Stinnett [:jryans] from comment #20)
> Comment on attachment 817805 [details] [diff] [review]
> Fix races in debugger client when connecting agressively
> 
> Review of attachment 817805 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Overall, I think this change works well to get the Simulator connecting. 
> But, I'd like to understand a few more things first before approving.
> 
> ::: toolkit/devtools/client/dbg-client.jsm
> @@ +2177,5 @@
> >    let s = socketTransportService.createTransport(null, 0, aHost, aPort, null);
> > +  // By default the CONNECT socket timeout is very long, 65535 seconds,
> > +  // so that if we race to be in CONNECT state while the server socket is still
> > +  // initializing, the connection is stuck in connecting state for 18.20 hours!
> > +  s.setTimeout(Ci.nsISocketTransport.TIMEOUT_CONNECT, 2);
> 
> Is this 2 seconds instead of 2 milliseconds (like it would be with
> window.setTimeout)?  It seems that way from looking at the code...

This one is in seconds, this surprised me as well!
http://mxr.mozilla.org/mozilla-central/source/netwerk/base/public/nsISocketTransport.idl#92

> 
> Even if it is 2 seconds, that might be fine for the simulator, but it feels
> like it would be too short if someone tried to connect to a remote host. 
> Maybe 10 seconds is better?

Actually what happens with devices is that the CONNECT state immediatly switch to TRANSFERRING.
Cases where we stay waiting on CONNECT seems to be very rare from what I've seen.
So this timeout isn't directly related to the amount of time the user accept the remote connection on device. You can verify this, you will see that the connection times out after 20s, the delay we use in connection-manager to timeout the connection.

> 
> @@ +2179,5 @@
> > +  // so that if we race to be in CONNECT state while the server socket is still
> > +  // initializing, the connection is stuck in connecting state for 18.20 hours!
> > +  s.setTimeout(Ci.nsISocketTransport.TIMEOUT_CONNECT, 2);
> > +
> > +  // openOutputStream may throw NS_ERROR_NOT_INITIALIZED if we hit some race
> 
> Who throws NS_ERROR_NOT_INITIALIZED?  Is it the socket during
> createTransport, or the stream during openOutputStream?  I only see
> NS_ERROR_NOT_INITIALIZED being used in nsSocketTransportService2.cpp, so
> that makes me think it is from socket creation.

It is the stream during openOutputStream. We have a race between createTransport and openOutputStream. Remember that we keep trying to create a client socket. Without this patch we are doing that on a setTimeout(0) basis!

I guessed that it is being thrown by one of the two cases in 
nsSocketTransportService::Dispatch:
  http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#115
That is called from nsSocketTransport::PostEvent and nsSocketTransport::OpenOutputStream:
  http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransport2.cpp#1962


> @@ +2186,2 @@
> >    let transport = new DebuggerTransport(s.openInputStream(0, 0, 0),
> >                                          s.openOutputStream(0, 0, 0));
> 
> Is there a way we could refactor this in a follow up to check for some kind
> of state or event on the socket to be sure it is safe to call
> openOutputStream before doing it?

We can trap the exception and throw a custom one, but I don't think it will help much. We can return null in case of exception, but this won't be much helpfull than throw an assertion. The important point is about handling the exception and doing something appropriate when it throws. the connection-manager is a ideal place to do that.
Otherwise, I tried to look at `s`, it is a nsISocketTransport, and there isn't much information exposed out of it:
http://mxr.mozilla.org/mozilla-central/source/netwerk/base/public/nsISocketTransport.idl#31
isAlive() always returns false after createTransport call.
Comment on attachment 817805 [details] [diff] [review]
Fix races in debugger client when connecting agressively

Thanks for the additional explanation!

(In reply to Alexandre Poirot (:ochameau) from comment #21)
> > Even if it is 2 seconds, that might be fine for the simulator, but it feels
> > like it would be too short if someone tried to connect to a remote host. 
> > Maybe 10 seconds is better?
> 
> Actually what happens with devices is that the CONNECT state immediatly
> switch to TRANSFERRING.
> Cases where we stay waiting on CONNECT seems to be very rare from what I've
> seen.
> So this timeout isn't directly related to the amount of time the user accept
> the remote connection on device. You can verify this, you will see that the
> connection times out after 20s, the delay we use in connection-manager to
> timeout the connection.

I guess I was more thinking of a truly remote scenario where you go over the public Internet.  Perhaps, from one desktop Firefox to another somewhere totally different.  In that case, I could imagine it might take more than 2 seconds to CONNECT.  Surely this is a rare use case today though, since I am sure few people even know this is possible to do...  It would only really make sense if the target Firefox is browsing some non-public site, and even then you'd have to allow incoming connections from some random person. :)

If this value is larger, I guess it takes longer to finally connect to the simulator, since you have to reach this timeout and try again before it succeeds?  At least it is on the client-side of the connection, so it easier to change later if we need to.
Attachment #817805 - Flags: feedback+
Fabulous investigation Alex! Could you make sure that connect.js has the same fix as well? It doesn't look like we will be retiring that any time soon.

Also, while this workaround may be fine for now, shouldn't we try to fix this Mac-specific bug in the networking layer? Did you ping any of our necko folks?
It's not really clear if that's a necko bug. It may just be a different behavior on this platform.
But that would be really great if someone with strong network experience can reproduce that and look at nspr logs. If you know someone... I can give nspr logs and explain how to reproduce that.

The openOutputStream exception may be expected. I mean what we are doing is very agressive. We really keep trying to create a client socket to a server that is being initializing.

And the fact we end up stuck in connect state can be regular network behavior. But I also imagined that our server implementation may have a race between server socket instanciation and the beginning of socket listening/accepting.

Regarding the connect screen, I think these races only occurs when connecting (agressively) on the simulator. But if we have bugs reported against the connect screen failing to connect on Mac, it may be worth porting it.
(In reply to Alexandre Poirot (:ochameau) from comment #24)
> It's not really clear if that's a necko bug. It may just be a different
> behavior on this platform.
> But that would be really great if someone with strong network experience can
> reproduce that and look at nspr logs. If you know someone... I can give nspr
> logs and explain how to reproduce that.

I don't think a socket being stuck in the CONNECTING state for 18 hours is expected behavior. Patrick does the behavior described in comment 17 sound like a bug? If you aren't familiar with this part of the code, do you know who would be the right person to ask?

> The openOutputStream exception may be expected. I mean what we are doing is
> very agressive. We really keep trying to create a client socket to a server
> that is being initializing.
> 
> And the fact we end up stuck in connect state can be regular network
> behavior. But I also imagined that our server implementation may have a race
> between server socket instanciation and the beginning of socket
> listening/accepting.
> 
> Regarding the connect screen, I think these races only occurs when
> connecting (agressively) on the simulator. But if we have bugs reported
> against the connect screen failing to connect on Mac, it may be worth
> porting it.

We had similar reports in the past, and I think people still use the "Connect..." menu item with b2g desktop, which should be pretty much the same thing.
Flags: needinfo?(mcmanus)
(In reply to Panos Astithas [:past] from comment #25)
> (In reply to Alexandre Poirot (:ochameau) from comment #24)
> > It's not really clear if that's a necko bug. It may just be a different
> > behavior on this platform.
> > But that would be really great if someone with strong network experience can
> > reproduce that and look at nspr logs. If you know someone... I can give nspr
> > logs and explain how to reproduce that.
> 
> I don't think a socket being stuck in the CONNECTING state for 18 hours is
> expected behavior. Patrick does the behavior described in comment 17 sound
> like a bug? If you aren't familiar with this part of the code, do you know
> who would be the right person to ask?
> 

it is OS dependent. Higher level protocols (http etc) tend to enforce shorter timeouts, but if you are just using the socket transport directly then it is going to be up to the OS kernel.. you can of course use an nsitimer yourself to time it out if necessary or I suppose use the SetTimeout method on the socket - though I'm not aware of any code that actually does that.

the question is why the server hasn't closed things down sooner than that assuming it is healthy.
Flags: needinfo?(mcmanus)
Here is a nspr log sample, where we can see the connection being stuck on CONNECTING state...
So this scenario happens while we connect to a server that is in process of starting/binding/listening/accepting the server socket.
That happens over here:
  http://mxr.mozilla.org/mozilla-central/source/toolkit/devtools/server/main.js#422
While the client start from here:
  http://mxr.mozilla.org/mozilla-central/source/toolkit/devtools/client/dbg-client.jsm#2177
Attached patch Address review comments (obsolete) — Splinter Review
- fixed the comment in dbg-client
- catch the debuggerSocketConnect exception in connect screen
(The other fix, i.e. setTimeout, being in debuggerSocketConnect ensure that the connect screen already gains this fix)

I haven't increased the CONNECT timeout and let it at 2 seconds.
I feel like the common experience app manager users will have takes the lead
over a undocumented and may be totally unused feature.
And yes, you are right, the more we increase this timeout,
the more time it will take to finally connect to a simulator facing this race on mac.
Attachment #817805 - Attachment is obsolete: true
Attachment #817805 - Flags: review?(paul)
Attachment #818445 - Flags: review?(past)
Comment on attachment 818445 [details] [diff] [review]
Address review comments

Review of attachment 818445 [details] [diff] [review]:
-----------------------------------------------------------------

This is great, I only have a few minor comments.

::: browser/devtools/framework/connect/connect.js
@@ +59,5 @@
> +    gClient = new DebuggerClient(transport);
> +    let delay = Services.prefs.getIntPref("devtools.debugger.remote-timeout");
> +    gConnectionTimeout = setTimeout(handleConnectionTimeout, delay);
> +    gClient.connect(onConnectionReady);
> +  } catch(e) {

You should limit the try block to the debuggerSocketConnect call, otherwise we will never find out about other unrelated errors.

::: toolkit/devtools/client/connection-manager.js
@@ +224,5 @@
> +      try {
> +        transport = debuggerSocketConnect(this.host, this.port);
> +      } catch (e) {
> +        // In some cases, especially on Mac, openOutputStream method,
> +        // within debuggerSocketConnect, do throw NS_ERROR_NOT_INITIALIZED.

Perhaps a slightly better wording: "In some cases, especially on Mac, the openOutputStream call in debuggerSocketConnect may throw NS_ERROR_NOT_INITIALIZED."

::: toolkit/devtools/client/dbg-client.jsm
@@ +2186,2 @@
>    let transport = new DebuggerTransport(s.openInputStream(0, 0, 0),
>                                          s.openOutputStream(0, 0, 0));

We like all errors to be logged somewhere, so since you made call sites rely on the exception being thrown, I would suggest that you catch any errors here, Cu.reportError() and dumpn() them, and then rethrow. Alternatively have call sites log the error themselves.
Attachment #818445 - Flags: review?(past) → review+
Attachment #818445 - Attachment is obsolete: true
Attachment #819607 - Flags: review+
Keywords: checkin-needed
Whiteboard: [simulator] → [simulator][needed-in-aurora-26]
https://hg.mozilla.org/integration/fx-team/rev/568d296995e0
Keywords: checkin-needed
Whiteboard: [simulator][needed-in-aurora-26] → [simulator][needed-in-aurora-26][fixed-in-fx-team]
https://hg.mozilla.org/mozilla-central/rev/568d296995e0
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Whiteboard: [simulator][needed-in-aurora-26][fixed-in-fx-team] → [simulator][needed-in-aurora-26]
Target Milestone: --- → Firefox 27
Comment on attachment 819607 [details] [diff] [review]
Address comments, second round

[Approval Request Comment]
Bug caused by (feature/regressing bug #): new feature (app manager)
User impact if declined: some mac users can't connect to the simulator
Testing completed (on m-c, etc.): fx-team, m-c
Risk to taking this patch (and alternatives if risky): low
String or IDL/UUID changes made by this patch: none
Attachment #819607 - Flags: approval-mozilla-aurora?
Whiteboard: [simulator][needed-in-aurora-26] → [simulator][waiting-for-aurora-approval]
Attachment #819607 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Whiteboard: [simulator][waiting-for-aurora-approval] → [land-in-aurora-26]
Keywords: verifyme
Verified as fixed with both latest Aurora and latest beta, Firefox 26 beta 2 (build ID: 20131104182142) on Mac OS X 10.8.4 in 32-bit mode.

I can normally start the Firefox OS 1.2 simulator and I don't get any connection timeouts.
Product: Firefox → DevTools
Product: DevTools → DevTools Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: