Closed Bug 1214279 Opened 8 years ago Closed 8 years ago

Websocket disconnected in firefox beta 42.0b5

Categories

(Core :: WebRTC: Networking, defect, P1)

42 Branch
x86_64
Windows
defect

Tracking

()

VERIFIED FIXED
mozilla44
Tracking Status
firefox41 --- unaffected
firefox42 + fixed
firefox43 + fixed
firefox44 + fixed
firefox-esr38 --- unaffected
b2g-master --- fixed

People

(Reporter: rdaware, Assigned: bwc)

Details

Attachments

(2 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36

Steps to reproduce:

We are developing a conference product based on WebRTC, we have two websockets, one connect to the conference server, another one connect to the MCU server as the WebRTC signalling channel. At first, the two web sockets are ok and work fine, then we create local offer, and send the offer SDP via the second websocket


Actual results:

both web sockets will disconnect automatically after we send the offer sdp


Expected results:

the websocket should not disconnect, this issue is not seen in firefox 41 and is not seen in firefox 42 on mac, it is only reproducible on firefox 42 on windows
We have a code similar to following code mentioned on developer.mozilla.org webpage
var pc = new PeerConnection();
pc.addStream(video);
pc.createOffer(function(desc) {
  pc.setLocalDescription(desc, function() {
    // send the offer to a server THROUGH WEBSOCKET
  });
});

After this segment, all websocket will hang.
We use the following code mentioned on develop.mozilla.org page. 


var pc = new PeerConnection();
pc.addStream(video);
pc.createOffer(function(desc) {
  pc.setLocalDescription(desc, function() {
    // send the offer to a server USING WEBSOCKET
  });
});

As we run this segment, the websocket connections hang. We see this problem in windows OS only.
Severity: normal → blocker
OS: Unspecified → Windows
Hardware: Unspecified → x86_64
Component: Untriaged → Networking: WebSockets
Product: Firefox → Core
Michal, can you take a look at this?
Flags: needinfo?(michal.novotny)
rdaware: It would probably be helpful if you could get us an NSPR log that includes the websocket disconnect here.  I.e. follow these instructions

  https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging#Windows

but with "nsWebSocket" in the list, i.e.

  set NSPR_LOG_MODULES=timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5,nsWebSocket
  
There's generally not any private info in the LOGS (doesn't show cookies, etc), but feel free to email it privately to me or michal if you don't want to attach it here.
Flags: needinfo?(rdaware)
I tried to reproduce this issue on Mac FF41.0.2. No success. Might be a windows only problem?
(In reply to steinbrecher.johann from comment #5)
> I tried to reproduce this issue on Mac FF41.0.2. No success. Might be a
> windows only problem?

Yes this is windows only problem
Flags: needinfo?(rdaware)
(In reply to steinbrecher.johann from comment #5)
> I tried to reproduce this issue on Mac FF41.0.2. No success. Might be a
> windows only problem?

It occurs in FF on Windows only. I tested with FF 42.0b7 today using multiple different webRTC apps and all websockets hang on sending the offer SDP in setLocalDescription.

Michal, NSPR log was sent by rdaware a few days ago. If you need more let us know.
(In reply to keithgrif from comment #7)
> (In reply to steinbrecher.johann from comment #5)
> > I tried to reproduce this issue on Mac FF41.0.2. No success. Might be a
> > windows only problem?
> 
> It occurs in FF on Windows only. I tested with FF 42.0b7 today using
> multiple different webRTC apps and all websockets hang on sending the offer
> SDP in setLocalDescription.

Which apps?  if you can give us links that would help a ton since then we can test ourselves
Flags: needinfo?(keithgrif)
(In reply to keithgrif from comment #7)

> Michal, NSPR log was sent by rdaware a few days ago. If you need more let us
> know.

Those weren't NSPR logs; those were about:webrtc info.  Please follow the steps from Michal to get the logs we need to look at this.  Thanks
Flags: needinfo?(rdaware)
rdaware: also, I assume there's something holding the websocket 'alive' - since no code for the websocket send was given, it's hard to tell what's happening.

The more context/reproducibility it has, the better
(In reply to Randell Jesup [:jesup] from comment #8)
> Which apps?  if you can give us links that would help a ton since then we
> can test ourselves

Yes please. I've tried apprtc, talky.io and appear.in on my windows box running 42.0b8 and they all work for me.
(In reply to Jan-Ivar Bruaroey [:jib] from comment #11)
> (In reply to Randell Jesup [:jesup] from comment #8)
> > Which apps?  if you can give us links that would help a ton since then we
> > can test ourselves
> 
> Yes please. I've tried apprtc, talky.io and appear.in on my windows box
> running 42.0b8 and they all work for me.

Please try this one. https://github.com/emannion/webrtc-web-socket
Flags: needinfo?(rdaware)
(In reply to Randell Jesup [:jesup] from comment #9)
> (In reply to keithgrif from comment #7)
> 
> > Michal, NSPR log was sent by rdaware a few days ago. If you need more let us
> > know.
> 
> Those weren't NSPR logs; those were about:webrtc info.  Please follow the
> steps from Michal to get the logs we need to look at this.  Thanks

I believe he sent NSPR logs by email.
Flags: needinfo?(keithgrif)
(In reply to Jan-Ivar Bruaroey [:jib] from comment #11)
> (In reply to Randell Jesup [:jesup] from comment #8)
> > Which apps?  if you can give us links that would help a ton since then we
> > can test ourselves
> 
> Yes please. I've tried apprtc, talky.io and appear.in on my windows box
> running 42.0b8 and they all work for me.

I also tested with apprtc initially and I could be wrong but looks like it was using XHR and not websockets so did not hit the same scenario.
I have sent NSPR logs by email to Jason, Michal, jesup and jib
(In reply to rdaware from comment #12)
> Please try this one. https://github.com/emannion/webrtc-web-socket

I installed this on my windows box and it worked fine for me, running FF 42.0b8 on Windows on one end and a mac on the same LAN on the other. I tired both as offerer and answerer. Not able to reproduce.
I see a lot of this in the node output:

> send() error: Error: write after send

but it works.
(In reply to Jan-Ivar Bruaroey [:jib] from comment #17)
> I see a lot of this in the node output:
> 
> > send() error: Error: write after send
> 
> but it works.

I can still reproduce it on Windows 10, FF 42.0b8. After SDP offer, websocket gets disconnected. This does not happen in FF 41.0.2.
I also tried with talky.io and Cisco Spark between 42.0b8 on Win7 and DevEdition on Mac, but can't reproduce. Although it is not 100% clear to me what I'm looking for here. The call does not get established at all, or is the call setup just delayed?
(In reply to rdaware from comment #18)
> I can still reproduce it on Windows 10, FF 42.0b8. After SDP offer,
> websocket gets disconnected. This does not happen in FF 41.0.2.

Is this a Win 10 only problem?
We can reproduce this on most windows machines but not all. Please try some more windows machines. We reproduced this on Win 7 and 10. Mac does not have this issue.
(In reply to Nils Ohlmeier [:drno] from comment #20)
> (In reply to rdaware from comment #18)
> > I can still reproduce it on Windows 10, FF 42.0b8. After SDP offer,
> > websocket gets disconnected. This does not happen in FF 41.0.2.
> 
> Is this a Win 10 only problem?

I am using Win 7 and can reproduce.
Perhaps it's network topology dependent? For those who can reproduce consistently, another way to narrow this down might be to run mozregression to find a regression-window. See http://mozilla.github.io/mozregression/
(In reply to Nils Ohlmeier [:drno] from comment #19)
> I also tried with talky.io and Cisco Spark between 42.0b8 on Win7 and
> DevEdition on Mac, but can't reproduce. Although it is not 100% clear to me
> what I'm looking for here. The call does not get established at all, or is
> the call setup just delayed?

The call does not get established. When you try to establish the call all websockets hang and the app loses connectivity.
(In reply to Jan-Ivar Bruaroey [:jib] from comment #23)
> Perhaps it's network topology dependent? For those who can reproduce
> consistently, another way to narrow this down might be to run mozregression
> to find a regression-window. See http://mozilla.github.io/mozregression/

I am having trouble using mozregression. It crashes all the time. Here is crash log:

platform: Windows-8-6.2.9200
python: 2.7.6 FROZEN (32bit)
mozregui: 0.4.1
mozregression: 1.0.1
message: HTTPError: 404 Client Error: Not Found
traceback:   File ".\mozregui\bisection.py", line 130, in bisect
  File "..\mozregression\bisector.py", line 425, in bisect
  File ".\mozregui\bisection.py", line 156, in _bisect
  File ".\mozregui\bisection.py", line 186, in _bisect_next
  File "..\mozregression\bisector.py", line 296, in init_handler
  File "..\mozregression\bisector.py", line 134, in initialize
  File "..\mozregression\bisector.py", line 64, in initialize
  File "..\mozregression\build_range.py", line 78, in __getitem__
  File "..\mozregression\build_range.py", line 34, in build_info
  File "..\mozregression\build_range.py", line 28, in _fetch
  File "..\mozregression\fetch_build_info.py", line 236, in find_build_info
  File "..\mozregression\fetch_build_info.py", line 213, in _get_urls
  File "..\mozregression\fetch_build_info.py", line 200, in _get_month_links
  File "..\mozregression\network.py", line 69, in url_links
  File "C:\Python27\lib\site-packages\requests\models.py", line 831, in raise_for_status
Shucks. FWIW it is possible to do manually, by downloading and testing from https://ftp.mozilla.org/pub/firefox/nightly/ . Using a binary search pattern it's not too terrible.
(In reply to keithgrif from comment #24)
> (In reply to Nils Ohlmeier [:drno] from comment #19)
> > I also tried with talky.io and Cisco Spark between 42.0b8 on Win7 and
> > DevEdition on Mac, but can't reproduce. Although it is not 100% clear to me
> > what I'm looking for here. The call does not get established at all, or is
> > the call setup just delayed?
> 
> The call does not get established. When you try to establish the call all
> websockets hang and the app loses connectivity.

The only thing I can think of that WebSockets and PeerConenctions really share (other than NSS) is the Socket thread.  If the Socket thread locks up, it would break WebSockets ... but it would also break all network traffic, not just WebSockets.

Please use my work email address (jesup at mozilla.com) for direct email; the other address gets bugmail filtered from it, and the rest goes into a low-signal mailbox I frequently miss things in.

Also: that log only has websocket traffic, so there's no way to tell when setlocaldescription was called.  Please add signaling:4,mtransport:4,timestamp to the NSPR_LOG_MODULES line and resend, thanks
Flags: needinfo?(rdaware)
Flags: needinfo?(keithgrif)
(In reply to Randell Jesup [:jesup] from comment #27)
> (In reply to keithgrif from comment #24)
> > (In reply to Nils Ohlmeier [:drno] from comment #19)
> > > I also tried with talky.io and Cisco Spark between 42.0b8 on Win7 and
> > > DevEdition on Mac, but can't reproduce. Although it is not 100% clear to me
> > > what I'm looking for here. The call does not get established at all, or is
> > > the call setup just delayed?
> > 
> > The call does not get established. When you try to establish the call all
> > websockets hang and the app loses connectivity.
> 
> The only thing I can think of that WebSockets and PeerConenctions really
> share (other than NSS) is the Socket thread.  If the Socket thread locks up,
> it would break WebSockets ... but it would also break all network traffic,
> not just WebSockets.
> 
> Please use my work email address (jesup at mozilla.com) for direct email;
> the other address gets bugmail filtered from it, and the rest goes into a
> low-signal mailbox I frequently miss things in.
> 
> Also: that log only has websocket traffic, so there's no way to tell when
> setlocaldescription was called.  Please add
> signaling:4,mtransport:4,timestamp to the NSPR_LOG_MODULES line and resend,
> thanks

NSPR Logs with signaling and mtransport sent by email. I also noticed that some calls were being established so the issue was not occuring always like I suggested yesterday. I think you your observation about the socket thread is related as I need to restart FF to get any connection including in new tabs after the issue occurs.
Flags: needinfo?(keithgrif)
(In reply to Jason Duell [:jduell] (needinfo? me) from comment #4)
>   set
> NSPR_LOG_MODULES=timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,
> nsHostResolver:5,nsWebSocket

There is no log level specified for nsWebSocket module, so default level 1 is used and we log nothing for this level. There must be nsWebSocket:5
Flags: needinfo?(michal.novotny)
(In reply to Michal Novotny (:michal) from comment #29)
> (In reply to Jason Duell [:jduell] (needinfo? me) from comment #4)
> >   set
> > NSPR_LOG_MODULES=timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,
> > nsHostResolver:5,nsWebSocket
> 
> There is no log level specified for nsWebSocket module, so default level 1
> is used and we log nothing for this level. There must be nsWebSocket:5

Emailed new logs with nsWebSocket:5 and additional parameters from Randell.
I'm taking this on as a WebRTC bug and asking Randell to investigate it at a priority until we know exactly what's going on here. I'm copying Byron in case we need his help;  Nils is on PTO today and back tomorrow.

NOTE: We still have been able to repro this, but 2 Cisco folks can repro it on Windows ONLY.
Assignee: nobody → rjesup
Rank: 5
Component: Networking: WebSockets → WebRTC: Networking
Priority: -- → P1
backlog: --- → webrtc/webaudio+
It might be good to know whether _any_ network operations work once this error is encountered. It is not impossible that you're somehow getting the STS thread to deadlock.
(In reply to Jan-Ivar Bruaroey [:jib] from comment #26)
> Shucks. FWIW it is possible to do manually, by downloading and testing from
> https://ftp.mozilla.org/pub/firefox/nightly/ . Using a binary search pattern
> it's not too terrible.

I manually checked. The last good build is 41.0b9. I can reproduce this issue from 42.0b1 onwards. Hope that helps to narrow down.
Flags: needinfo?(rdaware)
(also asked in email):
Can you add some additional logging, as shown in https://wiki.mozilla.org/Media/WebRTC/Logging?  In particular, the R_LOG info - you'll need to dump stderr/out to a file when running firefox, note!!!  And use R_LOG_LEVEL=7.  And continue to use the existing logs you were grabbing (you'll end up with two files, one from stderr, one from NSPR_LOG_FILE).

Also: did you test 41 release?
Thanks
Flags: needinfo?(rdaware)
Flags: needinfo?(keithgrif)
Correcting a typo in comment 31:
NOTE: We still have *NOT* been able to repro this, but 2 Cisco folks can repro it on Windows ONLY.

@keithgrif, @rdaware -- Were you on Cisco's corporate network when you saw/repro'd this problem?  Have you ever seen this problem from outside that corporate network (i.e. a home network)?  I'm wondering if this bug follows the network or the machines -- or if it's the combination that triggers this for you.
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #35)
> Correcting a typo in comment 31:
> NOTE: We still have *NOT* been able to repro this, but 2 Cisco folks can
> repro it on Windows ONLY.
> 
> @keithgrif, @rdaware -- Were you on Cisco's corporate network when you
> saw/repro'd this problem?  Have you ever seen this problem from outside that
> corporate network (i.e. a home network)?  I'm wondering if this bug follows
> the network or the machines -- or if it's the combination that triggers this
> for you.

Yes I can reproduce on the corporate network and from home over VPN to corporate network. I made three calls from home without VPN just now with no issue. The difference in the calls being that the offer goes through. When this fails the offer does not go through and the connection hangs. I think it is safe to assume that it is a combination.
Flags: needinfo?(keithgrif)
This is based on Beta, with the fix:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e3ba8bbced83

You should be able to download builds with the fix in an hour or so.  Click on the green 'B', on the lower left click on "Build:" and it will take you to the download directory.
Flags: needinfo?(keithgrif)
Flags: needinfo?(emannion)
Bug 1214279: Fix the same infinite loop from bug 957236 in a different place. r?drno
Attachment #8677450 - Flags: review?(drno)
Attachment #8677439 - Attachment is obsolete: true
Attachment #8677439 - Flags: review?(docfaraday)
[Tracking Requested - why for this release]:
Locks up the browser if you have a specific network interface name and try to use webrtc - even a hidden datachannel use.  Trivial safe fix which had previously been done for another version of the same code in this file.
Assignee: rjesup → docfaraday
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attachment #8677450 - Flags: review?(drno) → review+
Comment on attachment 8677450 [details]
MozReview Request: Bug 1214279: Fix the same infinite loop from bug 957236 in a different place. r?drno

https://reviewboard.mozilla.org/r/22941/#review20479
Although it is good to fix that infinite loop I'm wondering how this could be the cause of the problem as this infinite loop has been in our code since release 21 (introduce by bug 825927).
The other explanation could be that the Cisco VPN recently started to use +'s in its interface names. But that would then still not explain why it works with older Fx version, but not with 42.
This function was unused until 42 when IPv6 support landed.
Keywords: checkin-needed
Comment on attachment 8677450 [details]
MozReview Request: Bug 1214279: Fix the same infinite loop from bug 957236 in a different place. r?drno

Approval Request Comment
[Feature/regressing bug #]:

   Bug 797262

[User impact if declined]:

   STS will hang for some windows users (with unusual interface names) when using webrtc. This is probably going to be corporate users, since ordinary users don't name their interfaces.

[Describe test coverage new/current, TreeHerder]:

   None, sadly. We have no control over the interface names on our CI boxes.

[Risks and why]: 

   Extremely low; the fix is innocuous, and we already landed the same bugfix to another function in the same file in bug 957236 with no ill effect.

[String/UUID change made/needed]:

   None.
Attachment #8677450 - Flags: approval-mozilla-beta?
Attachment #8677450 - Flags: approval-mozilla-aurora?
Sylvestre, you probably want this for today's build but I'm going to leave that to you.
Tracking for 43+ , sounds like a regression from 42.
Flags: needinfo?(sledru)
(In reply to Byron Campen [:bwc] from comment #45)
> Comment on attachment 8677450 [details]
> MozReview Request: Bug 1214279: Fix the same infinite loop from bug 957236
> in a different place. r?drno
> 
> Approval Request Comment
> [Feature/regressing bug #]:
> 
>    Bug 797262

This was the recent (42) enabling of IPV6.  The bug is in the initial import of nICEr (this was broken in the initial import in 2012 to Alder).  This was fixed almost 2 years ago, but only one copy of the routine was fixed, and the IPV6 landing switched us to the other (unpatched) copy.  That's why it regressed in 42.

> [User impact if declined]:
> 
>    STS will hang for some windows users (with unusual interface names) when
> using webrtc. This is probably going to be corporate users, since ordinary
> users don't name their interfaces.

For example, all of Cisco seems affected.  And any use of WebRTC will trigger a browser lockup for those users, including incidental use of DataChannels by a page in place of WebSockets or attempts to de-anonymize users or block clickfraud (c.f. the NYTimes issue earlier).

> 
> [Describe test coverage new/current, TreeHerder]:
> 
>    None, sadly. We have no control over the interface names on our CI boxes.

None today; we might be able to test this in a cppunit test in the future.

> 
> [Risks and why]: 
> 
>    Extremely low; the fix is innocuous, and we already landed the same
> bugfix to another function in the same file in bug 957236 with no ill effect.
> 
> [String/UUID change made/needed]:
> 
>    None.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #47)
> Sylvestre, you probably want this for today's build but I'm going to leave
> that to you.
> Tracking for 43+ , sounds like a regression from 42.

We definitely want this one.  Very, very safe.  I'm ready to run on the hamster wheel to power the build for this one.  :-)
Comment on attachment 8677450 [details]
MozReview Request: Bug 1214279: Fix the same infinite loop from bug 957236 in a different place. r?drno

We do want that in beta 9, approved.
Flags: needinfo?(sledru)
Attachment #8677450 - Flags: approval-mozilla-beta?
Attachment #8677450 - Flags: approval-mozilla-beta+
Attachment #8677450 - Flags: approval-mozilla-aurora?
Attachment #8677450 - Flags: approval-mozilla-aurora+
(In reply to Sylvestre Ledru [:sylvestre] from comment #50)
> Comment on attachment 8677450 [details]
> MozReview Request: Bug 1214279: Fix the same infinite loop from bug 957236
> in a different place. r?drno
> 
> We do want that in beta 9, approved.

and it landed for beta 9 in time , just the commit message here is/was missing :) adding now 

https://hg.mozilla.org/releases/mozilla-beta/rev/894bc1fcf2a9
Initial testing on Nightly looks good. No longer able to reproduce the issue and will continue to test with other affected webrtc products. Thanks for your efforts.
Flags: needinfo?(keithgrif)
https://hg.mozilla.org/mozilla-central/rev/e56eb1908faf
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
Flags: needinfo?(rdaware)
Flags: needinfo?(emannion)
Rahul and Keith (who originally found the bug) report the issue is fixed for them in Beta 9.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.