Closed Bug 807647 Opened 7 years ago Closed 7 years ago

Hang when calling DataChannel.send

Categories

(Core :: WebRTC, defect)

x86_64
Linux
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla19
Tracking Status
firefox18 --- fixed
firefox19 --- fixed

People

(Reporter: ted, Assigned: jesup)

References

Details

(Keywords: hang)

Attachments

(2 files, 3 obsolete files)

Attached file stack
I have a demo I'm hacking on that reliably hangs on my Linux64 system when I call DataChannel.send. Attached is "thread apply all bt" from the hung process.

The salient bits seem to be:
1) Connect a PeerConnection
2) Add two DataChannels (in this demo: one reliable, one unreliable)
3) Call send on the reliable DataChannel

I'll try to reduce my demo down to a smaller testcase.
Note that I have a different demo that only creates one DataChannel, and this does not hang (on the exact same build).
jesup's suggested hack makes the hang go away:

diff --git a/netwerk/sctp/datachannel/DataChannel.cpp b/netwerk/sctp/datachannel
--- a/netwerk/sctp/datachannel/DataChannel.cpp
+++ b/netwerk/sctp/datachannel/DataChannel.cpp
@@ -410,17 +410,18 @@ DataChannelConnection::SendPacket(const 
 /* static */
 int
 DataChannelConnection::SctpDtlsOutput(void *addr, void *buffer, size_t length,
                                       uint8_t tos, uint8_t set_df)
 {
   DataChannelConnection *peer = static_cast<DataChannelConnection *>(addr);
   int res;
 
-  if (peer->IsSTSThread()) {
+  if (1) {
Keywords: hang
Version: unspecified → Trunk
I've had no luck producing a really reduced testcase, but here's a test that does repro for me, albeit with a few manual steps:
http://people.mozilla.org/~tmielczarek/webrtc-datachannels-hang/data-demo.html

1) Load some other page (not sure why, but this is required)
2) Load that page
3) Once the URL changes to include ?id=..., copy the URL and load it in a new tab.
4) Once both tabs say "Connected" type something into the input on the first tab and hit enter
5) Hang
WIP #2, this appears to work in the reduced testcase - still with a few cleanups to do
Attachment #677429 - Attachment is obsolete: true
This patch seems to break my demo from comment 4 in other interesting ways. Notably I don't get onconnect notifications for my PeerConnections on both sides, so things just get stuck. Also I get a 100% CPU hang on shutdown.
This should solve crashing and shutdown problems with your testcases; please try (had wrong return value for cheating about whether it succeded sending)
Attachment #677444 - Attachment is obsolete: true
The combination of the latest patch here + the patch from bug 807929 fixes my issues.
Cleaned up, with commentary on why we're doing this and disabled code we can use if we resolve the lock issues in SCTP
Attachment #677696 - Attachment is obsolete: true
Comment on attachment 677757 [details] [diff] [review]
proxy incoming DTLS data to SCTP via the STS thread to avoid locking issues

Once I have a chance to unravel the locks in SCTP I may be able to make the output call without any locks held, or just 'safe' locks, so I'm not deleting the IsSTSThread case, just disabling it and commenting.
Attachment #677757 - Flags: review?(ekr)
Comment on attachment 677757 [details] [diff] [review]
proxy incoming DTLS data to SCTP via the STS thread to avoid locking issues

Review of attachment 677757 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #677757 - Flags: review?(ekr) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/ff560027a8ba

The previous fix bug 806169 and this one (fallout from that) should be uplifted to Aurora after verification.  Ted's tests and mine seem to run and stop cleanly now.
Marking FF18 as unaffected because it isn't until Bug 806169 uplifts (which fixes a major problem with calling NSS on teh wrong thread)
Target Milestone: --- → mozilla19
Assignee: nobody → rjesup
https://hg.mozilla.org/mozilla-central/rev/ff560027a8ba
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
Comment on attachment 677757 [details] [diff] [review]
proxy incoming DTLS data to SCTP via the STS thread to avoid locking issues

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 806169
User impact if declined: Can't land bug 801221 (or need to disable DataChannels)
Testing completed (on m-c, etc.): baked on m-c for weeks
Risk to taking this patch (and alternatives if risky): Minimal; this copies all the data and makes the send asynchronous instead of synchronous (which can cause a deadlock on usrsctp_connect()).
String or UUID changes made by this patch: none
Attachment #677757 - Flags: approval-mozilla-aurora?
Comment on attachment 677757 [details] [diff] [review]
proxy incoming DTLS data to SCTP via the STS thread to avoid locking issues

[Triage Comment]
Approving for FF18, as our understanding is that this could only regress WebRTC (pref'd off in FF18).

If this is landed on mozilla-aurora before ~11AM PT tomorrow, this will make the merge from Aurora 18 to Beta 18. If landed 11AM-5PM PT tomorrow on mozilla-beta, it will make the first FF18 Beta. If landed after that, it will end up in the second FF18 beta.
Attachment #677757 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Ted - I tried running your test case steps from comment 4. However, when I try to run it, I end up getting an error reported:

Timestamp: 11/23/2012 11:41:58 PM
Error: Firefox can't establish a connection to the server at http://webrtc-broker.herokuapp.com/offer/6263524D-7C25-4FEE-BEF8-93AFC8322278/answer.
Source File: http://people.mozilla.org/~tmielczarek/webrtc-datachannels-hang/webrtc-broker-client.js
Line: 40

Any ideas?
Flags: needinfo?(ted)
Flags: needinfo?(ted)
Verified by doing a sending of a message from a reliable DC to a unreliable DC.
Status: RESOLVED → VERIFIED
Keywords: verifyme
Depends on: 1125947
You need to log in before you can comment on or make changes to this bug.