Closed
Bug 876167
Opened 12 years ago
Closed 6 years ago
SCTP library crash when processing association abort while closing socket
Categories
(Core :: WebRTC: Networking, defect, P3)
Tracking
()
RESOLVED
WONTFIX
mozilla24
backlog | webrtc/webaudio+ |
People
(Reporter: jesup, Assigned: tuexen)
References
Details
(Keywords: crash, intermittent-failure)
Crash Data
Attachments
(3 files)
232.65 KB,
text/plain
|
Details | |
4.66 KB,
patch
|
jesup
:
review+
lsblakk
:
approval-mozilla-aurora+
lsblakk
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
3.89 KB,
patch
|
tuexen
:
review+
|
Details | Diff | Splinter Review |
With the patch in bug 872978 (patch https://bugzilla.mozilla.org/attachment.cgi?id=754142), the leaks are fixed but a probable bug in the SCTP library is exposed, where it's processing an incoming association ABORT while in the process of closing the local socket. This occurs on Windows (mostly on opt builds, but seen once on a debug build).
I'll attach stack traces.
The other option is to try to work around it; since this didn't occur before landing the leak-fix patch; we can probably defer SCTP shutdown until the peerconnection itself is shut down in order to land the leak fix. Under that assumption, marking as blocking-minus, but if we have a safe fix we'd take it and probably uplift it at least to Aurora/23, maybe to beta/22.
Reporter | ||
Comment 1•12 years ago
|
||
Reporter | ||
Comment 2•12 years ago
|
||
Reporter | ||
Comment 3•12 years ago
|
||
Note: This also needs the mochitest patch for datachannels from Bug 796894
Reporter | ||
Updated•12 years ago
|
Updated•12 years ago
|
Crash Signature: [@ ntdll.dll@0x32239]
Assignee | ||
Comment 4•12 years ago
|
||
Does the following patch fix the issue:
http://code.google.com/p/sctp-refimpl/source/detail?r=8481
It fixes an issue I can reproduce, but the crash I was able to reproduce
was a couple of lines earlier...
Reporter | ||
Comment 5•12 years ago
|
||
Trying that...
https://tbpl.mozilla.org/?tree=Try&rev=924b9d40e84a
Reporter | ||
Comment 6•12 years ago
|
||
Nope. First Win7 opt testrun hit the same crash
Assignee | ||
Comment 7•12 years ago
|
||
Thanks, I saw it. Will continue debugging...
Reporter | ||
Comment 8•12 years ago
|
||
Reporter | ||
Comment 9•12 years ago
|
||
Comment on attachment 757063 [details] [diff] [review]
avoid usrsctp library race condition on close with ABORT
includes the 1-liner from rev 8481 (though it may not matter)
This is a workaround patch that avoids the race but does not fix it, and will be backed out when the fixed library is imported.
Attachment #757063 -
Flags: review?(tuexen)
Reporter | ||
Comment 10•12 years ago
|
||
See Try at https://tbpl.mozilla.org/?tree=Try&rev=2632e699f5d4
The early oranges were an infrastructure problem; the later ones were a bug in the old version of the mochitests for datachannel. Before this patch this would have failed 50-75% of Win7 opt mochitest runs.
Whiteboard: [WebRTC][blocking-webrtc-][webrtc-uplift] → [WebRTC][blocking-webrtc-][webrtc-uplift][leave-open]
Assignee | ||
Comment 11•12 years ago
|
||
Comment on attachment 757063 [details] [diff] [review]
avoid usrsctp library race condition on close with ABORT
Review of attachment 757063 [details] [diff] [review]:
-----------------------------------------------------------------
Look good to me. r+
Reporter | ||
Comment 12•12 years ago
|
||
Comment on attachment 757063 [details] [diff] [review]
avoid usrsctp library race condition on close with ABORT
r=tuexen
Attachment #757063 -
Flags: review?(tuexen) → review+
Reporter | ||
Comment 13•12 years ago
|
||
Whiteboard: [WebRTC][blocking-webrtc-][webrtc-uplift][leave-open] → [WebRTC][blocking-webrtc-][leave-open]
Target Milestone: --- → mozilla24
Updated•12 years ago
|
Whiteboard: [WebRTC][blocking-webrtc-][leave-open] → [WebRTC][blocking-webrtc-][leave-open][webrtc-uplift]
Comment 14•12 years ago
|
||
Reporter | ||
Comment 15•12 years ago
|
||
Comment on attachment 757063 [details] [diff] [review]
avoid usrsctp library race condition on close with ABORT
[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A (bug in upstream library)
User impact if declined: Crash (maybe worse in theory) if an incoming ABORT collides with a local close of the DataChannelConnection. If the leak fix lands, the mochitests about to be added to the tree will hit this failure frequently on Win7 Opt builds; rarely on debug builds. In theory this might fail on Linux, but we haven't seen it.
Testing completed (on m-c, etc.): on m-c, ~45 retriggers with no failures. This is a workaround to avoid the race condition until a root-cause fix can land and be imported (Michael Tuexen is looking at the base problem).
Risk to taking this patch (and alternatives if risky): Low. Proxies existing close to the STS thread to it can't race against incoming packets on STS. Risk would be if I missed something in the state of the DataChannelConnection; overall a simple (wallpaper) fix though.
Even if we don't take the leak fix, we should strongly consider taking this, as it could in theory happen to any connection with the right timing - the leak fix makes it easier to hit with the mochitests.
String or IDL/UUID changes made by this patch: none
Attachment #757063 -
Flags: approval-mozilla-beta?
Attachment #757063 -
Flags: approval-mozilla-aurora?
Comment 16•12 years ago
|
||
Tracking this so we can get early warning if there are regressions or unexpected fallout.
status-firefox21:
--- → unaffected
status-firefox22:
--- → affected
status-firefox23:
--- → affected
status-firefox24:
--- → fixed
tracking-firefox22:
--- → +
tracking-firefox23:
--- → +
Updated•12 years ago
|
Attachment #757063 -
Flags: approval-mozilla-beta?
Attachment #757063 -
Flags: approval-mozilla-beta+
Attachment #757063 -
Flags: approval-mozilla-aurora?
Attachment #757063 -
Flags: approval-mozilla-aurora+
Reporter | ||
Comment 17•12 years ago
|
||
https://hg.mozilla.org/releases/mozilla-beta/rev/de4c9a61681a
aurora is blocked on closure
Reporter | ||
Comment 18•12 years ago
|
||
Whiteboard: [WebRTC][blocking-webrtc-][leave-open][webrtc-uplift] → [WebRTC][blocking-webrtc-][leave-open]
Comment 19•12 years ago
|
||
I verified the TBPL logs for Mozilla-Beta/Win 7 and the crash does not appear.
Is there anything else QA can do to manually verify this?
Reporter | ||
Comment 20•12 years ago
|
||
Not really; the patch that exposed the bug has landed; if you add the datachannel tests patch referenced above, you can see if you hit failures on Win7 opt Try builds when retriggered a number of times. True verification would be to do that, and then back this patch out and see if you see the crashes. If you don't see the crashes with this backed out, and the datachannel test in, then you can't be sure because the tests aren't hitting the intermittent.
Not an easy verification.
Reporter | ||
Comment 21•11 years ago
|
||
Still needs to be backed out when we update SCTP
Depends on: 916427
Reporter | ||
Comment 22•11 years ago
|
||
Reporter | ||
Comment 23•11 years ago
|
||
Comment on attachment 8389843 [details] [diff] [review]
backout temporary fix for SCTP library bug (now fixed)
This is just undoing the previous patch. Since we've updated the SCTP library, we should no longer need to avoid this collision.
Attachment #8389843 -
Flags: review?(tuexen)
Assignee | ||
Updated•11 years ago
|
Attachment #8389843 -
Flags: review?(tuexen) → review+
Reporter | ||
Comment 24•11 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/7ada8c5a0480
I had a try run that was orange due to non-related issues in the underlying repo, but all the datachannel tests ran without errors.
Whiteboard: [WebRTC][blocking-webrtc-][leave-open] → [WebRTC][blocking-webrtc-]
Comment 25•11 years ago
|
||
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 27•11 years ago
|
||
Turns out the library bug was *not* fixed; backing out the backout
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter | ||
Comment 28•11 years ago
|
||
Reporter | ||
Comment 29•11 years ago
|
||
Whiteboard: [WebRTC][blocking-webrtc-] → [WebRTC][blocking-webrtc-][leave-open]
Updated•10 years ago
|
backlog: --- → webRTC+
Rank: 14
Whiteboard: [WebRTC][blocking-webrtc-][leave-open] → [leave-open]
Reporter | ||
Comment 33•10 years ago
|
||
Poke to check if we've fixed this in SCTP and can back out the wallpaper
Severity: critical → normal
Rank: 14 → 25
Flags: needinfo?(rjesup)
Priority: P1 → P2
Reporter | ||
Updated•10 years ago
|
Flags: needinfo?(rjesup)
Whiteboard: [leave-open]
Reporter | ||
Comment 34•9 years ago
|
||
Michael - is this bug fixed upstream? Or do we still need to proxy the socket close?
Flags: needinfo?(tuexen)
Assignee | ||
Comment 35•9 years ago
|
||
I'm not aware of a specific fix, since I couldn't reproduce it the problem.
Keep the work around for now.
Flags: needinfo?(tuexen)
Comment 36•7 years ago
|
||
Mass change P2->P3 to align with new Mozilla triage process.
Priority: P2 → P3
Comment 37•6 years ago
|
||
Closing because no crash reported since 12 weeks.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 6 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•