Closed Bug 876167 Opened 12 years ago Closed 6 years ago

SCTP library crash when processing association abort while closing socket

Categories

(Core :: WebRTC: Networking, defect, P3)

x86
Windows 7
defect

Tracking

()

RESOLVED WONTFIX
mozilla24
Tracking Status
firefox21 --- unaffected
firefox22 + fixed
firefox23 + fixed
firefox24 --- fixed

People

(Reporter: jesup, Assigned: tuexen)

References

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(3 files)

With the patch in bug 872978 (patch https://bugzilla.mozilla.org/attachment.cgi?id=754142), the leaks are fixed but a probable bug in the SCTP library is exposed, where it's processing an incoming association ABORT while in the process of closing the local socket. This occurs on Windows (mostly on opt builds, but seen once on a debug build). I'll attach stack traces. The other option is to try to work around it; since this didn't occur before landing the leak-fix patch; we can probably defer SCTP shutdown until the peerconnection itself is shut down in order to land the leak fix. Under that assumption, marking as blocking-minus, but if we have a safe fix we'd take it and probably uplift it at least to Aurora/23, maybe to beta/22.
Attached file callback
Note: This also needs the mochitest patch for datachannels from Bug 796894
Blocks: 872978, 796894
Crash Signature: [@ ntdll.dll@0x32239]
Does the following patch fix the issue: http://code.google.com/p/sctp-refimpl/source/detail?r=8481 It fixes an issue I can reproduce, but the crash I was able to reproduce was a couple of lines earlier...
Nope. First Win7 opt testrun hit the same crash
Thanks, I saw it. Will continue debugging...
Comment on attachment 757063 [details] [diff] [review] avoid usrsctp library race condition on close with ABORT includes the 1-liner from rev 8481 (though it may not matter) This is a workaround patch that avoids the race but does not fix it, and will be backed out when the fixed library is imported.
Attachment #757063 - Flags: review?(tuexen)
See Try at https://tbpl.mozilla.org/?tree=Try&rev=2632e699f5d4 The early oranges were an infrastructure problem; the later ones were a bug in the old version of the mochitests for datachannel. Before this patch this would have failed 50-75% of Win7 opt mochitest runs.
Whiteboard: [WebRTC][blocking-webrtc-][webrtc-uplift] → [WebRTC][blocking-webrtc-][webrtc-uplift][leave-open]
Comment on attachment 757063 [details] [diff] [review] avoid usrsctp library race condition on close with ABORT Review of attachment 757063 [details] [diff] [review]: ----------------------------------------------------------------- Look good to me. r+
Comment on attachment 757063 [details] [diff] [review] avoid usrsctp library race condition on close with ABORT r=tuexen
Attachment #757063 - Flags: review?(tuexen) → review+
Whiteboard: [WebRTC][blocking-webrtc-][webrtc-uplift][leave-open] → [WebRTC][blocking-webrtc-][leave-open]
Target Milestone: --- → mozilla24
Whiteboard: [WebRTC][blocking-webrtc-][leave-open] → [WebRTC][blocking-webrtc-][leave-open][webrtc-uplift]
Comment on attachment 757063 [details] [diff] [review] avoid usrsctp library race condition on close with ABORT [Approval Request Comment] Bug caused by (feature/regressing bug #): N/A (bug in upstream library) User impact if declined: Crash (maybe worse in theory) if an incoming ABORT collides with a local close of the DataChannelConnection. If the leak fix lands, the mochitests about to be added to the tree will hit this failure frequently on Win7 Opt builds; rarely on debug builds. In theory this might fail on Linux, but we haven't seen it. Testing completed (on m-c, etc.): on m-c, ~45 retriggers with no failures. This is a workaround to avoid the race condition until a root-cause fix can land and be imported (Michael Tuexen is looking at the base problem). Risk to taking this patch (and alternatives if risky): Low. Proxies existing close to the STS thread to it can't race against incoming packets on STS. Risk would be if I missed something in the state of the DataChannelConnection; overall a simple (wallpaper) fix though. Even if we don't take the leak fix, we should strongly consider taking this, as it could in theory happen to any connection with the right timing - the leak fix makes it easier to hit with the mochitests. String or IDL/UUID changes made by this patch: none
Attachment #757063 - Flags: approval-mozilla-beta?
Attachment #757063 - Flags: approval-mozilla-aurora?
Tracking this so we can get early warning if there are regressions or unexpected fallout.
Attachment #757063 - Flags: approval-mozilla-beta?
Attachment #757063 - Flags: approval-mozilla-beta+
Attachment #757063 - Flags: approval-mozilla-aurora?
Attachment #757063 - Flags: approval-mozilla-aurora+
Whiteboard: [WebRTC][blocking-webrtc-][leave-open][webrtc-uplift] → [WebRTC][blocking-webrtc-][leave-open]
I verified the TBPL logs for Mozilla-Beta/Win 7 and the crash does not appear. Is there anything else QA can do to manually verify this?
Not really; the patch that exposed the bug has landed; if you add the datachannel tests patch referenced above, you can see if you hit failures on Win7 opt Try builds when retriggered a number of times. True verification would be to do that, and then back this patch out and see if you see the crashes. If you don't see the crashes with this backed out, and the datachannel test in, then you can't be sure because the tests aren't hitting the intermittent. Not an easy verification.
No longer blocks: 796894
Still needs to be backed out when we update SCTP
Depends on: 916427
Comment on attachment 8389843 [details] [diff] [review] backout temporary fix for SCTP library bug (now fixed) This is just undoing the previous patch. Since we've updated the SCTP library, we should no longer need to avoid this collision.
Attachment #8389843 - Flags: review?(tuexen)
Attachment #8389843 - Flags: review?(tuexen) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/7ada8c5a0480 I had a try run that was orange due to non-related issues in the underlying repo, but all the datachannel tests ran without errors.
Whiteboard: [WebRTC][blocking-webrtc-][leave-open] → [WebRTC][blocking-webrtc-]
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Turns out the library bug was *not* fixed; backing out the backout
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [WebRTC][blocking-webrtc-] → [WebRTC][blocking-webrtc-][leave-open]
backlog: --- → webRTC+
Rank: 14
Whiteboard: [WebRTC][blocking-webrtc-][leave-open] → [leave-open]
Poke to check if we've fixed this in SCTP and can back out the wallpaper
Severity: critical → normal
Rank: 14 → 25
Flags: needinfo?(rjesup)
Priority: P1 → P2
Flags: needinfo?(rjesup)
Whiteboard: [leave-open]
Michael - is this bug fixed upstream? Or do we still need to proxy the socket close?
Flags: needinfo?(tuexen)
I'm not aware of a specific fix, since I couldn't reproduce it the problem. Keep the work around for now.
Flags: needinfo?(tuexen)
Mass change P2->P3 to align with new Mozilla triage process.
Priority: P2 → P3
Closing because no crash reported since 12 weeks.
Status: REOPENED → RESOLVED
Closed: 11 years ago6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: