Open Bug 1776143 Opened 2 years ago Updated 3 months ago

Crash in [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | sctp_inpcb_free | sctp_close]

Categories

(Core :: WebRTC: Networking, defect)

Unspecified
All
defect

People

(Reporter: gsvelto, Assigned: bwc)

References

(Blocks 1 open bug)

Details

(Keywords: crash, leave-open)

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/054f6c00-b865-4941-b5ba-21ec80220623

Reason: EXCEPTION_ACCESS_VIOLATION_WRITE

Top 10 frames of crashing thread:

0 ntdll.dll RtlpWaitOnCriticalSection 
1 ntdll.dll RtlpEnterCriticalSectionContended 
2 ntdll.dll RtlEnterCriticalSection 
3 xul.dll sctp_inpcb_free netwerk/sctp/src/netinet/sctp_pcb.c:3857
4 xul.dll sctp_close netwerk/sctp/src/netinet/sctp_usrreq.c:842
5 xul.dll sofree netwerk/sctp/src/user_socket.c:287
6 xul.dll mozilla::DataChannelConnection::DestroyOnSTS netwerk/sctp/datachannel/DataChannel.cpp:399
7 xul.dll mozilla::detail::runnable_args_base<mozilla::detail::NoResult>::Run dom/media/webrtc/transport/runnable_utils.h:41
8 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:465
9 xul.dll mozilla::net::nsSocketTransportService::Run netwerk/base/nsSocketTransportService2.cpp:1202

It appears we're trying to lock a mutex that has been set to NULL. The crash seems to happen only on Windows, but bug 1775214 points to the same issue on Linux. This does not appear to be a new bug, but it was recently detected by clouseau due to a visible spike.

Alright, now that I can see the whole graph this looks like a regression introduced in version 100. The volume here is non-trivial.

There are several more signatures for this, ouch. At least one of them is on Android, so this is indeed a problem that affects all platforms, albeit with different signatures.

Crash Signature: [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | sctp_inpcb_free | sctp_close] → [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | EtwEventEnabled | sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | RtlpDeCommitFreeBloc…
OS: Windows → All
Component: Networking → WebRTC: Networking

Quite a few crashes on related signatures; there's some kind of race here, though it doesn't seem security-sensitive.

Flags: needinfo?(tuexen)
Flags: needinfo?(docfaraday)
Crash Signature: sctp_inpcb_free | sctp_close] → sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | sctp_inpcb_free | sctp_close ]
Crash Signature: sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | sctp_inpcb_free | sctp_close ] → sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | sctp_inpcb_free | sctp_close ] [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlpEnterCriticalSectionContended | sctp_inpcb_free | sctp_close ]
Crash Signature: sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | sctp_inpcb_free | sctp_close ] [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlpEnterCriticalSectionContended | sctp_inpcb_free | sctp_close ] → sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | sctp_inpcb_free | sctp_close ] [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlpEnterCriticalSectionContended | sctp_inpcb_free | sctp_close ] [@ RtlpWaitOnCriticalSect…

You might want to update to the current version. There are still a couple of known issues, but more on the receive path, not sending or closing.

Flags: needinfo?(tuexen)
Flags: needinfo?(docfaraday)
See Also: → CVE-2022-46871

Shifting deps here slightly.

Depends on: CVE-2022-46871
See Also: CVE-2022-46871

For the record, updating libusrsctp to the latest version didn't seem to affect this. Anything we can do to move this forward?

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

Is there any more information than the stack traces?

Jesup, do you have any insight here? We're kind of stalled out.

Flags: needinfo?(rjesup)

No. I'll try to look more deeply. Michael, since updating didn't help, can you look to see what possible paths might lead to this?

Flags: needinfo?(rjesup) → needinfo?(tuexen)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

Keywords: topcrash (removed)
Severity: S2 → S3
See Also: → 1848814
Duplicate of this bug: 1848814
See Also: 1848814
No longer blocks: webrtc-triage

Clear a needinfo that is pending on an inactive user.

Inactive users most likely will not respond; if the missing information is essential and cannot be collected another way, the bug should probably be closed as INCOMPLETE.

Flags: needinfo?(tuexen) (cleared)

Cleaning up the signatures after bug 1895527, adding a bunch of Android ones. The crash appears to still be valid and has significant volume; can we get someone to look into it?

Crash Signature: [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | EtwEventEnabled | sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | RtlpDeCommitFreeBloc… → [@ abort | __fortify_fatal] [@ libc.so | sctp_close] [@ libc.so | sctp_inpcb_free | mozilla::DataChannelConnection::DestroyOnSTS] [@ libc.so | sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | EtwEventEnabled | sctp_inpcb_free | sctp_close]…

Added a few more top Android signatures

Crash Signature: RtlEnterCriticalSection | sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | sctp_inpcb_free | sctp_close] → RtlEnterCriticalSection | sctp_inpcb_free | sctp_close] [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | sctp_inpcb_free | sctp_close] [@ libc.so@0x1ce72 | libc.so@0x94273 | libc.so@0x64f61 | libc.so@0x646ef | libc.so@0x94273 | libc.s…

Ok, not related, since we always build with that check true. The init and finish functions are pretty complicated, and might have holes when parts of init fail. Looking into it...
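
To illustrate the kind of hole meant by "parts of init fail", here is a minimal sketch in C of an init/finish pair that unwinds partial work on failure. Every name in it (demo_state, demo_init, demo_finish, demo_alloc, demo_fail_at) is hypothetical and made up for this example; it is not the usrsctp or DataChannel code, only the general pattern under the assumption that init acquires resources in sequence.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical subsystem state; none of these names come from usrsctp. */
struct demo_state {
    void *table;      /* first resource acquired by init */
    void *timer_heap; /* second resource acquired by init */
    bool ready;       /* set only after every init step succeeded */
};

/* Test hook (an assumption for this sketch): fail the Nth allocation.
 * A value of 0 means no allocation is forced to fail. */
static int demo_fail_at = 0;
static int demo_alloc_count = 0;

static void *demo_alloc(size_t n) {
    demo_alloc_count++;
    if (demo_alloc_count == demo_fail_at) {
        return NULL; /* simulated allocation failure */
    }
    return malloc(n);
}

/* Returns 0 on success, -1 on failure. On failure, everything acquired
 * so far is released and the state is left in a known "not ready" form,
 * so nothing leaks and finish stays safe to call. */
int demo_init(struct demo_state *s) {
    s->table = NULL;
    s->timer_heap = NULL;
    s->ready = false;

    s->table = demo_alloc(64);
    if (s->table == NULL) {
        goto fail;
    }
    s->timer_heap = demo_alloc(64);
    if (s->timer_heap == NULL) {
        goto fail;
    }
    s->ready = true;
    return 0;

fail:
    free(s->timer_heap); /* free(NULL) is a no-op */
    free(s->table);
    s->timer_heap = NULL;
    s->table = NULL;
    return -1;
}

/* Tears down only a fully initialized state; otherwise does nothing. */
void demo_finish(struct demo_state *s) {
    if (!s->ready) {
        return;
    }
    free(s->timer_heap);
    free(s->table);
    s->timer_heap = NULL;
    s->table = NULL;
    s->ready = false;
}
```

The point of the ready flag is that finish never touches members that init did not finish setting up, which is the sort of hole a multi-step init can otherwise leave open.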

I've looked at this for a while, and while there are flaws in the init/deinit functions, I'm not seeing one that would cause this specific problem on Windows. I do see that there is no error checking for the initialization of this mutex/critical section, but the documentation for InitializeCriticalSection says that it is infallible on modern versions of Windows, and this happens pretty much only on Windows and Android. It might be that this documentation is wrong or misleading.
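
For reference, a rough sketch of what error-checked lock initialization could look like. It uses POSIX pthread_mutex_init as a portable stand-in rather than InitializeCriticalSection, and the checked_lock wrapper and its functions are hypothetical names invented for this example, not anything in usrsctp or Gecko. It assumes init and destroy are serialized with respect to all users of the lock; the flag itself is not synchronized, so this would not fix a race between teardown and a late locker, it would only turn some lock-on-dead-mutex cases into an error return.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical wrapper; not part of usrsctp or Gecko. */
struct checked_lock {
    pthread_mutex_t mutex;
    bool initialized; /* true only between a successful init and destroy */
};

/* Returns 0 on success, or the pthread error code on failure. */
int checked_lock_init(struct checked_lock *l) {
    int rc = pthread_mutex_init(&l->mutex, NULL);
    l->initialized = (rc == 0);
    return rc;
}

/* Refuses to touch a mutex that was never set up or was already destroyed. */
int checked_lock_acquire(struct checked_lock *l) {
    if (!l->initialized) {
        return EINVAL;
    }
    return pthread_mutex_lock(&l->mutex);
}

int checked_lock_release(struct checked_lock *l) {
    if (!l->initialized) {
        return EINVAL;
    }
    return pthread_mutex_unlock(&l->mutex);
}

/* Destroys the mutex once; a second call reports EINVAL instead of
 * destroying it again. */
int checked_lock_destroy(struct checked_lock *l) {
    if (!l->initialized) {
        return EINVAL;
    }
    int rc = pthread_mutex_destroy(&l->mutex);
    if (rc == 0) {
        l->initialized = false;
    }
    return rc;
}
```

With something like this in place, an acquire attempted after destroy would surface as an error code (or could be wired to an assertion) rather than as a crash inside the platform's lock implementation.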

I did notice a couple of flaws in our code that muddy the waters somewhat, so I think I'll fix them and hope that we get some more clarity on what is going on here.

Assignee: nobody → docfaraday
Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/23320214ae00
Fix leak on init failure, make DataChannelRegistry non-refcounted, and add some assertions. r=ng
Keywords: leave-open