Closed Bug 1847904 Opened 2 years ago Closed 2 years ago

Crash in [@ NSSSocketControl::CloseSocketAndDestroy | mozilla::net::TLSTransportLayer::~TLSTransportLayer]

Categories

(Core :: Networking: WebSockets, defect, P1)

defect

Tracking

()

RESOLVED FIXED
118 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox-esr115 117+ fixed
firefox116 --- wontfix
firefox117 + fixed
firefox118 + fixed

People

(Reporter: aryx, Assigned: valentin)

References

Details

(4 keywords, Whiteboard: [necko-triaged][necko-priority-queue] [adv-main117+r] [adv-esr115.2+r])

Crash Data

Attachments

(1 file)

10 crashes from 5 installations of the latest Firefox 118.0a1 Nightly. No time to identify the regressing change at the moment.

Crash report: https://crash-stats.mozilla.org/report/index/48f2f18e-4793-4a91-bf4a-5433e0230809

MOZ_CRASH Reason: MOZ_DIAGNOSTIC_ASSERT(mOwningThread == PR_GetCurrentThread())

Top 10 frames of crashing thread:

0  xul.dll  NSSSocketControl::CloseSocketAndDestroy  security/manager/ssl/NSSSocketControl.cpp:474
1  xul.dll  mozilla::net::TLSTransportLayer::~TLSTransportLayer  netwerk/protocol/http/TLSTransportLayer.cpp:348
2  xul.dll  mozilla::net::TLSTransportLayer::Release  netwerk/protocol/http/TLSTransportLayer.cpp:323
3  xul.dll  nsCOMPtr<nsISocketTransport>::~nsCOMPtr  xpcom/base/nsCOMPtr.h:340
3  xul.dll  mozilla::net::CallOnTransportAvailable::~CallOnTransportAvailable  netwerk/protocol/websocket/WebSocketChannel.cpp:741
4  xul.dll  mozilla::net::CallOnTransportAvailable::~CallOnTransportAvailable  netwerk/protocol/websocket/WebSocketChannel.cpp:741
5  xul.dll  mozilla::RefPtrTraits<nsIRunnable>::Release  mfbt/RefPtr.h:54
5  xul.dll  RefPtr<nsIRunnable>::ConstRemovingRefPtrTraits<nsIRunnable>::Release  mfbt/RefPtr.h:420
5  xul.dll  RefPtr<nsIRunnable>::assign_assuming_AddRef  mfbt/RefPtr.h:73
5  xul.dll  RefPtr<nsIRunnable>::operator=  mfbt/RefPtr.h:188

The Bugbug bot thinks this bug should belong to the 'Core::Security: PSM' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Security: PSM
Product: Firefox → Core

Could this be caused by the changes in bug 1842754? The change in bug 1847604 is only flagged as affecting performance.

Full changelog for the first affected build.

Flags: needinfo?(manuel)

I don't see how bug 1842754 could be the cause, but I find it more likely that bug 1842754 causes the crash than bug 1847604.

Flags: needinfo?(manuel)

:manuel any updates since earlier?
Is it ok to set bug 1842754 as the regressor or have you determined otherwise? Wondering if it should be backed out or you have a patch in mind?

Flags: needinfo?(manuel)

I don't have a patch in mind. Let's back out bug 1842754 and see if that fixes the crashes? My further investigation only led me to
https://searchfox.org/mozilla-central/rev/dc8348b3730c0d29dafd01c653d9151eaa9bc30f/netwerk/protocol/websocket/WebSocketChannel.cpp#3722-3725
that the channel (or other object holding the TLSTransportLayer) is dropped after the dispatch, but before the dispatch is run. That way the lifetime of the (refcounted) TLSTransportLayer object end in the different thread unexpectedly.

Flags: needinfo?(manuel)

I'll ask the sheriffs to backout

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

I don't think bug 1842754 is the issue here. The reason we're seeing crashes in the wild is because bug 1847260 upgraded an assertion to a diagnostic assertion to determine if/where a thread invariant is being violated, and lo and behold, this is one of them. It looks like CallOnTransportAvailable is releasing a TLSTransportLayer on the main thread rather than the socket thread.

Component: Security: PSM → Networking: WebSockets

:(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #10)

I don't think bug 1842754 is the issue here. The reason we're seeing crashes in the wild is because bug 1847260 upgraded an assertion to a diagnostic assertion to determine if/where a thread invariant is being violated, and lo and behold, this is one of them. It looks like CallOnTransportAvailable is releasing a TLSTransportLayer on the main thread rather than the socket thread.

:keeler Thanks for the info, given this is a new top crash what's the next step for investigation/assignee for a fix?

Flags: needinfo?(dkeeler)

Valentin is the triage owner for this component.
As for a fix, should be a pretty straightforward use of ProxyRelease.

Flags: needinfo?(dkeeler) → needinfo?(valentin.gosu)

This indicates a race, so it should probably be a security bug. I'm assuming it affects all branches and we're just seeing the crashes here because of the diagnostic asserts (which won't apply to older branches, even if we uplifted the checks).

[@ NSSSocketControl::CloseSocketAndDestroy | nsSSLIOLayerClose ] looks like the same basic thing: bp-7281f805-6efa-44aa-9ae0-43d6e0230810

Crash Signature: [@ NSSSocketControl::CloseSocketAndDestroy | mozilla::net::TLSTransportLayer::~TLSTransportLayer] → [@ NSSSocketControl::CloseSocketAndDestroy | mozilla::net::TLSTransportLayer::~TLSTransportLayer] [@ NSSSocketControl::CloseSocketAndDestroy | nsSSLIOLayerClose ]

It seems TLSTransportLayer is marked NS_DECL_THREADSAFE_ISUPPORTS, but we only expect ~TLSTransportLayer to be called on the socket thread.
I'll try dispatching to the socket thread instead.

Assignee: nobody → valentin.gosu
Severity: -- → S3
Flags: needinfo?(valentin.gosu)
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-queue]

The bug is marked as tracked for firefox118 (nightly). However, the bug still has low severity.

:ghess, could you please increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(ghess)
Severity: S3 → S2
Flags: needinfo?(ghess)
Priority: P2 → P1
Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/ed7b5e4aa348 Dispatch PR_Close to socketthread in ~TLSTransportLayer r=keeler,necko-reviewers,kershaw
Duplicate of this bug: 1848780
Group: network-core-security → core-security-release
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 118 Branch
Regressions: 1848999

The patch landed in nightly and beta is affected.
:valentin, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval. Also, don't forget to request an uplift for the patches in the regression caused by this fix.
  • If no, please set status-firefox117 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(valentin.gosu)

Requested uplift in bug 1848999 instead.

Flags: needinfo?(valentin.gosu)

Comment on attachment 9348525 [details]
Bug 1847904 - Dispatch PR_Close to socketthread in ~TLSTransportLayer r=keeler

Needed for bug 1848999, approved for 117.0b9 and 115.2esr.

Attachment #9348525 - Flags: approval-mozilla-esr115+
Attachment #9348525 - Flags: approval-mozilla-beta+
QA Whiteboard: [post-critsmash-triage]
Flags: qe-verify-
Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged][necko-priority-queue] [adv-main117+r] [adv-esr115.2+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: