Closed Bug 1938471 Opened 1 year ago Closed 1 year ago

Poison value crash in [@ mozilla::dom::FetchDriverObserver::Release]

Categories

(Core :: Networking, defect, P2)

Unspecified
Windows 11
defect

Tracking

()

RESOLVED FIXED
137 Branch
Tracking Status
firefox-esr115 136+ fixed
firefox-esr128 136+ fixed
firefox134 --- wontfix
firefox135 --- wontfix
firefox136 + fixed
firefox137 + fixed

People

(Reporter: mccr8, Assigned: smayya)

References

(Regression)

Details

(4 keywords, Whiteboard: [necko-triaged] [necko-priority-queue][adv-main136+r][adv-esr128.8+r][adv-esr115.21+r])

Crash Data

Attachments

(2 files, 1 obsolete file)

Crash report: https://crash-stats.mozilla.org/report/index/594532a2-9267-4bd4-ab87-eacc30241208

Reason:

EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames:

0  xul.dll  std::_Atomic_integral<unsigned long long, 8>::fetch_add(const unsigned long l...  /builds/worker/fetches/vs/VC/Tools/MSVC/14.39.33519/include/atomic:1619
0  xul.dll  std::_Atomic_integral_facade<unsigned long long>::fetch_sub(const unsigned lo...  /builds/worker/fetches/vs/VC/Tools/MSVC/14.39.33519/include/atomic:1723
0  xul.dll  mozilla::ThreadSafeAutoRefCnt::operator--()  xpcom/base/nsISupportsImpl.h:395
0  xul.dll  mozilla::dom::FetchDriverObserver::Release()  dom/fetch/FetchDriver.h:55
0  xul.dll  mozilla::RefPtrTraits<mozilla::dom::FetchService::FetchInstance>::Release(moz...  mfbt/RefPtr.h:49
0  xul.dll  RefPtr<mozilla::dom::FetchService::FetchInstance>::ConstRemovingRefPtrTraits<...  mfbt/RefPtr.h:409
0  xul.dll  RefPtr<mozilla::dom::FetchService::FetchInstance>::~RefPtr()  mfbt/RefPtr.h:80
0  xul.dll  nsBaseHashtableET<nsRefPtrHashKey<mozilla::dom::FetchServicePromises>, RefPtr...  xpcom/ds/nsBaseHashtable.h:253
1  xul.dll  PLDHashTable::RawRemove(PLDHashTable::Slot&)  xpcom/ds/PLDHashTable.cpp:575
1  xul.dll  PLDHashTable::RawRemove(PLDHashEntryHdr*)  xpcom/ds/PLDHashTable.cpp:558

I'm not sure how actionable this is, but there's a decent volume of these. Lots on 115 but also a few on newer versions like 133. Mostly but not entirely with FetchService::Observe on the stack. It looks like we're trying to release the ref pointer to the FetchDriverService, but it has already been freed? Could the entry.Data()->Cancel(true); call or something else in here already have removed it from mFetchInstanceTable? That's the only obvious thing I can think of.

Hi Sunil,

Do you probably have an idea here?

Thanks.

Severity: -- → S2
Flags: needinfo?(smayya)
Priority: -- → P2
Whiteboard: [necko-triaged] [necko-priority-queue]

The crash is happened on main thread and the stack below shows that FetchParent::RecvAbortFetchOp is called.

2 	xul.dll 	mozilla::dom::FetchService::CancelFetch(RefPtr<mozilla::dom::FetchServicePromises> const&&, bool) 	dom/fetch/FetchService.cpp:900 	cfi
3 	xul.dll 	mozilla::dom::FetchParent::RecvAbortFetchOp::<lambda_11>::operator()() 	dom/fetch/FetchParent.cpp:269 	inlined

However, this line indicates that this function should be called on the background thread.
We should at least add a MOZ_RELEASE_ASSERT(!NS_IsMainThread()); to avoid FetchParent::RecvAbortFetchOp being called on main thread.

(In reply to Kershaw Chang [:kershaw] from comment #2)

The crash is happened on main thread and the stack below shows that FetchParent::RecvAbortFetchOp is called.

2 	xul.dll 	mozilla::dom::FetchService::CancelFetch(RefPtr<mozilla::dom::FetchServicePromises> const&&, bool) 	dom/fetch/FetchService.cpp:900 	cfi
3 	xul.dll 	mozilla::dom::FetchParent::RecvAbortFetchOp::<lambda_11>::operator()() 	dom/fetch/FetchParent.cpp:269 	inlined

However, this line indicates that this function should be called on the background thread.
We should at least add a MOZ_RELEASE_ASSERT(!NS_IsMainThread()); to avoid FetchParent::RecvAbortFetchOp being called on main thread.

I dont think RecvAbortFetchOp is called on the Mainthread as PFetch Actors are always created on the background thread

3 xul.dll mozilla::dom::FetchParent::RecvAbortFetchOp::<lambda_11>::operator()() dom/fetch/FetchParent.cpp:269 inlined

I think this refers to the lambda dispatched from RecvAbortFetchOp here.

I will check this further

Assignee: nobody → smayya
Flags: needinfo?(smayya)
Flags: needinfo?(kershaw)

I noticed that in mozilla::dom::FetchService::CancelFetch we call entry.Data()->Cancel(aForceAbort); before entry.Remove();.
But the call to Cancel goes through mozilla::dom::FetchService::FetchInstance::Cancel, mozilla::dom::FetchDriver::FetchDriverAbortActions, mozilla::dom::FetchService::FetchInstance::OnResponseEnd which also calls entry.Remove().

If I didn't get anything wrong, that means we just need to hold a refPtr of entry.Data() while calling cancel on it.

(In reply to Valentin Gosu [:valentin] (he/him) {{ PTO 21 Dec - 06 Jan }} from comment #4)

If I didn't get anything wrong, that means we just need to hold a refPtr of entry.Data() while calling cancel on it.

Or better yet, hold a refPtr to it, remove it from the hashtable, and then call cancel on the FetchInstance.

Same thing applies here:
https://searchfox.org/mozilla-central/rev/495dafe9d89894e4768afe6d978139351574cc84/dom/fetch/FetchService.cpp#832-839

mFetchInstanceTable.RemoveIf([](auto& entry) {
  bool res = entry.Data()->IsLocalHostFetch();
  if (res) {
    return false;
  }
  entry.Data()->Cancel(true);
  return true;
});

Since we're canceling before returning true, that same chain will remove it from the table before we return.
We need to append them to an RefPtrArray in the RemoveIf lambda, and cancel after RemoveIf returns.

(In reply to Sunil Mayya from comment #3)

(In reply to Kershaw Chang [:kershaw] from comment #2)

The crash is happened on main thread and the stack below shows that FetchParent::RecvAbortFetchOp is called.

2 	xul.dll 	mozilla::dom::FetchService::CancelFetch(RefPtr<mozilla::dom::FetchServicePromises> const&&, bool) 	dom/fetch/FetchService.cpp:900 	cfi
3 	xul.dll 	mozilla::dom::FetchParent::RecvAbortFetchOp::<lambda_11>::operator()() 	dom/fetch/FetchParent.cpp:269 	inlined

However, this line indicates that this function should be called on the background thread.
We should at least add a MOZ_RELEASE_ASSERT(!NS_IsMainThread()); to avoid FetchParent::RecvAbortFetchOp being called on main thread.

I dont think RecvAbortFetchOp is called on the Mainthread as PFetch Actors are always created on the background thread

3 xul.dll mozilla::dom::FetchParent::RecvAbortFetchOp::<lambda_11>::operator()() dom/fetch/FetchParent.cpp:269 inlined

I think this refers to the lambda dispatched from RecvAbortFetchOp here.

I will check this further

Yes, I think you are right. Sorry for the noise.

Flags: needinfo?(kershaw)

Working on the fix. Will submit patch for review this week.

(In reply to Valentin Gosu [:valentin] (he/him) {{ PTO 21 Dec - 06 Jan }} from comment #4)

I noticed that in mozilla::dom::FetchService::CancelFetch we call entry.Data()->Cancel(aForceAbort); before entry.Remove();.
But the call to Cancel goes through mozilla::dom::FetchService::FetchInstance::Cancel, mozilla::dom::FetchDriver::FetchDriverAbortActions, mozilla::dom::FetchService::FetchInstance::OnResponseEnd which also calls entry.Remove().

I think with this path we shouldnt hit the double removal case as we skip removing the entry due to this check.

Calling FetchService::OnResponseEnd() with reason other than eAborted should cause removal of the entry.
From the code path looks like cancelling the channel from here could trigger nsHttpChannel::OnStopRequest thus resulting in FetchService::OnResponseEnd() being called with end reason eByNetworking which could remove the entry here
In the current scenario we are triggering FetchService::OnResponseEnd() twice, once due to the abort operation and second due to the cancelling of the channel.

From Bug 1810805, we defferred assigning nullptr to mObserver to OnStopRequest, this seems to have triggered the second FetchService::OnResponseEnd().
Looking at the crash signature, we are seeing the crash after the fix for Bug 1810805 had been merged.

We need to ensure that FetchService::OnResponseEnd() is called only once. Our fix should be around this.

Flags: needinfo?(valentin.gosu)

(In reply to Sunil Mayya from comment #9)

From Bug 1810805, we defferred assigning nullptr to mObserver to OnStopRequest, this seems to have triggered the second FetchService::OnResponseEnd().
Looking at the crash signature, we are seeing the crash after the fix for Bug 1810805 had been merged.

We need to ensure that FetchService::OnResponseEnd() is called only once. Our fix should be around this.

Sounds good.

Flags: needinfo?(valentin.gosu)

Currently running tests to confirm our theory.

We have posted a diagnistic patch for review and review is under progress

Attachment #9459857 - Attachment is obsolete: true

We have clear indications that this is a regression from Bug 1810805 .
We will revert our changes for Bug 1810805 and fix the race using a suitable synchronization mechanism.
This should fix both the issues.
Patch with the aforementioned fix has been posted for review - https://phabricator.services.mozilla.com/D235761

Regressions: 1810805

The fix for bug 1810805 landed in 121/115.6, but I don't see any crashes with this signature in 115.6 or 115.7. It's possible there just weren't enough users of those versions, or maybe something changed in 115.8 to make this crash more likely. Either way we will need to back-port the fix for this to both ESR branches. Because of that this is not a candidate for the mid-cycle point-release.

The crash was more frequent for ESR-115 users. The reason the volume of crashes goes down after 2024-10-01 is because we migrated most ESR-115 users to ESR-128 with that release. This is visible at
https://crash-stats.mozilla.org/signature/?signature=mozilla%3A%3Adom%3A%3AFetchDriverObserver%3A%3ARelease&date=%3E%3D2024-07-28T21%3A41%3A00.000Z&date=%3C2025-01-28T21%3A41%3A00.000Z#graphs
if you enter "Major version" in the drop-down.

Regressed by: 1810805
No longer regressions: 1810805

Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Not easily.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which branches (beta, release, and/or ESR) are affected by this flaw, and do the release status flags reflect this affected/unaffected state correctly?: beta, release, ESR and yes
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: No
  • If not, how different, hard to create, and risky will they be?: they should be straightforward uplift
  • How likely is this patch to cause regressions; how much testing does it need?: the patch backouts the regressing patch and fixes the original race issue with the regressing Bug.
  • Is the patch ready to land after security approval is given?: Yes
  • Is Android affected?: Yes
Attachment #9462180 - Flags: sec-approval?

Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko

Approved to land. It's not clear if both patches are needed; if so they should both be flagged for sec-approval; but I looked at the other one and it is okay to land and request uplift also.

Attachment #9462180 - Flags: sec-approval? → sec-approval+

Comment on attachment 9462181 [details]
Bug 1938471 - synchronize access to FetchDriver::mObserver. r=#necko

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: This is not straightforward as it fixes a race condition meant for the regressing bug.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which branches (beta, release, and/or ESR) are affected by this flaw, and do the release status flags reflect this affected/unaffected state correctly?: yes
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: No
  • If not, how different, hard to create, and risky will they be?: The patches could be constructed straightforward. We dont see any major risk
  • How likely is this patch to cause regressions; how much testing does it need?: The patch does not introduce new behaviour, but rather introduces a syncrhonization mechanism for the data member.
  • Is the patch ready to land after security approval is given?: Yes
  • Is Android affected?: Yes
Attachment #9462181 - Flags: sec-approval?

(In reply to Tom Ritter [:tjr] from comment #19)

Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko

Approved to land. It's not clear if both patches are needed; if so they should both be flagged for sec-approval; but I looked at the other one and it is okay to land and request uplift also.

Thanks for correcting Tom.
Both patches require sec-approval, and I have just flagged them to ensure compliance with the process.

Flags: needinfo?(tom)
Flags: needinfo?(tom)
Attachment #9462181 - Flags: sec-approval? → sec-approval+

Set release status flags based on info from the regressing bug 1810805

Pushed by smayya@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c121a140cb3e Backed out changeset dcfa4149aaf3. r=necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/60cf06a5ac6a synchronize access to FetchDriver::mObserver. r=necko-reviewers,kershaw
Group: network-core-security → core-security-release
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 137 Branch

The patch landed in nightly and beta is affected.
:smayya, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox136 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(smayya)

To add to comment 25, please also add uplift requests for esr115 and esr128

Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko

Beta/Release Uplift Approval Request

  • User impact if declined/Reason for urgency: crash
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch backs out a recently introduced change hence restoring into a stable version.
  • String changes made/needed: No
  • Is Android affected?: Yes

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration:
  • User impact if declined: crash
  • Fix Landed on Version:
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch backs out a recently introduced change hence restoring into a stable version.
Flags: needinfo?(smayya)
Attachment #9462180 - Flags: approval-mozilla-esr128?
Attachment #9462180 - Flags: approval-mozilla-esr115?
Attachment #9462180 - Flags: approval-mozilla-beta?

Comment on attachment 9462181 [details]
Bug 1938471 - synchronize access to FetchDriver::mObserver. r=#necko

Beta/Release Uplift Approval Request

  • User impact if declined/Reason for urgency: crash
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch introduces a synchronization mechanism around a member variable. This does not introduce new functional behaviour.
  • String changes made/needed:
  • Is Android affected?: Yes

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration:
  • User impact if declined: undefined behaviour
  • Fix Landed on Version:
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch introduces a synchronization mechanism around a member variable. This does not introduce new functional behaviour.
Attachment #9462181 - Flags: approval-mozilla-esr128?
Attachment #9462181 - Flags: approval-mozilla-esr115?
Attachment #9462181 - Flags: approval-mozilla-beta?

Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko

Approved for 136.0b3

Attachment #9462180 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9462181 [details]
Bug 1938471 - synchronize access to FetchDriver::mObserver. r=#necko

Approved for 136.0b3

Attachment #9462181 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
QA Whiteboard: [post-critsmash-triage]
Flags: qe-verify-

Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko

Approved for 128.8esr

Attachment #9462180 - Flags: approval-mozilla-esr128? → approval-mozilla-esr128+

Comment on attachment 9462181 [details]
Bug 1938471 - synchronize access to FetchDriver::mObserver. r=#necko

Approved for 128.8esr

Attachment #9462181 - Flags: approval-mozilla-esr128? → approval-mozilla-esr128+

Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko

Approved for 115.21esr

Attachment #9462180 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+

Comment on attachment 9462181 [details]
Bug 1938471 - synchronize access to FetchDriver::mObserver. r=#necko

Approved for 115.21esr

Attachment #9462181 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+
Whiteboard: [necko-triaged] [necko-priority-queue] → [necko-triaged] [necko-priority-queue][adv-main136+r][adv-esr128.8+r][adv-esr115.21+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: