Poison value crash in [@ mozilla::dom::FetchDriverObserver::Release]
Categories
(Core :: Networking, defect, P2)
Tracking
()
People
(Reporter: mccr8, Assigned: smayya)
References
(Regression)
Details
(4 keywords, Whiteboard: [necko-triaged] [necko-priority-queue][adv-main136+r][adv-esr128.8+r][adv-esr115.21+r])
Crash Data
Attachments
(2 files, 1 obsolete file)
|
48 bytes,
text/x-phabricator-request
|
dmeehan
:
approval-mozilla-beta+
dmeehan
:
approval-mozilla-esr115+
dmeehan
:
approval-mozilla-esr128+
tjr
:
sec-approval+
|
Details | Review |
|
48 bytes,
text/x-phabricator-request
|
dmeehan
:
approval-mozilla-beta+
dmeehan
:
approval-mozilla-esr115+
dmeehan
:
approval-mozilla-esr128+
tjr
:
sec-approval+
|
Details | Review |
Crash report: https://crash-stats.mozilla.org/report/index/594532a2-9267-4bd4-ab87-eacc30241208
Reason:
EXCEPTION_ACCESS_VIOLATION_READ
Top 10 frames:
0 xul.dll std::_Atomic_integral<unsigned long long, 8>::fetch_add(const unsigned long l... /builds/worker/fetches/vs/VC/Tools/MSVC/14.39.33519/include/atomic:1619
0 xul.dll std::_Atomic_integral_facade<unsigned long long>::fetch_sub(const unsigned lo... /builds/worker/fetches/vs/VC/Tools/MSVC/14.39.33519/include/atomic:1723
0 xul.dll mozilla::ThreadSafeAutoRefCnt::operator--() xpcom/base/nsISupportsImpl.h:395
0 xul.dll mozilla::dom::FetchDriverObserver::Release() dom/fetch/FetchDriver.h:55
0 xul.dll mozilla::RefPtrTraits<mozilla::dom::FetchService::FetchInstance>::Release(moz... mfbt/RefPtr.h:49
0 xul.dll RefPtr<mozilla::dom::FetchService::FetchInstance>::ConstRemovingRefPtrTraits<... mfbt/RefPtr.h:409
0 xul.dll RefPtr<mozilla::dom::FetchService::FetchInstance>::~RefPtr() mfbt/RefPtr.h:80
0 xul.dll nsBaseHashtableET<nsRefPtrHashKey<mozilla::dom::FetchServicePromises>, RefPtr... xpcom/ds/nsBaseHashtable.h:253
1 xul.dll PLDHashTable::RawRemove(PLDHashTable::Slot&) xpcom/ds/PLDHashTable.cpp:575
1 xul.dll PLDHashTable::RawRemove(PLDHashEntryHdr*) xpcom/ds/PLDHashTable.cpp:558
I'm not sure how actionable this is, but there's a decent volume of these. Lots on 115 but also a few on newer versions like 133. Mostly but not entirely with FetchService::Observe on the stack. It looks like we're trying to release the ref pointer to the FetchDriverService, but it has already been freed? Could the entry.Data()->Cancel(true); call or something else in here already have removed it from mFetchInstanceTable? That's the only obvious thing I can think of.
Comment 1•1 year ago
|
||
Hi Sunil,
Do you probably have an idea here?
Thanks.
Comment 2•1 year ago
|
||
The crash is happened on main thread and the stack below shows that FetchParent::RecvAbortFetchOp is called.
2 xul.dll mozilla::dom::FetchService::CancelFetch(RefPtr<mozilla::dom::FetchServicePromises> const&&, bool) dom/fetch/FetchService.cpp:900 cfi
3 xul.dll mozilla::dom::FetchParent::RecvAbortFetchOp::<lambda_11>::operator()() dom/fetch/FetchParent.cpp:269 inlined
However, this line indicates that this function should be called on the background thread.
We should at least add a MOZ_RELEASE_ASSERT(!NS_IsMainThread()); to avoid FetchParent::RecvAbortFetchOp being called on main thread.
| Assignee | ||
Comment 3•1 year ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #2)
The crash is happened on main thread and the stack below shows that
FetchParent::RecvAbortFetchOpis called.2 xul.dll mozilla::dom::FetchService::CancelFetch(RefPtr<mozilla::dom::FetchServicePromises> const&&, bool) dom/fetch/FetchService.cpp:900 cfi 3 xul.dll mozilla::dom::FetchParent::RecvAbortFetchOp::<lambda_11>::operator()() dom/fetch/FetchParent.cpp:269 inlinedHowever, this line indicates that this function should be called on the background thread.
We should at least add aMOZ_RELEASE_ASSERT(!NS_IsMainThread());to avoidFetchParent::RecvAbortFetchOpbeing called on main thread.
I dont think RecvAbortFetchOp is called on the Mainthread as PFetch Actors are always created on the background thread
3 xul.dll mozilla::dom::FetchParent::RecvAbortFetchOp::<lambda_11>::operator()() dom/fetch/FetchParent.cpp:269 inlined
I think this refers to the lambda dispatched from RecvAbortFetchOp here.
I will check this further
| Assignee | ||
Updated•1 year ago
|
Comment 4•1 year ago
|
||
I noticed that in mozilla::dom::FetchService::CancelFetch we call entry.Data()->Cancel(aForceAbort); before entry.Remove();.
But the call to Cancel goes through mozilla::dom::FetchService::FetchInstance::Cancel, mozilla::dom::FetchDriver::FetchDriverAbortActions, mozilla::dom::FetchService::FetchInstance::OnResponseEnd which also calls entry.Remove().
If I didn't get anything wrong, that means we just need to hold a refPtr of entry.Data() while calling cancel on it.
Comment 5•1 year ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) {{ PTO 21 Dec - 06 Jan }} from comment #4)
If I didn't get anything wrong, that means we just need to hold a refPtr of entry.Data() while calling cancel on it.
Or better yet, hold a refPtr to it, remove it from the hashtable, and then call cancel on the FetchInstance.
Comment 6•1 year ago
|
||
Same thing applies here:
https://searchfox.org/mozilla-central/rev/495dafe9d89894e4768afe6d978139351574cc84/dom/fetch/FetchService.cpp#832-839
mFetchInstanceTable.RemoveIf([](auto& entry) {
bool res = entry.Data()->IsLocalHostFetch();
if (res) {
return false;
}
entry.Data()->Cancel(true);
return true;
});
Since we're canceling before returning true, that same chain will remove it from the table before we return.
We need to append them to an RefPtrArray in the RemoveIf lambda, and cancel after RemoveIf returns.
Comment 7•1 year ago
|
||
(In reply to Sunil Mayya from comment #3)
(In reply to Kershaw Chang [:kershaw] from comment #2)
The crash is happened on main thread and the stack below shows that
FetchParent::RecvAbortFetchOpis called.2 xul.dll mozilla::dom::FetchService::CancelFetch(RefPtr<mozilla::dom::FetchServicePromises> const&&, bool) dom/fetch/FetchService.cpp:900 cfi 3 xul.dll mozilla::dom::FetchParent::RecvAbortFetchOp::<lambda_11>::operator()() dom/fetch/FetchParent.cpp:269 inlinedHowever, this line indicates that this function should be called on the background thread.
We should at least add aMOZ_RELEASE_ASSERT(!NS_IsMainThread());to avoidFetchParent::RecvAbortFetchOpbeing called on main thread.I dont think RecvAbortFetchOp is called on the Mainthread as PFetch Actors are always created on the background thread
3 xul.dll mozilla::dom::FetchParent::RecvAbortFetchOp::<lambda_11>::operator()() dom/fetch/FetchParent.cpp:269 inlined
I think this refers to the lambda dispatched from RecvAbortFetchOp here.
I will check this further
Yes, I think you are right. Sorry for the noise.
| Assignee | ||
Comment 8•1 year ago
|
||
Working on the fix. Will submit patch for review this week.
| Assignee | ||
Comment 9•1 year ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) {{ PTO 21 Dec - 06 Jan }} from comment #4)
I noticed that in mozilla::dom::FetchService::CancelFetch we call
entry.Data()->Cancel(aForceAbort);beforeentry.Remove();.
But the call to Cancel goes through mozilla::dom::FetchService::FetchInstance::Cancel, mozilla::dom::FetchDriver::FetchDriverAbortActions, mozilla::dom::FetchService::FetchInstance::OnResponseEnd which also callsentry.Remove().
I think with this path we shouldnt hit the double removal case as we skip removing the entry due to this check.
Calling FetchService::OnResponseEnd() with reason other than eAborted should cause removal of the entry.
From the code path looks like cancelling the channel from here could trigger nsHttpChannel::OnStopRequest thus resulting in FetchService::OnResponseEnd() being called with end reason eByNetworking which could remove the entry here
In the current scenario we are triggering FetchService::OnResponseEnd() twice, once due to the abort operation and second due to the cancelling of the channel.
From Bug 1810805, we defferred assigning nullptr to mObserver to OnStopRequest, this seems to have triggered the second FetchService::OnResponseEnd().
Looking at the crash signature, we are seeing the crash after the fix for Bug 1810805 had been merged.
We need to ensure that FetchService::OnResponseEnd() is called only once. Our fix should be around this.
Comment 10•1 year ago
|
||
(In reply to Sunil Mayya from comment #9)
From Bug 1810805, we defferred assigning nullptr to
mObserverto OnStopRequest, this seems to have triggered the second FetchService::OnResponseEnd().
Looking at the crash signature, we are seeing the crash after the fix for Bug 1810805 had been merged.We need to ensure that FetchService::OnResponseEnd() is called only once. Our fix should be around this.
Sounds good.
| Assignee | ||
Comment 11•1 year ago
|
||
Currently running tests to confirm our theory.
| Assignee | ||
Comment 12•1 year ago
|
||
| Assignee | ||
Comment 13•1 year ago
•
|
||
We have posted a diagnistic patch for review and review is under progress
| Assignee | ||
Comment 14•1 year ago
|
||
| Assignee | ||
Comment 15•1 year ago
|
||
Updated•1 year ago
|
| Assignee | ||
Comment 16•1 year ago
|
||
We have clear indications that this is a regression from Bug 1810805 .
We will revert our changes for Bug 1810805 and fix the race using a suitable synchronization mechanism.
This should fix both the issues.
Patch with the aforementioned fix has been posted for review - https://phabricator.services.mozilla.com/D235761
Comment 17•1 year ago
|
||
The fix for bug 1810805 landed in 121/115.6, but I don't see any crashes with this signature in 115.6 or 115.7. It's possible there just weren't enough users of those versions, or maybe something changed in 115.8 to make this crash more likely. Either way we will need to back-port the fix for this to both ESR branches. Because of that this is not a candidate for the mid-cycle point-release.
The crash was more frequent for ESR-115 users. The reason the volume of crashes goes down after 2024-10-01 is because we migrated most ESR-115 users to ESR-128 with that release. This is visible at
https://crash-stats.mozilla.org/signature/?signature=mozilla%3A%3Adom%3A%3AFetchDriverObserver%3A%3ARelease&date=%3E%3D2024-07-28T21%3A41%3A00.000Z&date=%3C2025-01-28T21%3A41%3A00.000Z#graphs
if you enter "Major version" in the drop-down.
Updated•1 year ago
|
| Assignee | ||
Comment 18•1 year ago
|
||
Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko
Security Approval Request
- How easily could an exploit be constructed based on the patch?: Not easily.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
- Which branches (beta, release, and/or ESR) are affected by this flaw, and do the release status flags reflect this affected/unaffected state correctly?: beta, release, ESR and yes
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: No
- If not, how different, hard to create, and risky will they be?: they should be straightforward uplift
- How likely is this patch to cause regressions; how much testing does it need?: the patch backouts the regressing patch and fixes the original race issue with the regressing Bug.
- Is the patch ready to land after security approval is given?: Yes
- Is Android affected?: Yes
Comment 19•1 year ago
|
||
Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko
Approved to land. It's not clear if both patches are needed; if so they should both be flagged for sec-approval; but I looked at the other one and it is okay to land and request uplift also.
| Assignee | ||
Comment 20•1 year ago
|
||
Comment on attachment 9462181 [details]
Bug 1938471 - synchronize access to FetchDriver::mObserver. r=#necko
Security Approval Request
- How easily could an exploit be constructed based on the patch?: This is not straightforward as it fixes a race condition meant for the regressing bug.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
- Which branches (beta, release, and/or ESR) are affected by this flaw, and do the release status flags reflect this affected/unaffected state correctly?: yes
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: No
- If not, how different, hard to create, and risky will they be?: The patches could be constructed straightforward. We dont see any major risk
- How likely is this patch to cause regressions; how much testing does it need?: The patch does not introduce new behaviour, but rather introduces a syncrhonization mechanism for the data member.
- Is the patch ready to land after security approval is given?: Yes
- Is Android affected?: Yes
| Assignee | ||
Comment 21•1 year ago
|
||
(In reply to Tom Ritter [:tjr] from comment #19)
Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#neckoApproved to land. It's not clear if both patches are needed; if so they should both be flagged for sec-approval; but I looked at the other one and it is okay to land and request uplift also.
Thanks for correcting Tom.
Both patches require sec-approval, and I have just flagged them to ensure compliance with the process.
Updated•1 year ago
|
Comment 22•1 year ago
|
||
Set release status flags based on info from the regressing bug 1810805
Comment 23•1 year ago
|
||
Comment 24•1 year ago
|
||
https://hg.mozilla.org/mozilla-central/rev/c121a140cb3e
https://hg.mozilla.org/mozilla-central/rev/60cf06a5ac6a
Comment 25•1 year ago
|
||
The patch landed in nightly and beta is affected.
:smayya, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox136towontfix.
For more information, please visit BugBot documentation.
Comment 26•1 year ago
|
||
To add to comment 25, please also add uplift requests for esr115 and esr128
| Assignee | ||
Comment 27•1 year ago
|
||
Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko
Beta/Release Uplift Approval Request
- User impact if declined/Reason for urgency: crash
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The patch backs out a recently introduced change hence restoring into a stable version.
- String changes made/needed: No
- Is Android affected?: Yes
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration:
- User impact if declined: crash
- Fix Landed on Version:
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The patch backs out a recently introduced change hence restoring into a stable version.
| Assignee | ||
Comment 28•1 year ago
|
||
Comment on attachment 9462181 [details]
Bug 1938471 - synchronize access to FetchDriver::mObserver. r=#necko
Beta/Release Uplift Approval Request
- User impact if declined/Reason for urgency: crash
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The patch introduces a synchronization mechanism around a member variable. This does not introduce new functional behaviour.
- String changes made/needed:
- Is Android affected?: Yes
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration:
- User impact if declined: undefined behaviour
- Fix Landed on Version:
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The patch introduces a synchronization mechanism around a member variable. This does not introduce new functional behaviour.
Comment 29•1 year ago
|
||
Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko
Approved for 136.0b3
Comment 30•1 year ago
|
||
Comment on attachment 9462181 [details]
Bug 1938471 - synchronize access to FetchDriver::mObserver. r=#necko
Approved for 136.0b3
Comment 31•1 year ago
|
||
| uplift | ||
Updated•1 year ago
|
Updated•1 year ago
|
Comment 32•1 year ago
|
||
Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko
Approved for 128.8esr
Comment 33•1 year ago
|
||
Comment on attachment 9462181 [details]
Bug 1938471 - synchronize access to FetchDriver::mObserver. r=#necko
Approved for 128.8esr
Comment 34•1 year ago
|
||
| uplift | ||
Updated•1 year ago
|
Comment 35•1 year ago
|
||
Comment on attachment 9462180 [details]
Bug 1938471 - Backed out changeset dcfa4149aaf3. r=#necko
Approved for 115.21esr
Comment 36•1 year ago
|
||
Comment on attachment 9462181 [details]
Bug 1938471 - synchronize access to FetchDriver::mObserver. r=#necko
Approved for 115.21esr
Comment 37•1 year ago
|
||
| uplift | ||
Updated•1 year ago
|
Updated•1 year ago
|
Updated•9 months ago
|
Description
•