Closed Bug 1745972 Opened 4 years ago Closed 3 years ago

Crash in [@ nsTArray_Impl<T>::Compare<T> | NS_QuickSort | XUL@0x79486f | mozilla::net::CacheIndex::FrecencyArray::SortIfNeeded]

Categories

(Core :: Networking: Cache, defect, P1)

Unspecified
macOS
defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox-esr102 - wontfix
firefox109 --- unaffected
firefox110 --- unaffected
firefox111 --- unaffected

People

(Reporter: kershaw, Assigned: edgul)

References

Details

(Keywords: crash, csectype-uaf, sec-high, Whiteboard: [necko-triaged][necko-priority-review])

Crash Data

Attachments

(3 files, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #1662676 +++

Crash report: https://crash-stats.mozilla.org/report/index/d634ed01-a060-4993-a124-2734e0200901

Top 10 frames of crashing thread:

0 XUL int nsTArray_Impl<mozilla::net::CacheIndexRecord*, nsTArrayInfallibleAllocator>::Compare<detail::CompareWrapper<mozilla::net:: xpcom/ds/nsTArray.h:2255
1 XUL NS_QuickSort xpcom/ds/nsQuickSort.cpp:117
2 XUL XUL@0x79486f 
3 XUL mozilla::net::CacheIndex::FrecencyArray::SortIfNeeded netwerk/cache2/CacheIndex.cpp:3401
4 XUL mozilla::net::CacheIndexEntryAutoManage::~CacheIndexEntryAutoManage netwerk/cache2/CacheIndex.cpp:108
5 XUL mozilla::net::CacheIndex::MergeJournal netwerk/cache2/CacheIndex.cpp:95
6 XUL mozilla::net::CacheIndex::FinishRead netwerk/cache2/CacheIndex.cpp:2537
7 XUL mozilla::net::CacheIndex::OnDataRead netwerk/cache2/CacheIndex.cpp:3667
8 XUL mozilla::net::ReadEvent::Run netwerk/cache2/CacheFileIOManager.cpp:700
9 XUL mozilla::net::CacheIOThread::LoopOneLevel netwerk/cache2/CacheIOThread.cpp:538

For a Windows crash report I see these inline frames with VS:

[Inline Frame] xul.dll!mozilla::net::`anonymous namespace'::FrecencyComparator::Equals(mozilla::net::CacheIndexRecord * a, mozilla::net::CacheIndexRecord * b) Line 47
	at /builds/worker/checkouts/gecko/netwerk/cache2/CacheIndex.cpp(47)
[Inline Frame] xul.dll!detail::CompareWrapper<mozilla::net::(anonymous namespace)::FrecencyComparator,mozilla::net::CacheIndexRecord *,0>::Equals(mozilla::net::CacheIndexRecord * const & aLeft, mozilla::net::CacheIndexRecord * const & aRight) Line 928
	at /builds/worker/workspace/obj-build/dist/include/nsTArray.h(928)
[Inline Frame] xul.dll!detail::CompareWrapper<mozilla::net::(anonymous namespace)::FrecencyComparator,mozilla::net::CacheIndexRecord *,0>::Compare(mozilla::net::CacheIndexRecord * const & aLeft, mozilla::net::CacheIndexRecord * const & aRight) Line 920
	at /builds/worker/workspace/obj-build/dist/include/nsTArray.h(920)
[Inline Frame] xul.dll!nsTArray_Impl<mozilla::net::CacheIndexRecord *,nsTArrayInfallibleAllocator>::Compare(const void * aE1, const void * aE2, void * aData) Line 2300
	at /builds/worker/workspace/obj-build/dist/include/nsTArray.h(2300)
xul.dll!NS_QuickSort(void * a, unsigned int n, unsigned int es, int(*)(const void *, const void *, void *) cmp, void * data) Line 134
	at /builds/worker/checkouts/gecko/xpcom/ds/nsQuickSort.cpp(134)
[Inline Frame] xul.dll!nsTArray_Impl<mozilla::net::CacheIndexRecord *,nsTArrayInfallibleAllocator>::Sort(const mozilla::net::`anonymous namespace'::FrecencyComparator & aComp) Line 2310
	at /builds/worker/workspace/obj-build/dist/include/nsTArray.h(2310)
xul.dll!mozilla::net::CacheIndex::FrecencyArray::SortIfNeeded() Line 3401
	at /builds/worker/checkouts/gecko/netwerk/cache2/CacheIndex.cpp(3401)

This looks like a UAF on some of the elements within FrecencyArray::mRecs

The crash is still happening, so open a new bug to track this crash.

Whiteboard: [sec-survey][post-critsmash-triage][adv-main91+r][adv-esr78.13+r] → [sec-survey][post-critsmash-triage][adv-main91+r][adv-esr78.13+r][necko-triaged]
Whiteboard: [sec-survey][post-critsmash-triage][adv-main91+r][adv-esr78.13+r][necko-triaged] → [necko-triaged]
Assignee: nobody → nhnt11
Status: NEW → ASSIGNED

Won't have time to look at this this month.

Assignee: nhnt11 → nobody
Status: ASSIGNED → NEW

Randell's work in bug 1747439 and bug 1746875 could also fix this.
Let's watch if the crash rate goes down.

I'll also take a look again.

Assignee: nobody → kershaw
See Also: → 1747439

Update: we are still waiting to see if bug 1747439 fixes this.
Currently, there is no crash after beta 97.0b9.

Looks like we're still seeing crashes with Release 97.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #5)

Looks like we're still seeing crashes with Release 97.

Thanks. I am aware of this and I am trying to have a patch to fix this.

The idea in this patch is like D99680, but the difference is that mFrecencyArray now contains CacheIndexRecordWrapper.
This should be able to help us track why CacheIndexRecordWrapper is released before removing it from mFrecencyArray.

Keywords: leave-open

Comment on attachment 9265135 [details]
Bug 1745972 - Check if CacheIndexRecordWrapper is still in mFrecencyArray, r=#necko

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Probably not easy. This is just a diagnostic patch.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: all
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?: Should be able to uplift cleanly.
  • How likely is this patch to cause regressions; how much testing does it need?: Low. This patch only adds a diagnostic assertion.
Attachment #9265135 - Flags: sec-approval?

Hi Tom,

Could you take a look at this sec-approval?

thanks.

Flags: needinfo?(tom)

Since the request we've been past the last uplift window; sec-approvals will resume once the tree reopens.

Flags: needinfo?(tom)

Comment on attachment 9265135 [details]
Bug 1745972 - Check if CacheIndexRecordWrapper is still in mFrecencyArray, r=#necko

Approved to land and if desired, uplift.

Attachment #9265135 - Flags: sec-approval? → sec-approval+

Comment on attachment 9265135 [details]
Bug 1745972 - Check if CacheIndexRecordWrapper is still in mFrecencyArray, r=#necko

Beta/Release Uplift Approval Request

  • User impact if declined: No impact. This is a diagnostic patch.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: N/A
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch only adds a diagnostic assertion.
  • String changes made/needed: N/A
Attachment #9265135 - Flags: approval-mozilla-beta?

Comment on attachment 9265135 [details]
Bug 1745972 - Check if CacheIndexRecordWrapper is still in mFrecencyArray, r=#necko

Approved for 99.0b3. Thanks.

Attachment #9265135 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9265135 [details]
Bug 1745972 - Check if CacheIndexRecordWrapper is still in mFrecencyArray, r=#necko

Cleared approval-mozilla-beta to remove this from the "missed uplift" queries.

Attachment #9265135 - Flags: approval-mozilla-beta+

Status update: we are still waiting for this diagnostic assertion to be triggered.
There is no crash report for this so far.

Status update: After significant research and discussion with Kershaw, we intend on moving CacheIndexRecordWrapper deletion to the Cache IO thread to eliminate the possibility that the main thread deletions are causing some kind of race condition during sorting. Hoping to get to this next week.

Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-queue]
Assignee: kershaw → eguloien
Attachment #9285377 - Attachment description: WIP: Bug 1745972 - Moved cache cleanup on shutdown to cacheIO thread → Bug 1745972 - Moved cache wrapper deletion to cacheIO thread r?kershaw
Attachment #9285377 - Attachment description: Bug 1745972 - Moved cache wrapper deletion to cacheIO thread r?kershaw → Bug 1745972 - Dispatch cache wrapper deletion to caller thread to acquire lock before deletion and safety-removal from frecency array. r?kershaw

Comment on attachment 9285377 [details]
Bug 1745972 - Dispatch cache wrapper deletion to caller thread to acquire lock before deletion and safety-removal from frecency array. r?kershaw

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Difficult, I cannot seem to replicate this.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: All
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: No
  • If not, how different, hard to create, and risky will they be?: Shouldn't need to uplift
  • How likely is this patch to cause regressions; how much testing does it need?: Low.
  • Is Android affected?: Yes
Attachment #9285377 - Flags: sec-approval?

Tom, can we get sec approval on this again?

Flags: needinfo?(tom)

Comment on attachment 9285377 [details]
Bug 1745972 - Dispatch cache wrapper deletion to caller thread to acquire lock before deletion and safety-removal from frecency array. r?kershaw

Approved to land and uplift if needed

Flags: needinfo?(tom)
Attachment #9285377 - Flags: sec-approval? → sec-approval+

Dispatch cache wrapper deletion to caller thread to acquire lock before deletion and safety-removal from frecency array. r=kershaw,necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/c1b26d78f5d1fbd7392fee3fb646d8c99016c2a2
https://hg.mozilla.org/mozilla-central/rev/c1b26d78f5d1

Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged][necko-priority-review]

Moving to necko review queue to monitor if fix works.

Did you want to uplift this to Beta so it can go out with the 105 release in a couple weeks instead of waiting until October for 106 to ship? Go ahead and nominate the patch for approval if yes.

Flags: needinfo?(edgul)

Comment on attachment 9285377 [details]
Bug 1745972 - Dispatch cache wrapper deletion to caller thread to acquire lock before deletion and safety-removal from frecency array. r?kershaw

Beta/Release Uplift Approval Request

  • User impact if declined: Low. Most users don't observe the original crash
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Low impact to users. Only safety precautions added.
  • String changes made/needed: None
  • Is Android affected?: Yes
Flags: needinfo?(edgul)
Attachment #9285377 - Flags: approval-mozilla-beta?

Comment on attachment 9285377 [details]
Bug 1745972 - Dispatch cache wrapper deletion to caller thread to acquire lock before deletion and safety-removal from frecency array. r?kershaw

Approved for 105.0b6.

Attachment #9285377 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9285377 [details]
Bug 1745972 - Dispatch cache wrapper deletion to caller thread to acquire lock before deletion and safety-removal from frecency array. r?kershaw

Landed on Beta for 105.0b6. Clearing the approval flag to get this off the needs-uplift radar.
https://hg.mozilla.org/releases/mozilla-beta/rev/4748ec75e338

Attachment #9285377 - Flags: approval-mozilla-beta+

I found this crash report that shows there is a crash when deleting mWrapper.

I also noticed that the sLock sometimes is unlocked when doing I/O, so maybe we should use another lock to protect CacheIndexRecordWrapper.

Noted. Will try to take another look next week.

Whiteboard: [necko-triaged][necko-priority-review] → [necko-triaged][necko-priority-queue]

Under investigation. re: comment 29

So, looking at https://crash-stats.mozilla.org/report/index/07b6144c-c48a-442b-bb2e-4c9380221010#tab-details

We crash on this line which makes me think that we have an ownership problem here.

mIndex->mFrecencyArray.RemoveRecord(entry->mRec, mProofOfLock); actually release mRec? (we don't seem to hold a ref to it.
Maybe the simple fix is to have a RefPtr<CacheIndexRecordWrapper> rec = entry->mRec; then use that instead of entry->mRec.

I haven't fully confirmed that's the problem, but that code seems a bit fishy to me.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #32)

mIndex->mFrecencyArray.RemoveRecord(entry->mRec, mProofOfLock); actually release mRec?

It seems to be the case that we null out mRecs[idx], where idx comes from entry->mRec. Being mRecs an nsTArray<RefPtr<CacheIndexRecordWrapper>> an array of RefPtr this might delete something unexpectedly, indeed.

Not sure about the level of indirection here, though: Keeping a RefPtr for entry->mRec might actually not refer to the same object we potentially delete?

A few things to note here.
We can note some key differences between the crash report in comment 32 and here.

Respectively:

  1. EXCEPTION_ACCESS_VIOLATION_READ vs EXCEPTION_ACCESS_VIOLATION_WRITE
  2. Crash line in ~CacheIndexEntryAutoManage:
    here vs here
  3. Crash in NS_QuickSort vs std::min, which contains call to Length(), which calls mRecs's Length()

So [1] and [3] seem to suggest that just touching mRecs, could this be multiple thread access?

While [2] seems to support with ownership/deletion theory, being another occurrence upon removal.
Along these lines I'm very suspicious of the mIndex which doesn't appear to ever remove it's entries, just label them as removed. If this ever got out of sync after a manual mRecs[idx] = nullptr; then the FindEntry() could pass back a live raw pointer to garbage. Need more digging here, but this smells like the right direction.

I'm not sure if this is the same root cause, but this assert crash seems to indicate that we are failing to find an item in the frencency array during removal after it has already been removed from the entry list.
https://crash-stats.mozilla.org/report/index/85ca9316-79b1-4728-9c86-f3d9b0221030

Since the RefPtr is already being held by mOldRecord in this case, it may also suggest the idea to protect ourselves with RefPtr<CacheIndexRecordWrapper> rec = entry->mRec; may not be enough.

Also the patch above adds what may be too simplistic a test, but perhaps suggests that the nsTarray<..> mRecs is not the reason for a double delete.

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(kershaw)

Randell has been adding some more thread safety annotations in https://phabricator.services.mozilla.com/D160392, which may help here.

Flags: needinfo?(valentin.gosu)

Randell's patch should only affect compile time error detection, so we shouldn't expect any fixes here. But reviewing the added protections may inspire some ideas. Will try to review this week.

Depends on D161215

Added basic xpcshell-test that makes many requests to different paths of httpd server. Tested on linux and windows without any luck eliciting the crash.

Some limitations exist in the current test, likely due to the test setup/framework. For example, the test fails for huge numbers of requests (>10000 and with --verify). Maybe there is room to increase the number of runnable tests to improve the likelihood of catching the crash with automation?

Additional testing doesn't seem like a viable approach here given how difficult the issue has been to reproduce.

I am currently working on a diagnostic patch to improve our understanding of the suspicious items.

Attachment #9301876 - Attachment description: WIP: Bug 1745972 - Added some temporary testing of nsTArray of CacheIndexRecordWrapper → Bug 1745972 - Added assertions to verify additions and removals of mRecs. r=kershaw

"Quick" update.

Changes in the above diagnostic patch will help validate:

  • Correct item is removed from mRecs on release builds
  • Items are removed from mRecs during resizing are indeed null first on release builds
  • Duplicate appends to mRecs will not occur on release builds

Some other considerations:

  • Could lock is drop for potentially blocking IO operations be a factor here? I took a quick look, and didn’t see anything during the lock drop that touches mRecs. So this seems unlikely.
  • mRecs.IndexOf seems to return an unexpected NoIndex. Maybe it could return an incorrect item that gets deleted too early or twice? The assumptions of E in nsTArray<E> appear to hold: RefPtr has copy-constructor, operator< and operator== are defined in FrecencyComparator and (I’m pretty sure(?)) RefPtr is trivially copyable. If we assume all modifications to mRecs (justification below) then is there any other way IndexOf could return an incorrect value? This commit may help narrow down if this is a problem to begin with.
  • Could an item be destroyed and re-added? It looks like there are situations where we remove and then re-add an item to mRecs. But the item should be kept alive by RefPtr from what I can tell.
  • sLock.AssertCurrentThreadOwns will only verify on debug builds, which isn’t super helpful as this crash seems rare enough to only happen on release builds in the wild. I considered adding MOZ_RELEASE_ASSERT here, but we already have proof_of_lock parameters, so this didn’t seem worth the time at this point.

It may also be worth noting the following:

  • we seem to also have allocation problems, for example. May allow us to be more confident that items are being correctly added otherwise we would see these errors first.
  • This crash was also interesting. It appears to be a result of trying to directly index mRecs with a NoIndex. Though the line reference is off and I’m not sure how we got past the ASSERT. I took a look around but it looks like all direct indexing (including via the iterators) of mRecs happens under lock. So it doesn’t seem like we can label this as another thread messing with the contents of mRecs.

Any further ideas or critiques of my assumptions would be greatly appreciated.

Comment on attachment 9301876 [details]
Bug 1745972 - Added assertions to verify additions and removals of mRecs. r=kershaw

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Difficult, this is diagnostic
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: All
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?:
  • How likely is this patch to cause regressions; how much testing does it need?: Low, this is diagnostic
  • Is Android affected?: Unknown
Attachment #9301876 - Flags: sec-approval?
Flags: needinfo?(kershaw)

Hey Tom, can we get security approval on this when you get a minute?

Flags: needinfo?(tom)

Comment on attachment 9301876 [details]
Bug 1745972 - Added assertions to verify additions and removals of mRecs. r=kershaw

Yes; tree just opened up so approved to land.

Flags: needinfo?(tom)
Attachment #9301876 - Flags: sec-approval? → sec-approval+

Comment on attachment 9301876 [details]
Bug 1745972 - Added assertions to verify additions and removals of mRecs. r=kershaw

Beta/Release Uplift Approval Request

  • User impact if declined: This is diagnostic. Crash will persist longer without additional information to help debug.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Diagnostic
  • String changes made/needed: None
  • Is Android affected?: Yes
Attachment #9301876 - Flags: approval-mozilla-beta?
Attachment #9304992 - Attachment is obsolete: true
Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged][necko-priority-review]

Put back in review for monitoring.

Comment on attachment 9301876 [details]
Bug 1745972 - Added assertions to verify additions and removals of mRecs. r=kershaw

Approved for 109.0b2.

Attachment #9301876 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9301876 [details]
Bug 1745972 - Added assertions to verify additions and removals of mRecs. r=kershaw

Landed on Beta. Removing the approval to get it off the needs-uplift radar.
https://hg.mozilla.org/releases/mozilla-beta/rev/91741e5269e0

Attachment #9301876 - Flags: approval-mozilla-beta+

[Tracking Requested - why for this release]:
Active crashes in 102esr
No crashes past 105, so likely fixed in 106. Someone should survey likely candidates from 106 for backport to 102esr

Crash Signature: mozilla::net::CacheIndex::FrecencyArray::SortIfNeeded] → mozilla::net::CacheIndex::FrecencyArray::SortIfNeeded] [@ nsTArray_Impl<T>::Sort(mozilla::net::(anonymous namespace)::FrecencyComparator const&) | mozilla::net::CacheIndex::FrecencyArray::SortIfNeeded ]
Blocks: 1812420

Closing as the discussion around this issue is moving to Bug 1812420. The new bug will aim to also investigate potential signature changes.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(edgul)
Resolution: --- → INCOMPLETE

Given bug 1812420, it doesn't sound like there's anything actionable for ESR to do in this bug.

Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: