Closed Bug 1333568 Opened 3 years ago Closed 3 years ago

Crash in nsStringBuffer::Release - from nsLocalFile::Remove()

Categories

(Core :: Networking: Cache, defect, critical)

45 Branch
x86
Windows 10
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox-esr45 52+ fixed
firefox51 --- wontfix
firefox52 + fixed
firefox-esr52 --- fixed
firefox53 + fixed
firefox54 + fixed

People

(Reporter: jesup, Assigned: mayhemer)

Details

(Keywords: crash, csectype-uaf, sec-high, Whiteboard: [post-critsmash-triage][adv-main52+][adv-esr45.8+])

Crash Data

Attachments

(2 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-d7ce5d3c-e18e-4b70-802d-41def2170124.
=============================================================

nsLocalFile::Remove() calls MakeDirty(), which is crashing with a UAF in mShortWorkingPath.Truncate().

This particular crash goes back to at least 47.  Many of the other related crashes may be due to the same base problem; perhaps something is failing to hold a ref to an nsLocalFile object and it's getting destroyed while still in use.

See https://crash-stats.mozilla.com/report/index/1e67d304-5a00-456a-ac79-64e362170118 (from 47) and https://crash-stats.mozilla.com/signature/?product=Firefox&proto_signature=~nsLocalFile&signature=nsStringBuffer%3A%3ARelease&date=%3E%3D2016-12-24T21%3A28%3A52.000Z&date=%3C2017-01-24T21%3A28%3A52.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1 (wider search -- lots of crashes!)
Group: core-security → dom-core-security
class nsOfflineCacheEvictionFunction final : public mozIStorageFunction {

has an:

  NS_DECL_THREADSAFE_ISUPPORTS

But access to mItems is not synchronized. This code appears to be pretty ancient. Does anyone still understand the threading? If not, we can try adding a mutex here.
Trying to find an owner for this code. Jason? Honza?
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(honzab.moz)
Already too busy.. sorry.
Flags: needinfo?(honzab.moz)
The stacks I looked at all seemed to be in the nsOfflineCacheDevice / nsDiskCacheDeviceSQL stuff, so moving this out of XPCOM, also based on the analysis in the description.
Component: XPCOM → Networking: Cache
Patrick, can you find an owner for this code? (And perhaps, confirm/deny my suspicions in comment 1).
Flags: needinfo?(mcmanus)
offline cache - I'm afraid hozna's queue is going to be the only route.
Flags: needinfo?(mcmanus) → needinfo?(honzab.moz)
All called from nsOfflineCacheEvictionFunction::Apply() -> nsLocalFile::Remove(bool) -> nsLocalFile::MakeDirty() [1] when truncating the mShortWorkingPath member.

Interesting this doesn't crash sooner in the code path.  I would not believe this was reordering.

I'll try to look at this, but really not very soon.


[1] https://dxr.mozilla.org/mozilla-central/rev/f4f374622111022d41dd8d5eb9220624135c534a/xpcom/io/nsLocalFileWin.h#98
Assignee: nobody → honzab.moz
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(honzab.moz)
This is a sec-high. When can work on this bug be scheduled?

BTW: Are there any plans to increase the bus factor for offline cache? :)
It seems bad (not only from a security perspective), if there's only one person who can work on this.
Flags: needinfo?(honzab.moz)
the goal is to remove appcache - iirc office 360 was the blocker.
Flags: needinfo?(honzab.moz)
No idea where from this heap corruption comes and unable to reproduce locally (even with some code instrumentation added to fill in the mShortWorkingPath member).

I really don't know how to move forward here.  The crash rate is tho very low, so I'm not sure I should spend too much time on this.

Throwing back to the triage bag.
Assignee: honzab.moz → nobody
I had a look through the crash reports and I see that the error is hit from multiple different threads (i.e. thread 0 and random other ones). So again, I think the problem might be as described in comment 1: simultaneous access to the same data, causing one thread to access data that has been invalidated/reallocated in another.

Compare:
https://crash-stats.mozilla.com/report/index/0b0b6324-b34b-48cd-a0e5-72c292170114
https://crash-stats.mozilla.com/report/index/a8797065-0b09-4cb4-9389-ece492170119

And see the different callstacks:
4 	xul.dll 	nsOfflineCacheEvictionFunction::Apply() 	netwerk/cache/nsDiskCacheDeviceSQL.cpp:243
5 	xul.dll 	nsOfflineCacheDevice::EvictEntries(char const*) 	netwerk/cache/nsDiskCacheDeviceSQL.cpp:1930
6 	xul.dll 	nsCacheService::EvictEntriesForClient(char const*, int)
7 	xul.dll 	nsCacheService::EvictEntriesInternal(int) 	netwerk/cache/nsCacheService.cpp:1542
8 	xul.dll 	nsOfflineCacheDevice::Evict(nsILoadContextInfo*) 	netwerk/cache/nsDiskCacheDeviceSQL.cpp:2411
9 	xul.dll 	nsApplicationCacheService::Evict(nsILoadContextInfo*) 	netwerk/cache/nsApplicationCacheService.cpp:177
10 	xul.dll 	mozilla::net::AppCacheStorage::AsyncEvictStorage(nsICacheEntryDoomCallback*)

vs

 4 	xul.dll 	nsOfflineCacheEvictionFunction::Apply() 	netwerk/cache/nsDiskCacheDeviceSQL.cpp:243
5 	xul.dll 	nsOfflineCacheDevice::EvictEntries(char const*) 	netwerk/cache/nsDiskCacheDeviceSQL.cpp:1948
6 	xul.dll 	nsOfflineCacheDiscardCache::Run()

The data that is corrupted is not protected against simultaneous access. So again, is there any reason to believe we don't simply need a mutex here? (One reasonable answer would be: we changed things around from Firefox 49 to 53).
Flags: needinfo?(honzab.moz)
By the way, there are (unfortunately) other callstacks crashing in nsLocalFile::Remove() too, so this bug likely needs to be split up in several others, because there's more than one bug here :-/
Ah!  I though this was also thread-local as the EvictionObserver was!  Yes, I've seen this being accessed on more than one thread, but I abandoned the thread-concurrency theory based on that bad presumption.

Then this should not be hard to fix.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Flags: needinfo?(honzab.moz)
Attached patch v1 (thread local for mItems) (obsolete) — Splinter Review
There are two classes involved here:

* nsOfflineCacheEvictionFunction, which is a singleton kept in mEvictionFunction during lifetime of the offline device
- only once registered in sqlite (same lifetime)
- this function (the class object) is the one that collects the deleted items in it's mItems member (replaced by the patch) that we then iterate to delete files on disk
- this function is used on more than one thread
- this function cannot be registered per thread (on demand) because two threads could try to register/deregister it concurrently, what would break sqlite or make it cry loud

* EvictionObserver, which is thread/stack local class
- that only forwards to to the eviction function and mainly forwards the Apply() function to nsOfflineCacheEvictionFunction, which iterates the mItems array and deletes the files (no inter-thread protection)

I removed the mItems member and turned to thread local instead, since we want to handle correctly failure of the sql delete query failure (by not deleting files that might have been collected by partially executed query).  If there were a single array protected by a mutex, the files would be intermixed and we could delete or not delete something we wanted or didn't want.


OTOH, why don't we delete the file directly from inside the function execution?  I'm not sure if the query itself is considered transactional or not.  So that if it fails on, say, row 3 of 5, then the 2 affected rows are reverted.  Then the collect/apply-on-success would make sense.  There is no comment why we collect and delete later, tho.


https://treeherder.mozilla.org/#/jobs?repo=try&revision=434d096e925c0f0f56f632d977e6848672ed0608
Attachment #8838510 - Flags: review?(valentin.gosu)
Comment on attachment 8838510 [details] [diff] [review]
v1 (thread local for mItems)

Review of attachment 8838510 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!
Attachment #8838510 - Flags: review?(valentin.gosu) → review+
Comment on attachment 8838510 [details] [diff] [review]
v1 (thread local for mItems)

[Security approval request comment]
How easily could an exploit be constructed based on the patch?  No idea but probably very hard...

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?  It's easily identifiable by looking at the code changes, apparently a thread race issue.

Which older supported branches are affected by this flaw?  all

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?  this code didn't change a long time, so it's only about merging (will do it soon) ; risk is low

How likely is this patch to cause regressions; how much testing does it need?  Hmmm.. this is not a heavily used  feature, so it might show no sooner than on release if there is some issue with it.  We have some automated tests that probably exercise this code.  We also have QA doing manual tests to run this code, so I think it should be covered.
Attachment #8838510 - Flags: sec-approval?
Small nits fixed that slipped the review.
Attachment #8838648 - Flags: review+
Honza: can you also ask for uplifts as appropriate?  thanks
Flags: needinfo?(honzab.moz)
Comment on attachment 8838510 [details] [diff] [review]
v1 (thread local for mItems)

sec-approval+ for trunk.
As has been mentioned, please nominate patches for Aurora, Beta, and ESR45 so we can get this everywhere.
Attachment #8838510 - Flags: sec-approval? → sec-approval+
Attachment #8838510 - Attachment is obsolete: true
Flags: needinfo?(honzab.moz)
Comment on attachment 8838648 [details] [diff] [review]
v1.1 (thread local for mItems + few fixes)

applies on m-a, m-b.

Approval Request Comment
[Feature/Bug causing the regression]: appcache
[User impact if declined]: crash caused by concurrent thread access on nsCOMArray of nsIFiles
[Is this code covered by automated tests?]: yes (or at least should be according my knowledge)
[Has the fix been verified in Nightly?]: no STR
[Needs manual test from QE? If yes, steps to reproduce]: -
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: hmm.. it's a bit larger patch, but I performed a lot of local manual testing and it seems to work well
[Why is the change risky/not risky?]: there is some (tho small) risk, since it's hard to test all possible code paths
[String changes made/needed]: none
Attachment #8838648 - Flags: approval-mozilla-beta?
Attachment #8838648 - Flags: approval-mozilla-aurora?
the patch in its version 1 has been approved in comment 19.
Keywords: checkin-needed
Attached patch v1.1 for esr45 (obsolete) — Splinter Review
[Approval Request Comment]
see comment 20.
Attachment #8838848 - Flags: approval-mozilla-esr45?
Comment on attachment 8838648 [details] [diff] [review]
v1.1 (thread local for mItems + few fixes)

Applies cleanly, please see comment 20.
Attachment #8838648 - Flags: approval-mozilla-esr52?
Comment on attachment 8838648 [details] [diff] [review]
v1.1 (thread local for mItems + few fixes)

esr52 uplift will happen automagically after this lands on Beta. Speaking of uplift, this patch needed to be rebased around bug 1060419 for landing on trunk, but should apply cleanly as-is to Aurora.
Attachment #8838648 - Flags: approval-mozilla-esr52?
Comment on attachment 8838648 [details] [diff] [review]
v1.1 (thread local for mItems + few fixes)

thread safety fix for network cache, aurora53+, beta52+
Attachment #8838648 - Flags: approval-mozilla-beta?
Attachment #8838648 - Flags: approval-mozilla-beta+
Attachment #8838648 - Flags: approval-mozilla-aurora?
Attachment #8838648 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/20b742119207
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Comment on attachment 8838848 [details] [diff] [review]
v1.1 for esr45

Fix a sec-high. ESR45+.
Attachment #8838848 - Flags: approval-mozilla-esr45? → approval-mozilla-esr45+
Group: dom-core-security → core-security-release
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main52+][adv-esr45.8+]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.