Closed Bug 1368273 Opened 7 years ago Closed 6 years ago

Crash in mozilla::dom::cache::StreamControl::CloseAllReadStreams

Categories

(Core :: DOM: Core & HTML, defect, P2)

53 Branch
defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox-esr52 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- wontfix

People

(Reporter: philipp, Assigned: mdaly)

References

Details

(4 keywords)

Crash Data

This bug was filed from the Socorro interface and is 
report bp-40a72e0d-a564-4079-bbc0-f5a310170525.
=============================================================
Crashing Thread (30)
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::dom::cache::StreamControl::CloseAllReadStreams() 	dom/cache/StreamControl.cpp:72
1 	xul.dll 	mozilla::dom::cache::StreamList::Cancel() 	dom/cache/StreamList.cpp:149
2 	xul.dll 	mozilla::dom::cache::Context::CancelAll() 	dom/cache/Context.cpp:899
3 	xul.dll 	mozilla::dom::cache::Manager::Shutdown() 	dom/cache/Manager.cpp:1815
4 	xul.dll 	`anonymous namespace'::CacheQuotaClient::ShutdownWorkThreads 	dom/cache/QuotaClient.cpp:211
5 	xul.dll 	mozilla::dom::quota::QuotaManager::Shutdown() 	dom/quota/ActorsParent.cpp:3217
6 	xul.dll 	mozilla::dom::quota::QuotaManager::ShutdownRunnable::Run() 	dom/quota/ActorsParent.cpp:2439
7 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1240
8 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp:390
9 	xul.dll 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:368
10 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:231
11 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:211
12 	xul.dll 	nsThread::ThreadFunc(void*) 	xpcom/threads/nsThread.cpp:490
13 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397
14 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:95
15 	ucrtbase.dll 	_o__CIpow 	
16 	kernel32.dll 	BaseThreadInitThunk 	
17 	ntdll.dll 	__RtlUserThreadStart 	
18 	ntdll.dll 	_RtlUserThreadStart

this is a low level crash across platforms that got a bit more common after 2017-05-22 & is mostly mostly UAF.

a number of comments mention they where on weather.com when the crash happened, so maybe there was a recent change on that page that is tickling this crash now.
Group: core-security → dom-core-security
Ben, do you have time to look at this?
Flags: needinfo?(bkelly)
Andrew, do you have any time to look at this?
Flags: needinfo?(bkelly) → needinfo?(bugmail)
Yes, looking.  Leaving needinfo up for nags.
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)
Checkboxes are hard.
Flags: needinfo?(bugmail)
No solution yet, investigation results thus far:
- weather.com isn't doing anything too crazy with ServiceWorkers or cache.  It's using sw-toolbox for its cache purposes and it has a few of those.  That said, it is a very busy website, caching a fair bit.  (716K caches.sqlite, 11M morgue).  Having said that, I did see some additional cache stuff going on in Firefox devtools and under Chrome I saw the SW update itself, so there may also be some kind of A/B thing going on or a war of the service workers or something.
- This is definitely a shutdown crash.  Reading between the lines of the few weather.com comments in crash-stats and the stacks, I think users may have experienced bad browser janking and then quit, rather than weather.com having any direct causality.  Since most of the crashes seem to be non-e10s (at least that's how I'm interpreting the lack of a "DOMIPCEnabled" flag in meta), we certainly would expect any jank to be unpleasant.
- Code inspection seemed good.  The obvious explanation for the crash is that the non-refcounted CacheStreamControlParent got deleted by IPC but the (refcounted) StreamList still had a reference.  However, the CacheStreamControlParent's ActorDestroy explicitly tells the StreamList to forget about it. The control flow around all of that is simple and benefits from good hygiene so higher level logic errors still shouldn't be able to screw these invariants up.  Given that PCacheStreamControl is directly managed by PBackground, there aren't even interesting tree edge-cases.
- I taught nsIQuotaManagerService's test functionality how to explicitly shutdown a QuotaClient in the hopes of being able to reproduce this in a mochitest.  (nsIQMS's reset() doesn't shut down the QuotaClients.)  I've got the helper working, but it hasn't borne fruit yet.  Presumably because I still need to cause the PBackground channel to self-destruct before triggering the shutdown.

I'll attempt to better reproduce the shutdown circumstances as they relate to IPC shutdown tomorrow.
There is another weak-ref in the mix.  The Context maintains a mActivityList or something which contains StreamList objects.  I didn't see anything obviously broken there either, but its worth double checking.

The other thought I had was perhaps the shutdown code spinning the event loop is breaking some invariant.  We expect ActorDestroy() to get called in a separate runnable, but perhaps its running within the spun loop and surprising us somehow.
Note, Cache API is enabled in FF52 ESR, but service workers are not enabled.  So the possibility of triggering this there should be greatly reduced.
Priority: -- → P2
This has been discussed off-bug a bit through the months.  The main summary was that there was no smoking gun, and my efforts at reproduction through various manual and speculative automated stress tests related to creating orphaned caches and inducing quotaclient shutdown via specialized QM logic did not manage to reproduce.

Recently bug 1411908 appeared and seems related.  In discussion, :bkelly indicated that in his investigatory efforts, he'd found that the site uses the Cache API for its weather map UI that is layered over a google maps style interface, with the mapbox tiles being cached in the Cache API.  This potentially introduces a new usage pattern beyond the use of sw-toolbox and its cache that we should still investigate.  It also increased our hopes of being able to reproduce, as the sw-toolbox scenario would only result in crashes when the SW updated and a cache would be destroyed.

There are also some over-arching concerns about QuotaManager shutdown related to locks and the potential for resuscitation of cache API manager instances near shutdown.  (Which may or may not be related to some SQLite contention that's been observed.  I've also seen that somewhat recently with IDB which should share similar IO thread invariants to DOM cache, so really there's a lot of investigation that needs to happen all around.)

I'm clearing the assignee right now per discussion with :overholt to better indicate this bug is not actively being worked at this time (at least not by me).  I will proactively follow bug 1411908's investigation in the near term and attempt to assist.  We may want to update the dependencies or involve a meta bug related to shutdown issues for QM and the cache API, as concerns do seem to be spread across several bugs.  I'll try to arrange some discussion of this or maybe a brief mini-sprint on that at the imminent all-hands.


Aside: bug 1413112 which has not landed yet may help improve things as it cleans up the use of WorkerPrivate.
Assignee: bugmail → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(bugmail)
Andrew: please find appropriate folks to work on this security bug.
Assignee: nobody → overholt
Could bug 1411908 be related here?
Flags: needinfo?(bugmail)
Andrew (Sutherland) indicated bug 1411908 could be related in comment 9, btw
Flags: needinfo?(bugmail)
This is Marion territory now but I'm thinking we should probably resolve INCOMPLETE given comment 9.
Assignee: overholt → mdaly
Confirmed. This is an INCOMPLETE given the cold trail. 

We'll reopen if this pops up again in the future.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Group: dom-core-security
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.