Closed Bug 1411908 Opened 7 years ago Closed 6 years ago

shutdownhang | NtWaitForAlertByThreadId | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe

Categories

(Core :: Storage: Quota Manager, defect, P2)


Tracking


RESOLVED FIXED
mozilla61
Tracking Status
firefox-esr52 --- wontfix
firefox-esr60 61+ fixed
firefox59 + wontfix
firefox60 + wontfix
firefox61 + fixed

People

(Reporter: shawnjohnjr, Assigned: janv)

References

(Blocks 2 open bugs, )

Details

(Keywords: nightly-community, topcrash)

Crash Data

Attachments

(4 files, 7 obsolete files)

31.57 KB, text/plain
Details
9.25 KB, patch
asuth
: review+
Details | Diff | Splinter Review
5.87 KB, patch
asuth
: review+
Details | Diff | Splinter Review
4.87 KB, patch
Details | Diff | Splinter Review
From:
https://crash-stats.mozilla.com/report/index/8aa93cfd-bea8-4741-90f6-a61bb0171025

0 	ntdll.dll 	NtWaitForAlertByThreadId 	
1 	ntdll.dll 	RtlSleepConditionVariableCS 	
2 	kernelbase.dll 	SleepConditionVariableCS 	
3 	mozglue.dll 	mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) 	mozglue/misc/ConditionVariable_windows.cpp:58
4 	xul.dll 	mozilla::CondVar::Wait(unsigned int) 	xpcom/threads/CondVar.h:68
5 	xul.dll 	mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::GetEvent(bool, mozilla::EventPriority*) 	xpcom/threads/ThreadEventQueue.cpp:139
6 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:965
7 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:513
8 	xul.dll 	mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe(nsISupports*, char const*, char16_t const*) 	dom/quota/ActorsParent.cpp:2908
Crash Signature: [@ shutdownhang | NtWaitForAlertByThreadId | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe]
Assignee: nobody → shuang
Hmm, I hit this once locally, but now I can't reproduce it. Uploading a log, but it doesn't look helpful.
(In reply to Shawn Huang [:shawnjohnjr] from comment #2)
> Hmm, I hit this once locally, but now I can't reproduce it. Uploading a log,
> but it doesn't look helpful.

At least I'm sure this line is never reached, since I added a log statement there:
http://searchfox.org/mozilla-central/rev/aa1343961fca52e8c7dab16788530da31aab7c8d/dom/quota/ActorsParent.cpp#2850
Crash Signature: [@ shutdownhang | NtWaitForAlertByThreadId | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] → [@ shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe]
Crash Signature: [@ shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] → [@ shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe]
Weather.com uses the Cache API and IndexedDB heavily; maybe this is related to a quota client's ShutdownWorkThreads getting stuck.
Priority: -- → P2
I investigated a bit further and found that a quota client's ShutdownWorkThreads is stuck.

See:
https://crash-stats.mozilla.com/report/index/aa51008c-e47a-43af-bbda-9c4af0171109#allthreads

Thread 33, Name: IPDL Background

0 	ntdll.dll 	NtWaitForAlertByThreadId 	
1 	ntdll.dll 	RtlSleepConditionVariableCS 	
2 	KERNELBASE.dll 	SleepConditionVariableCS 	
3 	mozglue.dll 	mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) 	mozglue/misc/ConditionVariable_windows.cpp:58
4 	xul.dll 	mozilla::CondVar::Wait(unsigned int) 	obj-firefox/dist/include/mozilla/CondVar.h:68
5 	xul.dll 	nsEventQueue::GetEvent(bool, nsIRunnable**, mozilla::BaseAutoLock<mozilla::Mutex>&) 	xpcom/threads/nsEventQueue.cpp:76
6 	xul.dll 	nsThread::nsChainedEventQueue::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) 	xpcom/threads/nsThread.cpp:839
7 	xul.dll 	nsThread::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) 	xpcom/threads/nsThread.cpp:1297
8 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1380
9 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:480
10 	xul.dll 	`anonymous namespace'::CacheQuotaClient::ShutdownWorkThreads 	dom/cache/QuotaClient.cpp:215
11 	xul.dll 	mozilla::dom::quota::QuotaManager::Shutdown() 	dom/quota/ActorsParent.cpp:3644
12 	xul.dll 	mozilla::dom::quota::QuotaManager::ShutdownRunnable::Run() 	dom/quota/ActorsParent.cpp:2817


When doing "Manager::ShutdownAll()", it doesn't finish.
Another shutdown is stuck at IO thread shutdown, |mIOThread->Shutdown()|.

https://crash-stats.mozilla.com/report/index/bdb5bac9-987c-4b52-a94f-1ee9b0171109#allthreads



Thread 30, Name: IPDL Background
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	NtWaitForAlertByThreadId 	
1 	ntdll.dll 	RtlSleepConditionVariableCS 	
2 	KERNELBASE.dll 	SleepConditionVariableCS 	
3 	mozglue.dll 	mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) 	mozglue/misc/ConditionVariable_windows.cpp:58
4 	xul.dll 	mozilla::CondVar::Wait(unsigned int) 	xpcom/threads/CondVar.h:68
5 	xul.dll 	mozilla::ThreadEventQueue<mozilla::EventQueue>::GetEvent(bool, mozilla::EventPriority*) 	xpcom/threads/ThreadEventQueue.cpp:138
6 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:967
7 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:521
8 	xul.dll 	nsThread::Shutdown() 	xpcom/threads/nsThread.cpp:802
9 	xul.dll 	mozilla::dom::quota::QuotaManager::Shutdown() 	dom/quota/ActorsParent.cpp:3715
10 	xul.dll 	mozilla::dom::quota::QuotaManager::ShutdownRunnable::Run() 	dom/quota/ActorsParent.cpp:2862
AsmJSCache
Thread 28, Name: IPDL Background

https://crash-stats.mozilla.com/report/index/7b0d0bf8-7d8f-442a-97d5-0ebf00171109#allthreads

0 	ntdll.dll 	NtWaitForAlertByThreadId 	
1 	ntdll.dll 	RtlSleepConditionVariableCS 	
2 	KERNELBASE.dll 	SleepConditionVariableCS 	
3 	mozglue.dll 	mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) 	mozglue/misc/ConditionVariable_windows.cpp:58
4 	xul.dll 	mozilla::ThreadEventQueue<mozilla::EventQueue>::GetEvent(bool, mozilla::EventPriority*) 	xpcom/threads/ThreadEventQueue.cpp:139
5 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:965
6 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:513
7 	xul.dll 	mozilla::dom::asmjscache::`anonymous namespace'::Client::ShutdownWorkThreads 	dom/asmjscache/AsmJSCache.cpp:1951
8 	xul.dll 	mozilla::dom::quota::QuotaManager::Shutdown() 	dom/quota/ActorsParent.cpp:3683
9 	xul.dll 	mozilla::dom::quota::QuotaManager::ShutdownRunnable::Run() 	dom/quota/ActorsParent.cpp:2859
IndexedDB.

https://crash-stats.mozilla.com/report/index/071d9696-ec35-4ea0-8be6-83d6d0171103#allthreads


Thread 37, Name: IPDL Background

0 	ntdll.dll 	NtWaitForAlertByThreadId 	
1 	ntdll.dll 	RtlSleepConditionVariableCS 	
2 	KERNELBASE.dll 	SleepConditionVariableCS 	
3 	mozglue.dll 	mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) 	mozglue/misc/ConditionVariable_windows.cpp:58
4 	xul.dll 	mozilla::ThreadEventQueue<mozilla::EventQueue>::GetEvent(bool, mozilla::EventPriority*) 	xpcom/threads/ThreadEventQueue.cpp:139
5 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:965
6 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:513
7 	xul.dll 	mozilla::dom::indexedDB::`anonymous namespace'::ConnectionPool::Shutdown 	dom/indexedDB/ActorsParent.cpp:12352
8 	xul.dll 	mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownWorkThreads 	dom/indexedDB/ActorsParent.cpp:18005
9 	xul.dll 	mozilla::dom::quota::QuotaManager::Shutdown() 	dom/quota/ActorsParent.cpp:3683
10 	xul.dll 	mozilla::dom::quota::QuotaManager::ShutdownRunnable::Run() 	dom/quota/ActorsParent.cpp:2859
(In reply to Shawn Huang [:shawnjohnjr] from comment #4)
> Many crash reports mentioned "Weather.com"
> 
> https://crash-stats.mozilla.com/signature/?_sort=-date&signature=shutdownhang%20%7C%20mozilla%3A%3Adom%3A%3Aquota%3A%3AQuotaManager%3A%3AShutdownObserver%3A%3AObserve&date=%3E%3D2017-10-27T07%3A31%3A24.000Z&date=%3C2017-11-03T07%3A31%3A24.000Z#comments

This is mostly related to the Cache quota client's shutdown getting stuck.
(In reply to Shawn Huang [:shawnjohnjr] from comment #8)
> Another shutdown is stuck at IO thread shutdown, |mIOThread->Shutdown()|.
> 
> https://crash-stats.mozilla.com/report/index/bdb5bac9-987c-4b52-a94f-1ee9b0171109#allthreads

I'm not sure whether this is related, but every time this symptom happens,
I also see the QuotaManager IO thread executing |CacheQuotaClient::GetUsageForOrigin|.


Thread 41, Name: QuotaManager IO
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	NtQueryAttributesFile 	
1 	KERNELBASE.dll 	GetFileAttributesW 	
2 	xul.dll 	nsLocalFile::HasFileAttribute(unsigned long, bool*) 	xpcom/io/nsLocalFileWin.cpp:3127
3 	xul.dll 	`anonymous namespace'::GetBodyUsage 	dom/cache/QuotaClient.cpp:53
4 	xul.dll 	`anonymous namespace'::GetBodyUsage 	dom/cache/QuotaClient.cpp:57
5 	xul.dll 	`anonymous namespace'::CacheQuotaClient::GetUsageForOrigin 	dom/cache/QuotaClient.cpp:210
6 	xul.dll 	mozilla::dom::quota::QuotaManager::InitializeOrigin(mozilla::dom::quota::PersistenceType, nsTSubstring<char> const&, nsTSubstring<char> const&, __int64, bool, nsIFile*) 	dom/quota/ActorsParent.cpp:4380
7 	xul.dll 	mozilla::dom::quota::QuotaManager::InitializeRepository(mozilla::dom::quota::PersistenceType) 	dom/quota/ActorsParent.cpp:4295
8 	xul.dll 	mozilla::dom::quota::QuotaManager::EnsureOriginIsInitializedInternal(mozilla::dom::quota::PersistenceType, nsTSubstring<char> const&, nsTSubstring<char> const&, nsTSubstring<char> const&, nsIFile**, bool*) 	dom/quota/ActorsParent.cpp:5242
9 	xul.dll 	mozilla::dom::quota::QuotaManager::EnsureOriginIsInitialized(mozilla::dom::quota::PersistenceType, nsTSubstring<char> const&, nsTSubstring<char> const&, nsTSubstring<char> const&, nsIFile**) 	dom/quota/ActorsParent.cpp:5199
10 	xul.dll 	mozilla::dom::indexedDB::`anonymous namespace'::OpenDatabaseOp::DoDatabaseWork 	dom/indexedDB/ActorsParent.cpp:21594
11 	xul.dll 	mozilla::dom::indexedDB::`anonymous namespace'::FactoryOp::Run 	dom/indexedDB/ActorsParent.cpp:21424
(In reply to Shawn Huang [:shawnjohnjr] from comment #7)
> I investigated a bit further and found that a quota client's
> ShutdownWorkThreads is stuck.
> 
> https://crash-stats.mozilla.com/report/index/aa51008c-e47a-43af-bbda-9c4af0171109#allthreads
> When doing "Manager::ShutdownAll()", it doesn't finish.
Ben, could you comment on how Manager::ShutdownAll() might fail to finish?
Many users left comments saying they have problems with weather.com. (https://crash-stats.mozilla.com/signature/?product=Firefox&signature=shutdownhang%20%7C%20mozilla%3A%3Adom%3A%3Aquota%3A%3AQuotaManager%3A%3AShutdownObserver%3A%3AObserve#comments)

The Cache quota client's ShutdownWorkThreads() doesn't finish.

QuotaManager calls ShutdownWorkThreads() for each quota client:
http://searchfox.org/mozilla-central/rev/c99d035f00dd894feff38e4ad28a73fb679c63a6/dom/quota/ActorsParent.cpp#3683
Flags: needinfo?(bkelly)
I won't be able to look at this until next week after I get back from TPAC.
(In reply to Shawn Huang [:shawnjohnjr] from comment #7)
> I investigated a bit further and found that a quota client's
> ShutdownWorkThreads is stuck.
> When it runs Manager::ShutdownAll(), the call never finishes.
This has been happening since 45.0a2.
https://crash-stats.mozilla.com/report/index/89dc7f7d-9d4f-431d-9b4e-2b42c0171109#allthreads
(In reply to Shawn Huang [:shawnjohnjr] from comment #12)
> (In reply to Shawn Huang [:shawnjohnjr] from comment #8)
> > Another shutdown is stuck at IO thread shutdown, |mIOThread->Shutdown()|.
> > 
> > https://crash-stats.mozilla.com/report/index/bdb5bac9-987c-4b52-a94f-1ee9b0171109#allthreads
> I'm not sure whether this is related, but every time this symptom happens,
> I also see the QuotaManager IO thread executing
> |CacheQuotaClient::GetUsageForOrigin|.
For this case, it probably doesn't make sense to run InitializeOrigin() when |QuotaManager::Shutdown()| has already finished all the ShutdownWorkThreads calls. Every database operation should check whether the shutdown procedure has already started.

But I'm not sure why mIOThread->Shutdown() gets stuck; it simply shuts down the IO thread. Maybe CacheQuotaClient::GetUsageForOrigin calls GetFileAttributesW and somehow blocks mIOThread->Shutdown(). I need to take a deeper look.
(In reply to Shawn Huang [:shawnjohnjr] from comment #9)
> AsmJSCache
> Thread 28, Name: IPDL Background
> 
> https://crash-stats.mozilla.com/report/index/7b0d0bf8-7d8f-442a-97d5-0ebf00171109#allthreads

Jan, I studied this shutdownhang bug and found it's possible that parent actors are still alive, which could lead to problems.
Shall I back out the AsmJSCache patch first, or investigate further?
Flags: needinfo?(jvarga)
So, all these stack traces... it's a bit confusing to report them like this, because some of them are not crashes on those threads, but OK, I understand.
According to these reports, there are cases where the main thread crashes while IndexedDB or asmjscache is spinning a nested event loop in ShutdownWorkThreads.
I'm not sure why we should back out the recent asmjscache fixes if we still don't know why the nested event loop "fails" in QuotaManager::ShutdownObserver::Observe.
Is it timing out?
Flags: needinfo?(jvarga)
(In reply to Jan Varga [:janv] from comment #18)
> I'm not sure why we should back out the recent asmjscache fixes if we still
> don't know why the nested event loop "fails" in
> QuotaManager::ShutdownObserver::Observe.
> Is it timing out?
It is timing out. Initially, I thought it was reasonable to check the snapshot of the other threads at timeout to see whether any quota client's ShutdownWorkThreads got stuck. Sorry, you're right, we don't need to back out any patch: unless we know how much time each quota client used within the 30 seconds, these stack traces can't prove anything. It's possible that another quota client's ShutdownWorkThreads had been spinning for a long time. My last comment was wrong.
(In reply to Shawn Huang [:shawnjohnjr] from comment #19)
> It is timing out.

Can you investigate what the default timeout for this is across all platforms?
(In reply to Jan Varga [:janv] from comment #20)
> (In reply to Shawn Huang [:shawnjohnjr] from comment #19)
> > It is timing out.
> 
> Can you investigate what the default timeout for this is across all
> platforms?
It looks like 63 seconds (FALLBACK_ASYNCSHUTDOWN_CRASH_AFTER_MS 60000 + ADDITIONAL_WAIT_BEFORE_CRASH_MS 3000).
Only MOZ_ASAN builds use 3 minutes.

https://searchfox.org/mozilla-central/rev/fe75164c55321e011d9d13b6d05c1e00d0a640be/toolkit/components/terminator/nsTerminator.cpp#392

In |QuotaManager::Shutdown()|, DEFAULT_SHUTDOWN_TIMER_MS is set to 30 seconds, so AbortOperations() is supposed to run first, and only then does the watchdog bark. I once reproduced this issue locally on Linux and checked the console log: the warning message ("Some storage operations are taking longer than expected during shutdown and will be aborted!") is printed first, then MOZ_CRASH.
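For reference, the sequencing described above looks roughly like this (a minimal paraphrase, not the actual QuotaManager code; AbortAllOperations() is a stand-in name for the real abort call):

```cpp
// Paraphrased sketch of the QuotaManager shutdown watchdog:
// 1. QuotaManager::Shutdown() asks every quota client to shut down and
//    arms a one-shot 30 s timer (DEFAULT_SHUTDOWN_TIMER_MS).
MOZ_ALWAYS_SUCCEEDS(mShutdownTimer->InitWithNamedFuncCallback(
    [](nsITimer*, void* aClosure) {
      // 2. If shutdown is still running after 30 s, warn and abort
      //    whatever operations are still pending.
      NS_WARNING(
          "Some storage operations are taking longer than expected "
          "during shutdown and will be aborted!");
      static_cast<QuotaManager*>(aClosure)->AbortAllOperations();
    },
    this, DEFAULT_SHUTDOWN_TIMER_MS, nsITimer::TYPE_ONE_SHOT,
    "QuotaManager shutdown watchdog"));

// 3. Meanwhile the main thread spins in ShutdownObserver::Observe(); if it
//    is still spinning ~63 s after shutdown began, the terminator
//    (nsTerminator.cpp) MOZ_CRASHes with this bug's signature.
```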
OK, one thing comes to mind: AbortOperations() is currently only called for quota clients. Internal operations, like clearing an origin, are never notified like this.
Also, IndexedDB only handles Database objects, i.e. already-open databases. Operations for opening/deleting databases are not handled.

We could add some debugging output after the "Some storage operations are taking longer than expected ..." warning, e.g. iterate over all directory locks or even quota objects and display client type, origin, etc.
That way we should be able to identify the operations that block shutdown.
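A minimal sketch of what such debugging output could look like (DirectoryLockImpl, mDirectoryLocks, and the accessors are assumed names modeled on the QuotaManager code of that era; the real patch may differ):

```cpp
// Hypothetical dump of still-held directory locks, to be run right after
// the "taking longer than expected" warning (all names illustrative):
void QuotaManager::DumpPendingDirectoryLocks() {
  AssertIsOnOwningThread();

  for (DirectoryLockImpl* lock : mDirectoryLocks) {
    nsPrintfCString msg("Pending directory lock: client=%d, origin=%s",
                        static_cast<int>(lock->ClientType()),
                        lock->Origin().get());
    NS_WARNING(msg.get());
  }
}
```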
Should we add telemetry for the shutdown time for different quota storage clients?

(In reply to Shawn Huang [:shawnjohnjr] from comment #16)
> For this case, it probably doesn't make sense to run InitializeOrigin() when
> |QuotaManager::Shutdown()| has already finished all the ShutdownWorkThreads
> calls. Every database operation should check whether the shutdown procedure
> has already started.

This seems like a good idea.
Flags: needinfo?(bkelly)
(In reply to Ben Kelly [:bkelly] from comment #23)
> Should we add telemetry for the shutdown time for different quota storage
> clients?
I don't know. Is it common practice to add telemetry during shutdown? Would it impact shutdown performance?
Shawn, it's also interesting that these crashes started appearing on September 26/27 on all branches. Can you check what landed at that time, or what could cause it on all branches at the same time?
Crash Signature: [@ shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] → [@ shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::…
(In reply to Jan Varga [:janv] from comment #25)
> Shawn, it's also interesting that these crashes started appearing on
> September 26/27 on all branches. Can you check what landed at that time, or
> what could cause it on all branches at the same time?

Will do.
Removing signature:
[@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown]
I believe this is a different bug.
Crash Signature: [@ shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::… → [@ shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe]
https://crash-stats.mozilla.com/report/index/8c8cce09-a889-4031-86fd-4ee070171114
"This is consistently failing with weather.com. I'm trying to stay with Firefox as the format suits me better than Chrome or Edge but weather.com is one of my main sites. I need it for work and if I can't access it I will be forced to migrate to another browser. Please help."

Something must be going wrong with that site. I guess it runs slowly, and the user tries to shut down the browser but sees a crash anyway.
I skimmed the crash reports with comments mentioning "weather.com". Most of them crash while the DOMCacheThread is executing db::DeleteCacheId() in SetupAction() [1]. BTW, their Firefox version is "56.0.2".

[1] https://crash-stats.mozilla.com/report/index/b0fd0ddc-ea58-4138-adb6-c3c0b0171108#allthreads
(In reply to Tom Tung [:tt] from comment #29)
> I skimmed the crash reports with comments mentioning "weather.com". Most of
> them crash while the DOMCacheThread is executing db::DeleteCacheId() in
> SetupAction() [1]. BTW, their Firefox version is "56.0.2".

OK, that's useful.
Tom, you're already familiar with the DOM Cache code; can you inspect that method?
(In reply to Jan Varga [:janv] from comment #30)
> OK, that's useful.
> Tom, you're already familiar with the DOM Cache code; can you inspect that
> method?

Sure, but I need to work on another bug now. I'll inspect it tomorrow.

Here is what I know so far:
SetupAction() is called during DOM Cache initialization. In db::DeleteCacheId(), it tries to delete the entries from cache.sqlite that were not deleted by the previous action.
(In reply to Tom Tung [:tt] from comment #31)
I found out they are all stuck in [1], where stepping aStatement returns SQLITE_LOCKED_SHAREDCACHE. Based on the SQLite website, this error indicates a locking conflict due to contention with a different database connection that happens to share a cache with the connection to which the error was returned.

So, the following questions come to my mind:
- It's executing SetupAction(), which runs during DOM Cache init, so there should be only one connection. Why do we get a locking-conflict error code (which implies more than one connection)?
- Shared-cache mode is disabled by default. Do we turn it on, and if so, when?

I'll look into it more deeply tomorrow.

[1] https://hg.mozilla.org/releases/mozilla-release/annotate/3702966a64c8/storage/mozStorageConnection.cpp#l1129
I don't know how it's done in the DOM Cache implementation, but it's usually better to avoid having multiple read-write transactions at the same time; SQLite can't process them in parallel anyway.
Each dom::cache::Manager should get its own thread and its own sqlite connection.  Normally this means no sharing because there is normally only one Manager at a time.

There is a weird corner case, though, where we think we are done with Cache API and start shutting down the Manager.  Before it completes, though, something else starts and tries to create a new Manager.  Perhaps weather.com is hitting that corner case somehow.

It would be useful to try to understand the pattern of API calls weather.com is making.
This also seems likely to be related to bug 1368273 (I'm cc'ing people from it), which also seemed to suggest the orphan case mentioned in comment 31 and to involve weather.com.
Is this related to bug 1405290? The dates of first appearance seem suspiciously similar.
(In reply to Shawn Huang [:shawnjohnjr] from comment #16)
> (In reply to Shawn Huang [:shawnjohnjr] from comment #12)
> > (In reply to Shawn Huang [:shawnjohnjr] from comment #8)
> > > Another shutdown is stuck at IO thread shutdown, |mIOThread->Shutdown()|.
> > > 
> > > https://crash-stats.mozilla.com/report/index/bdb5bac9-987c-4b52-a94f-1ee9b0171109#allthreads
> > I'm not sure whether this is related, but every time this symptom happens,
> > I also see the QuotaManager IO thread executing
> > |CacheQuotaClient::GetUsageForOrigin|.
> For this case, it probably doesn't make sense to run InitializeOrigin() when
> |QuotaManager::Shutdown()| has already finished all the ShutdownWorkThreads
> calls. Every database operation should check whether the shutdown procedure
> has already started.
I was wrong: OpenDatabaseOp::DoDatabaseWork actually checks |QuotaClient::IsShuttingDownOnNonBackgroundThread()| before calling |EnsureOriginIsInitialized()|. But then why do I see code running EnsureOriginIsInitialized on the QuotaManager IO thread in the snapshot taken at the 60-second shutdown timeout? If |QuotaClient::IsShuttingDownOnNonBackgroundThread()| worked correctly, it should never fall into |EnsureOriginIsInitialized|.

Maybe the code entered |EnsureOriginIsInitialized()| before shutdown happened, and the thread then hung while doing NtQueryAttributesFile.

https://searchfox.org/mozilla-central/rev/a5d613086ab4d0578510aabe8653e58dc8d7e3e2/dom/indexedDB/ActorsParent.cpp#21628
In a recent-ish call, :bkelly raised the possibility that the interactive radar map feature that can be found at https://weather.com/weather/radar/interactive/l/USCA0746:1:US and uses the Cache API for tiling could be involved.  Previously, in my investigations, I had only suspected the ServiceWorker and its use of sw-toolbox that would delete caches on update installation.

Setting NI on :shawnjohnjr as a to-do for both of us and maybe others to try and meet up at the all hands and look into what's happening here further.
Flags: needinfo?(shuang)
Thank you, Andrew. See you at the all hands.
Reading the recent logs again, e.g.:
https://crash-stats.mozilla.com/report/index/37d126df-05c7-4541-9ec1-7f0810171210#allthreads

In Thread 45, the QuotaManager IO thread is always stuck in nsLocalFile::HasFileAttribute(unsigned long, bool*) while doing GetFileAttributesW, and the QuotaManager thread is stuck at |mIOThread->Shutdown()|, which would make sense.
This also happened on an ESR build.

This signature pattern occurs a lot, and only on Windows.
I discussed comment 38 with :asuth yesterday; this bug is probably not related to bug 1368273. But it's worth checking the IndexedDB case in comment 10 further.

https://crash-stats.mozilla.com/report/index/071d9696-ec35-4ea0-8be6-83d6d0171103#allthreads

In Thread 37 (IPDL Background), |ConnectionPool::Shutdown| is executing, waiting for |mShutdownComplete|.

Meanwhile, Thread 40 (IndexedDB #1) is waiting for |mContinueRunning| to become false.

https://searchfox.org/mozilla-central/rev/9d920555ec81f1c9d4b7fa3b08e23eb88efb60e1/dom/indexedDB/ActorsParent.cpp#13307

It looks like |ConnectionPool::Cleanup()| has not been called for the database yet.
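For context, here is a rough paraphrase of the two waits involved (member and method names approximate the ActorsParent.cpp code of that era, not the exact source):

```cpp
// The PBackground thread blocks here until every connection thread is done:
void ConnectionPool::Shutdown() {
  AssertIsOnOwningThread();

  mShutdownRequested = true;
  // ... close idle databases and stop idle threads ...

  // Spins until Cleanup() has run for every database and flipped the flag.
  MOZ_ALWAYS_TRUE(
      mozilla::SpinEventLoopUntil([&]() { return mShutdownComplete; }));
}

// Meanwhile each "IndexedDB #N" thread keeps processing events until the
// pool tells it to stop; if a transaction never completes, mContinueRunning
// never becomes false and mShutdownComplete is never set.
NS_IMETHODIMP ConnectionPool::ThreadRunnable::Run() {
  while (mContinueRunning) {
    MOZ_ALWAYS_TRUE(NS_ProcessNextEvent(NS_GetCurrentThread(),
                                        /* aMayWait */ true));
  }
  return NS_OK;
}
```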
Flags: needinfo?(shuang)
(In reply to Shawn Huang [:shawnjohnjr] from comment #40)
> In Thread 45, the QuotaManager IO thread is always stuck in
> nsLocalFile::HasFileAttribute(unsigned long, bool*) while doing
> GetFileAttributesW, and the QuotaManager thread is stuck at
> |mIOThread->Shutdown()|, which would make sense.
> This signature pattern occurs a lot, and only on Windows.

Jan, do you have any comments regarding this signature? We looked at it together at the all-hands.
Users report crashes with this signature a lot and mention that they hit it many times a day.
I can't explain why the |GetFileAttributesW| Windows API hangs and causes mIOThread->Shutdown() to get stuck.
Somehow, in recent releases, this has been observed on Windows quite often.
Flags: needinfo?(jvarga)
Flags: needinfo?(jvarga)
(In reply to Shawn Huang [:shawnjohnjr] from comment #42)
> Jan, do you have any comments regarding this signature? We looked at it
> together at the all-hands.
I discussed this with Jan; this symptom can be fixed and it's actionable.
Fixing it should at least reduce the shutdown-hang rate.
It's not a regression, since it has been appearing since very early versions (before 52).
I will post a patch for this case soon. :)
Hello Shawn - this is #5 top crash for 58 Beta 13 - are you still working on a patch? Thanks.
Flags: needinfo?(shuang)
(In reply to Marcia Knous [:marcia - use ni] from comment #44)
> Hello Shawn - this is #5 top crash for 58 Beta 13 - are you still working on
> a patch? Thanks.
Yes. I'm still working on the patch as I mentioned in Comment 43.
Flags: needinfo?(shuang)
Hey, Andrew, can you take a look at this?
Flags: needinfo?(bugmail)
I can't verify that the patch fixes the stuck OpenDatabaseOp, since I still can't reproduce the symptom.
(In reply to Shawn Huang [:shawnjohnjr] from comment #47)
> Created attachment 8942037 [details] [diff] [review]
> [WIP] Bug 1411908 - Wait for FactoryOps to be finished
> 
> I can't verify that the patch fixes the stuck OpenDatabaseOp, since I still
> can't reproduce the symptom.

Unfortunately, I found that it's possible for gFactoryOps to be non-null in ShutdownWorkThreads.
Comment on attachment 8942037 [details] [diff] [review]
[WIP] Bug 1411908 - Wait for FactoryOps to be finished

Review of attachment 8942037 [details] [diff] [review]:
-----------------------------------------------------------------

Shawn, (at least for now) you don't have to try to use the same array of ops to handle shutdown. Adding/removing ops to/from this array is currently done in places that matter for handling concurrent access to the same database file. Shutdown is a bit different: for shutdown we have to wait until the directory lock is released. So try to create a new array and add/remove elements similarly to how it's done in AsmJSCache.cpp.
After reviewing the IndexedDB implementation,
I have a different viewpoint on the shutdownhang case where the "QuotaManager IO" thread is busy in QuotaManager::InitializeRepository(), called from OpenDatabaseOp::DoDatabaseWork():
1. This ongoing OpenDatabaseOp::DoDatabaseWork() was clearly started before QuotaManager::Shutdown() happened, because QuotaClient::IsShuttingDownOnNonBackgroundThread() is checked at the beginning of OpenDatabaseOp::DoDatabaseWork(), before running into QuotaManager::InitializeRepository().
2. From the crash reports, it seems that whenever this symptom happens, the "QuotaManager IO" thread is busy in QuotaManager::InitializeRepository(), which looks heavy to me: there is a while loop iterating over all the sub-directories, so it can take long if there are lots of origins in the profile to initialize and the platform itself is busy.
3. I don't see why it would be faster to wait in QuotaClient::ShutdownWorkThreads() for QuotaManager::InitializeRepository() to return, given that QuotaManager::InitializeRepository() was called before shutdown started.
4. It makes more sense to me to return NS_ERROR_ABORT early from the while loop of QuotaManager::InitializeRepository() if QuotaManager::IsShuttingDown() is true; then OpenDatabaseOp is aborted earlier and we don't have to add a redundant FactoryOps array to wait on in QuotaClient::ShutdownWorkThreads(). (See the sketch after this list.)
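A minimal sketch of the early-abort idea in item 4 (the directory-enumeration details are illustrative; only IsShuttingDown() and NS_ERROR_ABORT are taken from the discussion above):

```cpp
// Paraphrased shape of QuotaManager::InitializeRepository() with the
// suggested bail-out:
nsresult QuotaManager::InitializeRepository(PersistenceType aPersistenceType) {
  nsCOMPtr<nsIDirectoryEnumerator> entries;
  // ... open the repository directory and get its entries ...

  nsCOMPtr<nsIFile> originDir;
  while (NS_SUCCEEDED(entries->GetNextFile(getter_AddRefs(originDir))) &&
         originDir) {
    if (IsShuttingDown()) {
      // Shutdown began while we were iterating; abort instead of
      // initializing the remaining origins so the pending OpenDatabaseOp
      // fails fast with NS_ERROR_ABORT.
      return NS_ERROR_ABORT;
    }
    // ... GetUsageForOrigin()/InitializeOrigin() for this directory ...
  }
  return NS_OK;
}
```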
(In reply to Bevis Tseng[:bevis][:btseng] from comment #50)
> 4. It makes more sense to me to return NS_ERROR_ABORT early from the while
> loop of QuotaManager::InitializeRepository() if
> QuotaManager::IsShuttingDown() is true; then OpenDatabaseOp is aborted
> earlier and we don't have to add a redundant FactoryOps array to wait on in
> QuotaClient::ShutdownWorkThreads().
Hmm, OpenDatabaseOp::DoDatabaseWork does check |QuotaClient::IsShuttingDownOnNonBackgroundThread()| before calling |EnsureOriginIsInitialized()|. But I forgot that InitializeRepository can be a heavily time-consuming operation, and shutdown can happen during the InitializeRepository stage.
We can add more IsShuttingDown() checks here and there, but ShutdownWorkThreads should still release any quota objects or directory locks related to the specific quota client.
Imagine a case where OpenDatabaseOp is about to open a SQLite database (EnsureOriginIsInitialized was already called and succeeded). Now we get the shutdown notification and the shutdown flag is set. If the database has a really old schema and is quite big, the upgrade will take a long time. If we don't let it finish before mIOThread->Shutdown() is called in QuotaManager::Shutdown(), we can get a problem similar to the one reported in this bug. We would have to add IsShuttingDown() checks everywhere.
So I think it's safer to have a redundant array, which we may try to unify with the existing one at some point; a sketch of the idea follows.
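A minimal sketch of that approach (modeled on the AsmJSCache pattern mentioned earlier; the registration hook points are assumptions, and gLiveOps is an illustrative name for what the patch calls gFactoryOps):

```cpp
// PBackground-thread-only registry of live factory operations:
static nsTArray<FactoryOp*>* gLiveOps = nullptr;

// When an op is created (on the PBackground thread):
//   gLiveOps->AppendElement(this);
// When the op finishes and sends its results:
//   gLiveOps->RemoveElement(this);

void QuotaClient::ShutdownWorkThreads() {
  AssertIsOnBackgroundThread();

  // Spin the event loop until every tracked op has unregistered itself, so
  // QuotaManager::Shutdown() cannot proceed to mIOThread->Shutdown() while
  // an op still holds a directory lock or is using the IO thread.
  MOZ_ALWAYS_TRUE(mozilla::SpinEventLoopUntil(
      []() { return !gLiveOps || gLiveOps->IsEmpty(); }));
}
```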
I see! I misunderstood; I thought the purpose of waiting for the operations to complete was to speed up shutdown.
Two more important crashes related to SQLITE_LOCKED_SHAREDCACHE that I'm looking into:
1. IndexedDB QuotaClient shutdown
Thread 42 (IndexedDB) shows a ConnectionPool shutdown failure, which leads to the QuotaManager shutdown timeout.
https://crash-stats.mozilla.com/report/index/a4a5e242-c2cd-43ad-883d-2d4320180116#allthreads

2. Cache QuotaClient shutdown
Thread 51 (DOMCacheThread) is stuck, which causes manager->ShutdownAll() to hang.
https://crash-stats.mozilla.com/report/index/28742d4a-d0ea-4e3d-bd86-a31ea0180113#allthreads

Both are in the loop in |StepStatement|, which causes the DOMCache thread and IndexedDB ConnectionPool shutdown failures.
They look like similar problems.

But I looked at all the threads, and no other thread is accessing SQLite in these two cases.

Andrew, do you have any ideas about this SQLite problem?
Is your reference to SQLITE_LOCKED_SHAREDCACHE derived from :tt's comment 32?  Both of the stacks you link to show SQLite actively executing.  We'd only be worried about SQLITE_LOCKED_SHAREDCACHE if we saw the stack trace sitting on the WaitForUnlockNotify line https://hg.mozilla.org/releases/mozilla-release/annotate/3702966a64c8/storage/mozStorageConnection.cpp#l1138.
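For reference, the pattern in question looks roughly like this (a simplified paraphrase of the mozStorage retry loop around sqlite3_step; error handling is omitted):

```cpp
// Step the statement; if another connection in this process holds the
// shared cache, block via SQLite's unlock-notify mechanism and retry.
int srv;
while ((srv = ::sqlite3_step(aStatement)) == SQLITE_LOCKED_SHAREDCACHE) {
  srv = WaitForUnlockNotify(::sqlite3_db_handle(aStatement));
  if (srv != SQLITE_OK) {
    break;
  }
  ::sqlite3_reset(aStatement);
}
```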

Specifically:

- That while loop is intended to amount to: "Try and perform the step, but if the step is blocked by another connection in the same process, we have an optimization that lets our thread sleep/block until that other thread has released.  Of course, the contention can happen again, so stick this in a loop."  Most of the time we expect the condition to run sqlite3_step for side-effect and not fall into the body of the while loop.

- For 1., thread 42 is actively doing disk I/O in service of an IDB get().  This seems potentially like a straightforward situation where some combination of slow disk I/O and a backlog of user requests that aren't sufficiently cancelled leads to the timeout being hit.

- For 2., thread 51 is much more interesting and is what bkelly was describing in comment 34.  It looks like what's happened is:
  - dom/cache/QuotaClient's ShutdownWorkThreads() has been invoked and this has begun shutdown of all the managers.
  - But a Manager was re-created because cache::Manager::GetOrCreate() and cache::Manager::Factory::GetOrCreate() don't have any concept of shutdown being triggered, and someone called them after the manager in question invoked Manager::RemoveContext.  We really want Manager::ShutdownAll() and Manager::Factory::ShutdownAll() to have a concept of having been shutdown beyond Factory's mInSyncShutdown guard which is strictly a conceptual variant on a kung-fu death grip.
    - It's possible the PrincipalVerifier mechanism may be involved because it ends up batching CacheOpParent instances and invoking them in a tight loop that doesn't yield back to the enclosing event loop, which makes it easier for us to not fall out of the spun nested event loop and for sFactory to become null and then become non-null again, maybe.  I may also be wrong about this; extra analysis would be required to be sure, and the GetOrCreate guards seem appropriate and straightforward.
  - This new manager has created a SetupAction in Manager::Init, which is not currently blocked on I/O, but is in the process of cleaning up orphaned DOM cache instances because the "context_open.marker" was found to be still hanging around when there was an attempt to use the DOM Cache API after its manager had already been shutdown.
  - (If we hadn't triggered a timer-based crash, it's possible the above would have cleaned up and we wouldn't have heard anything, as the factory would shutdown again, allowing the nested event loop to complete and QM to finish shutdown.)


So I'm not totally sure about case 1, but for case 2, I think we want to add guards so that Manager::GetOrCreate() insta-fails without re-creating the Factory or any Managers once its ShutdownAll() method has been entered; something like the sketch below.
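A minimal sketch of such a guard (sFactoryShutdown and its exact placement are assumptions for illustration, not the eventual patch):

```cpp
// Hypothetical shutdown guard for dom/cache; the flag is only touched on
// the PBackground thread, where all Manager bookkeeping happens.
static bool sFactoryShutdown = false;

// static
nsresult Manager::GetOrCreate(ManagerId* aManagerId, Manager** aManagerOut) {
  mozilla::ipc::AssertIsOnBackgroundThread();

  if (sFactoryShutdown) {
    // ShutdownAll() has begun; refuse to resurrect the Factory or create
    // any new Manager.
    return NS_ERROR_ILLEGAL_DURING_SHUTDOWN;
  }
  return Factory::GetOrCreate(aManagerId, aManagerOut);
}

// static
void Manager::ShutdownAll() {
  mozilla::ipc::AssertIsOnBackgroundThread();

  sFactoryShutdown = true;
  // ... existing logic: shut down live Managers and spin until done ...
}
```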
Flags: needinfo?(bugmail)
Attachment #8942602 - Attachment is obsolete: true
Attachment #8942602 - Flags: review?(jvarga)
Attachment #8942607 - Attachment is obsolete: true
Attachment #8942607 - Flags: review?(jvarga)
Attachment #8943208 - Flags: review+
Assignee: shuang → nobody
Assignee: nobody → asuth
Assignee: asuth → bugmail
I'm going to test the second patch a bit ...
This is related to bug 1435958, which itself could be a dupe?
Andrew, you are assigned to this bug, which seems to happen more and more on the latest beta and central. Can you please have a look, or find someone else who could work on it? Thanks.
Flags: needinfo?(bugmail)
Apologies for the delay. The reduction in available engineers on our team at the beginning of the year has made it hard to address this in a timely fashion and leaves a very small pool of other candidates. I think :janv may have availability now and can coordinate cleanup here with his existing cleanup on bug 1286798, which should land soon (modulo fallout from bug 1373244 being shelved). I'll check with my manager this afternoon.
See Also: → 1437575
See Also: → 1286798
Tracking requested due to high crash volume/top crasher.

Top Crashers for Firefox 59.0.2

2 	2.5% 	new 	shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe	1635 	1635 	0 	0 	1621 	0 	2016-04-14
	
Top Crashers for Firefox 59.0.1

5 	1.77% 	0.05% 	shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe	2221 	2221 	0 	0 	2230 	0 	2016-04-14	

Top Crashers for Firefox 60.0b7

6 	1.47% 	new 	shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe	162 	162 	0 	0 	157 	0 	2016-04-14 	
	
Top Crashers for Firefox 60.0b6

7 	1.64% 	0.83% 	shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe	279 	279 	0 	0 	245 	0 	2016-04-14	

Top Crashers for Firefox 57.0

10 	1.77% 	-0.16% 	shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe	38 	38 	0 	0 	38 	0 	2016-04-14

Top Crashers for Firefox 58.0.2

12 	0.91% 	0.02% 	shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe	156 	156 	0 	0 	155 	0 	2016-04-14
	
Top Crashers for Firefox 61.0a1 	

19 	0.79% 	0.12% 	shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe	39 	39 	0 	0 	39 	0 	2016-04-14 

Top Crashers for Firefox 52.7.3esr

54 	0.21% 	new 	shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe	122 	122 	0 	0 	106 	0 	2016-04-14 

Top Crashers for Firefox 52.7.2esr

91 	0.15% 	0.04% 	shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe	101 	101 	0 	0 	90 	0 	2016-04-14
We know it's the top crash on release. 
So, it seems not really useful to paste in comments from crash-stats - we can all click through to that. 
Tracking for 59 and 60, since if we come up with a fix we'll probably want to uplift at least to beta.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #71)
> We know it's the top crash on release. 
> So, it seems not really useful to paste in comments from crash-stats

You know that it's a top crash and yet prior to my edits no one had marked this bug with the "top crash" keyword, nor had anyone requested tracking for this bug.

You know that this bug causes extreme frustration and grief to your users; that the user experience due to this bug is so bad that people are abandoning your product and yet you don't have enough engineers to work on it!
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #71)
> We know it's the top crash on release. 
> So, it seems not really useful to paste in comments from crash-stats - we
> can all click through to that. 

I consider copying and pasting comments from crash reports to bugs to be useful.

The data in Crash Stats is constantly changing. I copied some of the data from Crash Stats to this Bug to help maintain a historical context. It is a snapshot in time that I think is important. If you look at the crash stats tomorrow, or a month from now, the data will be different. 

For example, this comment is useful: Comment 44 • 3 months ago - Hello Shawn - this is #5 top crash for 58 Beta 13

If this bug is still open 6 or 12 months from now and someone comes to look at it, my comments and the data I copied and pasted will still be visible. But the data in crash-stats for TODAY will NOT be available in the future. 

Heck, even crash reports prior to Dec 24 2017 are all missing (Bug 1427111). So if anyone copied and pasted useful data such as comments out of those crash reports or crash stats into bugs prior to the crash reports all being deleted, that copied data would still be available today and would still be useful. 

   Bug 1427111 - [meta] crash autosubmission cleanup tracker - "Decision to delete crash reports is a pretty big deal - This is a huge decision as it leaves the Mozilla team with no telemetry data about Firefox crashes for the last nine months." (BleepingComputer.com)
taking
Flags: needinfo?(bugmail)
"Perf" key word?
(In reply to Jan Varga [:janv] from comment #74)
> taking

(moving assignee status given I think this was the intent.)
Assignee: bugmail → jvarga
Crash Signature: [@ shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] → [@ shutdownhang | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ shutdownhang | static bool mozilla::SpinEventLoopUntil<T>…
Depends on: 1448957
Crash Signature: mozilla::SpinEventLoopUntil<T> ] → mozilla::SpinEventLoopUntil<T> ] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ]
Please note that I put some steps for reproduction via available automated tests on bug 1449538 comment 5 and comment 7.
(In reply to Jan Varga [:janv] from comment #74)
> taking

Jan, I wonder if you have some updates for us meanwhile. Thanks.
Flags: needinfo?(jvarga)
(In reply to Henrik Skupin (:whimboo) from comment #78)
> (In reply to Jan Varga [:janv] from comment #74)
> > taking
> 
> Jan, I wonder if you have some updates for us meanwhile. Thanks.

Yes, I'm trying to consolidate IDB's QuotaClient::ShutdownWorkThreads to cover all edge cases.
Flags: needinfo?(jvarga)
Attachment #8943208 - Attachment is obsolete: true
Attachment #8943209 - Attachment is obsolete: true
Andrew, could you please have a look at the review requests? It would be kinda helpful to have this reviewed and landed soon. Thanks.
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)
(In reply to Henrik Skupin (:whimboo) from comment #85)
> Andrew, could you please have a look at the review requests? It would be
> kinda helpful to have this reviewed and landed soon. Thanks.

I suspect he's got other bugs ahead of this one in his review queue, would anyone else be capable of reviewing this?
Andrea, do you want to review this?
Flags: needinfo?(amarchesini)
Review is under way.
Flags: needinfo?(bugmail)
Flags: needinfo?(amarchesini)
Comment on attachment 8968117 [details] [diff] [review]
Part 1: Wait for all live database objects to finish when shutting down

Review of attachment 8968117 [details] [diff] [review]:
-----------------------------------------------------------------

r=asuth with the changes in question made, but also feel free to ask for a re-review, I can turn it around very fast now that the shutdown stuff is back in my head.

Restating: This patch alters the spin loop to wait for all databases to be closed.  In order to ensure that happens, it requests that all live databases close themselves.  (However, per my understanding, in most e10s cases we'll already have invalidated the databases, so I've requested below that we just move to invalidation for consistency and simplicity.)

::: dom/indexedDB/ActorsParent.cpp
@@ -14247,5 @@
>  Database::CleanupMetadata()
>  {
>    AssertIsOnBackgroundThread();
>  
> -  if (!mMetadataCleanedUp) {

Removing this guard doesn't seem obviously safe.  There are 2 callers to CleanupMetadata:
- Database::Invalidate()
- Database::ConnectionClosedCallback, initiated by Database::MaybeCloseConnection, with WaitForTransactionsHelper::WaitForTransactions() being the one to eventually invoke the callback.

It seems possible for the latter to be in flight when an Invalidation is triggered.  It's possible this is somehow safe, but things are so complex and the DecreaseBusyCount() call below is not idempotent, so a change like this needs a really good comment that ideally references strong assertions that help guarantee the comment is accurate.

Please add comments/assertions or leave this as-is.

::: dom/indexedDB/PBackgroundIDBDatabase.ipdl
@@ +77,5 @@
>    async Invalidate();
>  
>    async CloseAfterInvalidationComplete();
>  
> +  async RequestClose();

I think if we're adding anything to the state machine here, we need to make sure we're at least adding comments to the new bits.  That said, it's not clear to me that we want to add this...

In the e10s case, I believe this is what we expect to happen surrounding shutdown:
- ContentParent sees the "profile-before-change" observer notification and shuts down the content parent.  See https://searchfox.org/mozilla-central/rev/fcd5cb3515d0e06592bba42e378f1b9518497e3d/dom/ipc/ContentParent.cpp#2808.  Note that this happens strictly before "profile-before-change-qm" which is when QuotaManager begins shutdown.
- QM has special logic to invoke QMS::AbortOperationsForProcess(childId) at https://searchfox.org/mozilla-central/rev/fcd5cb3515d0e06592bba42e378f1b9518497e3d/dom/ipc/ContentParent.cpp#1406 and since the QMS isn't in shutdown, that will get queued.  Note that this is invoked prior to the PContent IPC channel being closed, and PBackground won't actually get closed by the ContentParent shutdown mechanism at all... all our PBackground-related actors are expected to potentially engage in normal docshell closure shutdown behaviors if the process doesn't get hard-killed.  At least I think that's what happens, I'm going by somewhat stale IPC logs right now.
- IDB's QuotaClient will invoke Invalidate() on all of its databases: https://searchfox.org/mozilla-central/rev/fcd5cb3515d0e06592bba42e378f1b9518497e3d/dom/indexedDB/ActorsParent.cpp#17825

This does leave the non-e10s case, which means Fennec and Chrome-privileged IDB consumers that live in the parent process like the Push API.  So the new code in QuotaClient::ShutdownWorkThreads that uses RequestClose() will be used there.  That leads to a weird asymmetry, so I think it makes sense to just have that code also use Invalidate() and not add RequestClose().

Firefox shutdown in general is poorly documented, so it's possible I'm confused about what's going on.  If you've seen contradictory documentation elsewhere, or your own experiences differ, please do describe them here.  (And maybe add some code comments to that effect! ;)
Attachment #8968117 - Flags: review?(bugmail) → review+
Comment on attachment 8968118 [details] [diff] [review]
Part 2: Wait for all factory operations to finish when shutting down

Review of attachment 8968118 [details] [diff] [review]:
-----------------------------------------------------------------

Restating: This patch:
- Alters the shutdown spin loop to also wait for FactoryOps like OpenDatabaseOp to be completed/no longer exist.  It's no good checking for the databases to all be closed if one can try to open during/after shutdown.
- Moves FactoryOps to be tracked from their time of creation rather than after QuotaManager::OpenDirectory call completes.  Otherwise there's a potential race where an open operation can sneak in.
  - This allows FactoryOp::DirectoryOpen to assert gFactoryOps exists as that scenario turned out to be indicative of something bad happening.

Question: What do you think about bumping the IncreaseBusyCount/DecreaseBusyCount assertions to be MOZ_DIAGNOSTIC_ASSERT instead of just MOZ_ASSERT? r=asuth on that too if you're on board.

::: dom/indexedDB/ActorsParent.cpp
@@ +13704,5 @@
>    }
>  
> +  gFactoryOps->AppendElement(actor);
> +
> +  // Balanced in CleanupMetadata().

Maybe extend the comment here to be:
Balanced in CleanupMetadata(), which is/must always be called by SendResults().

@@ +22485,2 @@
>    } else if (mDirectoryLock) {
>      nsCOMPtr<nsIRunnable> callback = NewRunnableMethod(

For clarity, maybe add a comment here like:
// ConnectionClosedCallback will call CleanupMetadata().
Attachment #8968118 - Flags: review?(bugmail) → review+
(In reply to Andrew Sutherland [:asuth] from comment #89)
> Restating: This patch alters the spin loop to wait for all databases to be
> closed.  In order to ensure that happens, it requests that all live
> databases close themselves.  (However, per my understanding, in most e10s
> cases we'll already have invalidated the databases, so I've requested below
> that we just move to invalidation for consistency and simplicity.)

I don't think databases are invalidated at that point. They are invalidated either by Database::ActorDestroy (which happens later in the game) or by AbortOperationsForProcess (which also happens later).
I don't think we want to rely on AbortOperationsForProcess(); that exists just for the case where a content process crashes or something and we want to abort anything storage-related to it.
Anyway, Invalidate also aborts existing transactions; I think we just want "allow to close".
That's why I added the new RequestClose message.
It's a bit confusing, but Invalidate and RequestClose share the same bool flag, mClosed.
So Invalidate is almost the same as Close; it just doesn't wait for existing transactions to finish, it aborts them instead.

> Removing this guard doesn't seem obviously safe.  There are 2 callers to
> CleanupMetadata:
> - Database::Invalidate()
> - Database::ConnectionClosedCallback, initiated by
> Database::MaybeCloseConnection, with
> WaitForTransactionsHelper::WaitForTransactions() being the one to eventually
> invoke the callback.

I'm removing the caller in Invalidate, because if Invalidate is called first I don't want it
to remove metadata so early.

> Please add comments/assertions or leave this as-is.

Sure, I can make those assertions stronger.

> > +  async RequestClose();
> 
> I think if we're adding anything to the state machine here, we need to make
> sure we're at least adding comments to the new bits.  That said, it's not
> clear to me that we want to add this...

Ok, I can try to do that.

> In the e10s case, I believe this is what we expect to happen surrounding
> shutdown:
> - ContentParent sees the "profile-before-change" observer notification and
> shuts down the content parent.  See
> https://searchfox.org/mozilla-central/rev/
> fcd5cb3515d0e06592bba42e378f1b9518497e3d/dom/ipc/ContentParent.cpp#2808. 
> Note that this happens strictly before "profile-before-change-qm" which is
> when QuotaManager begins shutdown.
> - QM has special logic to invoke QMS::AbortOperationsForProcess(childId) at
> https://searchfox.org/mozilla-central/rev/
> fcd5cb3515d0e06592bba42e378f1b9518497e3d/dom/ipc/ContentParent.cpp#1406 and
> since the QMS isn't in shutdown, that will get queued.  Note that this is
> invoked prior to the PContent IPC channel being closed, and PBackground
> won't actually get closed by the ContentParent shutdown mechanism at all...
> all our PBackground-related actors are expected to potentially engage in
> normal docshell closure shutdown behaviors if the process doesn't get
> hard-killed.  At least I think that's what happens, I'm going by somewhat
> stale IPC logs right now.
> - IDB's QuotaClient will invoke Invalidate() on all of its databases:
> https://searchfox.org/mozilla-central/rev/
> fcd5cb3515d0e06592bba42e378f1b9518497e3d/dom/indexedDB/ActorsParent.cpp#17825

I don't think AbortOperationsForProcess() is called at that point. Remember
that Invalidate also aborts, and we never wanted to shut down by aborting immediately.
That's only done after 30 seconds (that is, if shutdown is not done after 30 seconds).

> Firefox shutdown in general is poorly documented, so it's possible I'm
> confused about what's going on.  If you've seen contradictory documentation
> elsewhere, or your own experiences differ, please do describe them here. 
> (And maybe add some code comments to that effect! ;)

Yeah, shutdown is a nightmare. It wasn't easy to come up with these patches.
Fortunately I was working on a better shutdown for the new local storage implementation,
so I used some ideas from it here.
(In reply to Jan Varga [:janv] from comment #91)
> (In reply to Andrew Sutherland [:asuth] from comment #89)
> > Restating: This patch alters the spin loop to wait for all databases to be
> > closed.  In order to ensure that happens, it requests that all live
> > databases close themselves.  (However, per my understanding, in most e10s
> > cases we'll already have invalidated the databases, so I've requested below
> > that we just move to invalidation for consistency and simplicity.)
> 
> I don't think databases are invalidated at that point. They are invalidated
> either
> by Database::ActorDestroy (that happens later in the game) or
> AbortOperationsForProcess
> (that also happens later).
> I don't think we want to rely on AbortOperationsForProcess(), that exists
> just for
> the case when a content process crashes or something and we want to abort
> anything
> storage related to it.

Can you elaborate on why you think AbortOperationsForProcess() happens later?  As I understand it, RequestClose() is happening inside QuotaManager's shutdown, which is "profile-before-change-qm", which happens after "profile-before-change".  So timeline-wise we have:

- "profile-before-change" observer event (synchronous)
  - ContentParent::Observe does:
      ShutDownProcess(SEND_SHUTDOWN_MESSAGE);
      SpinEventLoopUntil([&]() { return !mIPCOpen || mCalledKillHard; });
  - ContentChild::ShutdownInternal calls SendFinishShutdown()
  - ContentParent::RecvFinishShutdown invokes ShutDownProcess(CLOSE_CHANNEL);
  - ShutDownProcess gets to this code block:
    if (QuotaManagerService* quotaManagerService = QuotaManagerService::Get()) {
      quotaManagerService->AbortOperationsForProcess(mChildID);
    }
  - ShutDownProcess then gets to the call to Close() which will actually trigger shutdown of the channel
  - Now ContentParent::ActorDestroy gets called, which sets mIPCOpen = false, allowing the spin loop above to complete.
- "profile-before-change-qm" observer event fires
  - QuotaManager::ShutdownObserver::Observe creates a ShutdownRunnable
  - ShutdownRunnable invokes QuotaManager::Shutdown, which invokes ShutdownWorkThreads on all quota clients.


Note that I just tried to reproduce this locally with a debug build using MOZ_IPC_MESSAGE_LOG=1, but we always end up destroying the PBrowser explicitly.  So neither set of logic gets a chance to run.  I could probably make it happen with a ChromeWorker though, and I'm beginning to suspect that it's indeed Chrome-privileged JS running in either the parent process or a ChromeWorker that is the source of these hangs.

The PBrowser shutdown leads to the next thing...

> Anyway, Invalidate also aborts existing transactions; I think we just want
> "allow to close".
> That's why I added the new RequestClose message.
> It's a bit confusing, but Invalidate and RequestClose share the same bool
> flag, mClosed.
> So Invalidate is almost the same as Close, except it doesn't wait for
> existing transactions to finish; it aborts them instead.

When the PBrowser gets destroyed, we already Invalidate the database via InvalidateInternal and thereby the transactions.  After the bug 1450266 landing, this happens when DOMEventTargetHelper's DisconnectFromOwner() is invoked when the global is being cleaned up.  I reviewed that change in https://bugzilla.mozilla.org/show_bug.cgi?id=1447871#c24 but the main context is that :bkelly patterned that design decision after the existing logic in IDBDatabase::Observer::Observe, which does InvalidateInternal() on the "inner-window-destroyed" notification.  (Note that we only did the observer on the main thread, which is/was an inconsistency with Workers.)  (Also note that we now can/should remove that observer logic.)

I think the choice to abort the transactions makes sense for a lot of reasons:
- Closing a tab is explicit user intent.  Likewise, shutting down the browser is also explicit intent.
- IDB allows for infinite-loop transactions, like the one demonstrated at https://github.com/w3c/IndexedDB/issues/72#issuecomment-199890369, that we cannot allow to hang browser shutdown.  That's the whole point of this bug, after all.
- Transactions get automatically aborted if there's an error/exception in dispatching the success event.  Which there will be if there's no JSContext because the global is dead.  Also, semantically, we can't assume that a transaction is ready to be committed if JS can't run.  And it can't run once the global is dead.  Otherwise we're potentially looking at truncating transactions.
  - Note that I think @inexorabletash is on board with adding semantics that would make it possible for JS code to close out a commit per our discussion at https://github.com/w3c/IndexedDB/issues/234#issuecomment-381707058 but that doesn't change things here.

Which is all to say, our existing behavior is to abort transactions and I think we should stick with that.  If you still think automatically triggering a close is the right thing, I'd like to better understand how this will impact:
- infinite loop transactions
- interrupting a multi-step transaction that's still issuing new write mutations.  Will the transaction be committed in its partial state?  Or is the closing just a nicer way to abort the transaction?  (And if it's just a nicer abort, what do we gain from that?)


Also, I think I might now have a better idea of how to create a test for this, and/or a better repro.  The RuntimeService triggers shutdown at "xpcom-shutdown", so a ChromeWorker will not be shut down until *after* QuotaManager gets shut down.  So a ChromeWorker in a content process should reproduce the ContentParent shutdown scenario, and a ChromeWorker in the parent process should reproduce the ShutdownWorkThreads scenario.


> I'm removing the caller in Invalidate, because if Invalidate is called first
> I don't want it
> to remove metadata so early.

I missed that, apologies.  I definitely will have no complaints if we start using phabricator sooner rather than later, because it makes it easier to check the post-patch state!


> I don't think AbortOperationsForProcess() is called at that point. Remember
> that Invalidate also aborts, and we never wanted to shut down by aborting
> immediately.
> That's only done after 30 seconds (that is, if shutdown is not done after 30
> seconds).

I think I already addressed the abort stuff above, but I'm interested if you can clarify about "we never wanted to shutdown by aborting immediately".  You definitely have more context about design intent.  (It's also possible back when I worked on the Firefox OS email app I complained about shutdown betraying me, but in general I think I always saw the phenomenon where the "success" event would not be able to dispatch to my worker and so the transaction would end up getting rolled back anyways.)
Ok, I'll check the flow again, but if I remember correctly IPC was still available in ShutdownWorkThreads.
Ok, so all this is a bit more complicated. You are absolutely right about the case when there's a normal e10s browser app running. AbortOperationsForProcess is called before ShutdownWorkThreads (I was mistaken, sorry for that).
However, non-e10s and, for example, xpcshell don't get AbortOperationsForProcess at all, so we can't rely just on that.
Some stuff may have changed since b2g, and I'm really surprised that we invalidate/abort transactions when a tab is closed or the entire app quits (quits normally, not a hard kill).
If we wanted to keep the old logic that I remember, AbortOperationsForProcess should have a flag indicating whether it's a normal closing or a crash, so we could either wait for all transactions to finish w/o aborting them, or abort and wait.
The "always invalidate" route is tempting because it's quite simple to implement/fix this bug, but I want to make sure that we make the right decision here.
Are we 100% sure we want to abort transactions when a tab is closed or the app quits?
Do we know what other browsers do?
Re: chromium/blink behavior

From https://cs.chromium.org/chromium/src/third_party/blink/renderer/modules/indexeddb/idb_database.cc?l=589 we see that the connection is closed when the global is freed, like our DOMEventTargetHelper:

  void IDBDatabase::ContextDestroyed(ExecutionContext*) {
    // Immediately close the connection to the back end. Don't attempt a
    // normal close() since that may wait on transactions which require a
    // round trip to the back-end to abort.
    if (backend_) {
      backend_->Close();
      backend_.reset();
    }
  }


In terms of a higher level shutdown like our QuotaManager shutdown, it appears that this is handled by their QuotaManager defining a task runner that explicitly blocks shutdown.  That is, at https://cs.chromium.org/chromium/src/storage/browser/quota/quota_manager.cc?q=shutdown+file:src/storage/browser/quota/&sq=package:chromium&dr=C&l=828 there's:
      db_runner_(base::CreateSequencedTaskRunnerWithTraits(
          {base::MayBlock(), base::TaskPriority::USER_VISIBLE,
           base::TaskShutdownBehavior::BLOCK_SHUTDOWN})),

That runner gets marked for deletion when the QuotaManager's destructor is invoked at https://cs.chromium.org/chromium/src/storage/browser/quota/quota_manager.cc?l=1149 which is:
  QuotaManager::~QuotaManager() {
    proxy_->manager_ = NULL;
    for (auto* client : clients_)
      client->OnQuotaManagerDestroyed();
    if (database_)
      db_runner_->DeleteSoon(FROM_HERE, database_.release());
  }

Note that while OnQuotaManagerDestroyed looks interesting, the IndexedDB client's implementation is boring at https://cs.chromium.org/chromium/src/content/browser/indexed_db/indexed_db_quota_client.cc?sq=package:chromium&l=75
  void IndexedDBQuotaClient::OnQuotaManagerDestroyed() { delete this; }


The mechanism that seems to correspond to "profile-before-change" is on BrowserContext (their profile analog, I think), and all it seems to do related to our concept of QM is to shut down service workers.  See https://cs.chromium.org/chromium/src/content/browser/browser_context.cc?sq=package:chromium&l=384 and note this really useful comment that suggests that the windows/workers will be terminated before any such high-level shutdown happens:
  // Service Workers must shutdown before the browser context is destroyed,
  // since they keep render process hosts alive and the codebase assumes that
  // render process hosts die before their profile (browser context) dies.
  ForEachStoragePartition(browser_context,
                          base::Bind(ShutdownServiceWorkerContext));
Er, that is, BrowserContext::NotifyWillBeDestroyed corresponds to "profile-before-change".  I forgot to copy-and-paste that.
(In reply to Jan Varga [:janv] from comment #94)
> If we wanted to keep the old logic that I remember,
> AbortOperationsForProcess should have a flag, indicating if it's a normal
> closing or a crash, so we could either wait for all transactions to finish
> w/o aborting them or abort and wait.
> The "always invalidate" route is tempting because it's quite simple to
> implement/fix this bug, but I want to make sure that we're going to do the
> right decision here.
> Are we 100% sure we want to abort transactions when a tab is closed or the
> app quits ?

## General Content Cases ##

I believe so.  Here's my rationale, expanding on my thoughts in hunk 2 of comment 92 with more support.

From the spec for transaction lifetime at https://w3c.github.io/IndexedDB/#transaction-lifetime-concept we've got two key facts:
- From step 2, "The implementation must allow requests to be placed against the transaction whenever the active flag is set."
- From step 7, "When a transaction has been started and it can no longer become active, the implementation must attempt to commit it, as long as the transaction has not been aborted."

A consequence of this is that if we can't dispatch a "success" event at content code, then we must abort the transaction because we can't know whether the "success" event would have issued an additional request.  To do otherwise is to risk violating the entirety of the IDB transaction semantics.  That is, if IDB-consuming code put()s "foo" then put()s "bar" when foo's success comes back, the author expects the DB to have both "foo" and "bar" or none of them.  Never just "foo".
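
To make that concrete, here's a tiny standalone analogue (plain C++, not IDB code): request 2 is only issued from request 1's "success" callback, so if the callback can never run, the only semantics-preserving outcome is to abort rather than commit just "foo":

  #include <map>
  #include <optional>
  #include <string>

  // Standalone analogue of a chained-request transaction.
  std::optional<std::map<std::string, std::string>>
  RunTransaction(bool successCallbackCanRun)
  {
    std::map<std::string, std::string> txn;
    txn["a"] = "foo";       // request 1
    if (!successCallbackCanRun) {
      return std::nullopt;  // global is dead: abort; never commit "foo" alone
    }
    txn["b"] = "bar";       // request 2, issued from the success callback
    return txn;             // commit both writes atomically
  }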

Focusing on the Worker life-cycle because it's better specified in the HTML spec and a little more straightforward to analyze in our implementations: Once a worker hits the Closing state or Terminating state (documented at https://searchfox.org/mozilla-central/source/dom/workers/WorkerHolder.h#17) (and there's no sync event loop for sync XHR... but let's ignore that), it becomes impossible for any more tasks to be dispatched to/on the Worker global *per spec*.  Note that we had/have a pretty big hole in our implementation where, out of necessity, IPC runnables could still be delivered, plus runnables dispatched after the initial task queue draining.  :baku has some fixes pending for this (or maybe already landed?) and we need to do more fixes here.

The Closing state is somewhat of an edge-case, because it's when worker code has done self.close() but hasn't yet yielded back to the event loop.  Terminating (or higher) is the real shutdown state where the global is dead as far as JS content is concerned, which is why WorkerGlobalScope invokes DisconnectEventTargets() in NoteTerminating().  This is the line after which all transactions that haven't yet invoked SendCommit() are doomed to be aborted.

I assert that aborting is thus correct because anything beyond the Terminating and DOMEventTargetHelper::DisconnectFromOwner() stage is either doomed to abort and/or is just wasted effort.  We can't touch the global anymore, etc.  (I do think IDB has been trying to touch the global on workers previously, but as I sorta traced out, our attempt to dispatch the event ends up failing because the global is dead.)


## Edge Cases ##

There's 2 cases where the above DOMEventTargetHelper::DisconnectFromOwner() probably doesn't come into play:
1. Chrome code that doesn't have its global torn down during "profile-before-change".  This could include several different cases... I need to do more investigation to know for sure, but my argument is we should treat everything like the above, so I'm going to just enumerate them here:
  a. XUL globals and their ChromeWorkers.  It's possible the reason I saw the PBrowser terminations is the window got closed and the tabbrowser tore down all its <browser> instances.
  b. content <browser> instances in the parent process?  If 'a' gets nuked, then these aren't a problem.
  c. JSMs and JS XPCOM implementations and their ChromeWorkers.  Because we let so much of the browser be implemented in JS and these have no window lifecycle that constrains them, they are absolutely likely to survive until "xpcom-shutdown", which is after "profile-before-change-qm".
2. QM and IDB tests that perform a synthetic shutdown/restart of QuotaManager.  This is very handy for testing, but also not actually realistic.

We can absolutely make the argument that for "1" that our chrome-privileged code should know what it's doing and have the right to ensure that its writes hit disk even if shutdown is happening.  But it's also the case that they can just use https://searchfox.org/mozilla-central/source/toolkit/components/asyncshutdown/AsyncShutdown.jsm to add a shutdown blocker and block "profile-before-change" from completing.  In the cases where they haven't added a blocker, I think it's reasonable to abort the transactions.  If we're worried that we might subtly break things because the IDB-using code did not think about shutdown and assumed all transactions would be given the chance to complete, we can add a MOZ_DIAGNOSTIC_ASSERT in ShutdownWorkThreads() to aggressively surface the problem.  Ideally we'd also surface the consumer in the crash assertion, but we need to be careful about that.
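
A sketch of that diagnostic, with a hypothetical helper name (the exact condition and message would need the care mentioned above):

  // Hypothetical; AnyDatabaseHasPendingTransactions() is an assumed helper.
  // MOZ_DIAGNOSTIC_ASSERT is fatal on Nightly/DevEdition, so a consumer that
  // relied on its transactions surviving shutdown would show up in crash
  // reports instead of silently losing writes.
  MOZ_DIAGNOSTIC_ASSERT(!AnyDatabaseHasPendingTransactions(),
                        "IDB transactions aborted by QuotaManager shutdown; "
                        "missing an AsyncShutdown blocker?");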

For the "2" testing case, well, the semantics are of course up to us, but I think the only thing that makes sense is to be consistent with the content and chrome-privileged cases, and aborting is appropriate there.  Otherwise we're not testing what we think we're testing.  (And really, we shouldn't be testing QuotaManager going away, because in reality QuotaManager should always outlive content and basically outlive chrome-privileged scopes.)
That's a lot of information, let me parse it. So far it looks like invalidation is the way to go.
Anyway, thank you very much for this huge effort on your part.
(In reply to Andrew Sutherland [:asuth] from comment #90)
> ::: dom/indexedDB/ActorsParent.cpp
> @@ +13704,5 @@
> >    }
> >  
> > +  gFactoryOps->AppendElement(actor);
> > +
> > +  // Balanced in CleanupMetadata().
> 
> Maybe extend the comment here to be:
> Balanced in CleanupMetadata(), which is/must always be called by SendResults().

Ok, extended the comment.

> 
> @@ +22485,2 @@
> >    } else if (mDirectoryLock) {
> >      nsCOMPtr<nsIRunnable> callback = NewRunnableMethod(
> 
> For clarity, maybe add a comment here like:
> // ConnectionClosedCallback will call CleanupMetadata().

Ok, added the comment.
Attachment #8968117 - Attachment is obsolete: true
Attachment #8972778 - Flags: review?(bugmail)
(In reply to Jan Varga [:janv] from comment #99)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=de9b7be70a3d54d797161c0bcaa4c119cced4e74

one of the xpcshell tests failed, I fixed it here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=03adf7bad29b067cca5fac32a69a0a737358b7e9
(In reply to Jan Varga [:janv] from comment #99)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=de9b7be70a3d54d797161c0bcaa4c119cced4e74

I triggered a couple more wdspec jobs to see if that patch fixes all of our timeout failures and indeed it does! Thanks a lot.
(In reply to Henrik Skupin (:whimboo) from comment #103)
> (In reply to Jan Varga [:janv] from comment #99)
> > https://treeherder.mozilla.org/#/
> > jobs?repo=try&revision=de9b7be70a3d54d797161c0bcaa4c119cced4e74
> 
> I triggered a couple more wdspec jobs to see if that patch fixes all of our
> timeout failures and indeed it does! Thanks a lot.

Cool, thanks for the verification.
Comment on attachment 8972778 [details] [diff] [review]
Part 1: Wait for all live database objects to finish when shutting down

Review of attachment 8972778 [details] [diff] [review]:
-----------------------------------------------------------------

Woo!

Restating: This is the previous part 1 but with an explicit abort instead of the request-close mechanism.  Our rationale for the abort over a close is in comment 97.

Aside re Shutdown: My statement in comment 89 that "Note that this is invoked prior to the PContent IPC channel being closed, and PBackground won't actually get closed by the ContentParent shutdown mechanism at all..." is misleading.  A more accurate picture:
- Under debug builds and similar invariant-checking builds like ASAN and TSAN, we give the content child an unlimited amount of time to shut down and we go through the full shutdown process.  This allows leak-checking to happen.
- Under release/non-debug builds, the content child only has 5 seconds to send a "FinishShutdown" message or it will be hard-killed.  Once it does send "FinishShutdown" and the parent closes the PContent channel, ContentChild::ActorDestroy will invoke ProcessChild::QuickExit()[1] which triggers what amounts to an exit(0) call[2], immediately terminating the process.
More detailed notes are at https://github.com/asutherland/asuth-gecko-notes/blob/87236cafa73b6ccd594d727fc397db452c239021/ipc/SHUTDOWN.md

1: https://searchfox.org/mozilla-central/rev/ce9ff94ffed34dc17ec0bfa406156d489eaa8ee1/dom/ipc/ContentChild.cpp#2357
2: https://searchfox.org/mozilla-central/rev/ce9ff94ffed34dc17ec0bfa406156d489eaa8ee1/ipc/glue/ProcessChild.cpp#39
Attachment #8972778 - Flags: review?(bugmail) → review+
Pushed by jvarga@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1a0c61c2cf60
Part 1: Wait for all live database objects to finish when shutting down; r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/f579a2844485
Part 2: Wait for all factory operations to finish when shutting down; r=asuth
https://hg.mozilla.org/mozilla-central/rev/1a0c61c2cf60
https://hg.mozilla.org/mozilla-central/rev/f579a2844485
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
For Firefox 60 it should be too late, but I feel that we should consider taking the fix for ESR60.
[Tracking Requested - why for this release]: comment 108.  Probably something to consider for 60.1 if things look ok in beta61.
Crash Signature: mozilla::SpinEventLoopUntil<T> ] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ] → mozilla::SpinEventLoopUntil<T> ] [@ shutdownhang | static bool mozilla::SpinEventLoopUntil<T> mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ]
No longer blocks: 1449538
Could this go into an ESR point release?
Do we have evidence showing this helped in 61?
Flags: needinfo?(hskupin)
(In reply to Julien Cristau [:jcristau] from comment #112)
> Do we have evidence showing this helped in 61?

Yes, all the timeout failures for new session tests in wdspec are gone. So from our side this helped a lot.
Flags: needinfo?(hskupin)
Jan, would you be comfortable requesting uplift to esr60?
Flags: needinfo?(jvarga)
Comment on attachment 8968118 [details] [diff] [review]
Part 2: Wait for all factory operations to finish when shutting down

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: It's a top crash.
User impact if declined: Users would still often experience shutdown hangs.
Fix Landed on Version: 61
Risk to taking this patch (and alternatives if risky): We haven't seen any issues since it landed on 61.
String or UUID changes made by this patch:  None
Flags: needinfo?(jvarga)
Attachment #8968118 - Flags: approval-mozilla-esr60?
Comment on attachment 8984354 [details] [diff] [review]
Part 1: Wait for all live database objects to finish when shutting down (for ESR60)

Topcrash fix that's baked on 61 for a while now. Approved for ESR 60.1.
Attachment #8984354 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
Attachment #8968118 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
Crash Signature: mozilla::SpinEventLoopUntil<T> ] [@ shutdownhang | static bool mozilla::SpinEventLoopUntil<T> mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ] → mozilla::SpinEventLoopUntil<T> ] [@ shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ]
This is still visible in the Windows nightly 20180621100048, as topcrash #8,
with 7 crashes from 7 different installations.
Flags: needinfo?(jvarga)
This is still a topcrash (#6) in 62 beta as well. Should we open a new bug here?
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #120)
> This is still a topcrash (#6) in 62 beta as well. Should we open a new bug
> here?

I filed Bug 1487194 for 2 of the signatures - will take a look at the rest later.

(In reply to Marcia Knous [:marcia - needinfo? me] from comment #121)
> (In reply to Liz Henry (:lizzard) (needinfo? me) from comment #120)
> > This is still a topcrash (#6) in 62 beta as well. Should we open a new bug
> > here?
> 
> I filed Bug 1487194 for 2 of the signatures - will take a look at the rest
> later.

Yes, bug 1487194 is the new bug for remaining issues in quota manager shutdown.

Flags: needinfo?(jvarga)
Blocks: 1633342
No longer blocks: 1633342