Open Bug 1588498 (qm-shutdown-hangs) Opened 5 years ago Updated 3 days ago

[meta] Remaining issues that may cause QuotaManager shutdown hangs

Categories

(Core :: Storage: Quota Manager, defect, P2)

defect

Tracking


ASSIGNED
Tracking Status
firefox-esr78 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 --- wontfix

People

(Reporter: janv, Assigned: jstutte)

References

(Depends on 14 open bugs, Blocks 1 open bug)

Details

(Keywords: leave-open, meta, topcrash, Whiteboard: [tbird crash])

Crash Data

Attachments

(2 files)

No description provided.
Depends on: 1588502
No longer blocks: 1541370
Component: DOM: IndexedDB → DOM: Quota Manager
Summary: [meta] Remaining issues that may cause IndexedDB shutdown hangs → [meta] Remaining issues that may cause QuotaManager shutdown hangs
Depends on: 1588510
Depends on: 1588512
Depends on: 1588519
Depends on: 1588522
Depends on: 1588525
Crash Signature: [@ mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownTimedOut] [@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownTimedOut] [@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_3:…
Crash Signature: namespace)::QuotaClient::ShutdownWorkThreads::$_3::__invoke] [@ mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownWorkThreads::<T>::__invoke] → namespace)::QuotaClient::ShutdownWorkThreads::$_3::__invoke] [@ mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownWorkThreads::<T>::__invoke] [@ mozilla::dom::(anonymous namespace)::QuotaClient::ShutdownTimedOut] [@ mozilla::dom::(a…
See Also: → 1541370
See Also: → idb-shutdown-hangs
See Also: → ls-shutdown-hangs, 1542485
Type: task → defect
Depends on: 1616084
Depends on: 1593863
Crash Signature: mozilla::dom::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_15::__invoke] [@ mozilla::dom::`anonymous namespace'::QuotaClient::ShutdownTimedOut] [@ mozilla::dom::`anonymous namespace'::QuotaClient::ShutdownWorkThreads::<T>::__invoke] → mozilla::dom::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_15::__invoke] [@ mozilla::dom::`anonymous namespace'::QuotaClient::ShutdownTimedOut] [@ mozilla::dom::`anonymous namespace'::QuotaClient::ShutdownWorkThreads::<T>::__invoke] [@ m…
Depends on: 1666214
Depends on: 1672369
Crash Signature: mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke] → mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke] [@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
Crash Signature: mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke] [@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ] → mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke] [@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ] [@ mozilla::dom::quota::Client::ShutdownWorkThreads::$_29::op…
Crash Signature: mozilla::dom::quota::Client::ShutdownWorkThreads::$_29::operator() const::{lambda#1}::__invoke ] → mozilla::dom::quota::Client::ShutdownWorkThreads::$_29::operator() const::{lambda#1}::__invoke ] [ @ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
Crash Signature: mozilla::dom::quota::Client::ShutdownWorkThreads::$_29::operator() const::{lambda#1}::__invoke ] [ @ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ] → mozilla::dom::quota::Client::ShutdownWorkThreads::$_29::operator() const::{lambda#1}::__invoke ] [@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ] [@ shutdownhang | RtlAcquireSRWLockExclusive | mozilla…
Crash Signature: mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::Observer::Observe ] → mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::Observer::Observe ] [@ mozilla::dom::quota::Client::ShutdownWorkThreads::$_29::operator() const::{lambda#1}::__invoke ] [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell…
Crash Signature: nsAppShell::ProcessNextNativeEvent ] → nsAppShell::ProcessNextNativeEvent ] [@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke] [@ mozilla::dom::quota::QuotaManager::Shutdown::$_8::operator() const::{lambda#1}::__invoke]
Very high volume!
[@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
[@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke ]

Normal/Low volume
[@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell::ProcessNextNativeEvent ]
[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownTimedOut ]
[@ mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownWorkThreads::<T>::__invoke ]
[@ mozilla::dom::`anonymous namespace'::QuotaClient::ShutdownTimedOut ]
[@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
[@ shutdownhang | RtlAcquireSRWLockExclusive | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::Observer::Observe ]

Thunderbird 78.x only
[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke ]

68 only:
[@ mozilla::dom::(anonymous namespace)::QuotaClient::ShutdownTimedOut ]
Crash Signature: [@ mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownTimedOut] [@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownTimedOut] [@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_3:… → [@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ] [@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke ] [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell…

This attachment contains the value of QuotaManagerShutdownTimeout where set.

Flags: needinfo?(jstutte)
Crash Signature: namespace'::QuotaClient::ShutdownTimedOut ] [@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ] [@ shutdownhang | RtlAcquireSRWLockExclusive | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::… → namespace'::QuotaClient::ShutdownTimedOut ] [@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ] [@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke ]
Flags: needinfo?(jstutte)

Apparently the volume spiked here since we unified the quota client shutdown timeouts in bug 1681670. In turn, the volume in other bugs (like bug 1542485 and bug 1541776) went down by the same order of magnitude.

So we can be fairly confident that we did not regress the overall quality of quota manager shutdown; we just moved some signatures around and hopefully now have a better perspective on them.

Is there an update on this?
I really need this bug fixed, as it severely affects my Firefox performance, making it really bad under prolonged or heavy use. I've had Firefox be slow, freeze, and crash for 3 hours straight because of this bug.
I run into this bug every three days to a week.
Are you having problems reproducing this bug? I can sometimes reproduce it on my computer through "stress testing", but it appears sporadically and seems related to how memory is allocated. It appears more often once the Windows memory-caching service Superfetch is turned off.

Hi :desbest, we are working hard to make this bug actionable. This bug is supposed to have a noticeable impact only during shutdown, leaving the Firefox process open for a few more seconds than normal and then reporting a crash after the shutdown timeout. If this is what you are seeing, then reliable STR (steps to reproduce) or a captured profile could help move this forward. If you have any other concrete problem during use, please file a new bug, ideally with clear STR and/or a captured profile.

In my experience this bug also affects the performance of Firefox before it shuts down: Firefox runs slower in sessions where this bug ends up happening on shutdown.
It's hard to provide reliable STR because when I follow the same steps, sometimes the bug happens and sometimes it doesn't.
Does a captured profile mean to use the Firefox Profiler? https://profiler.firefox.com/

I suspect the bug is connected to when a website uses IndexedDB.

(In reply to desbest from comment #6)

In my experience this bug also affects the performance of Firefox before it shuts down: Firefox runs slower in sessions where this bug ends up happening on shutdown.

It might be in some weird state that also affects shutdown, of course. Is there any other load (other programs with high CPU usage) on the machine when this happens?

It's hard to provide reliable STR because when I follow the same steps, sometimes the bug happens and sometimes it doesn't.
Does a captured profile mean to use the Firefox Profiler? https://profiler.firefox.com/

Yes, exactly. It could at least give hints about what is slowing down your instance in general. Please choose the "Firefox Platform" setting in the profiler while recording. Or, if you suspect IndexedDB to be involved, you might want to use a custom preset and switch on the "All File IO" feature and maybe the "DOM Worker" thread.

I can't get the profiler to work: after finishing capturing, when I click the upload button nothing happens, so I can't share my capture with others.

Also, how do I profile the startup and shutdown of Firefox?
https://profiler.firefox.com/docs/#/./guide-startup-shutdown
Where do I put the environment variable? Do I have to change the Windows settings as in the image below?

It won't let me post images. The images are here: https://imgur.com/a/uqGwi1C

I've spent 4 days trying to reproduce the bug, and I've noticed two things using two methods.

===========

For method one: when I use the profiler to record reproducing the bug, Firefox freezes for over 3 hours and never resumes. I waited as long as I could and found it took 12 hours for the freezing to stop, after which Firefox spontaneously closed. So I couldn't get the profiler to work, and I can't reproduce the bug using the first method unless the profiler is fixed so it stops freezing.

===========

For method two: I use the Pocket Tube extension and wait for it to finish loading. If I open a YouTube video to play it, click the tag icon at the top that Pocket Tube generates, and then close Firefox, there will be a crash. However, the signature of the crash is different each time. Here is a profiler capture: https://share.firefox.dev/3mO77vC

Firefox 87.0 Crash Report [@ shutdownhang | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | sqlite3_finalize ]

Firefox 87.0 Crash Report [@ AsyncShutdownTimeout | places.sqlite#1: waiting for clients | Transaction (114),places.sqlite#1: PlacesExpiration.jsm: expire (115) ]

Flags: needinfo?(jstutte)

(In reply to desbest from comment #11)

I've spent 4 days trying to reproduce the bug, and I've noticed two things using two methods.

Thanks for persisting!

Firefox 87.0 Crash Report [@ shutdownhang | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | sqlite3_finalize ]

Here is a corresponding crash report.
We try to finalize statements in places.sqlite on the main thread while another thread (Thread 66, Name: mozStorage #2) is executing something on another SQLite DB.

Firefox 87.0 Crash Report [@ AsyncShutdownTimeout | places.sqlite#1: waiting for clients | Transaction (114),places.sqlite#1: PlacesExpiration.jsm: expire (115) ]

There are quite a few reports with similar signatures.
Again we see places.sqlite involved here. We run into this shutdown barrier. We might just be waiting on some slow transaction.

:mak, can you see something obvious here?

Flags: needinfo?(jstutte) → needinfo?(mak)

See also bug 1421562 ?

BTW: the signatures mentioned in comment #11 should not be considered part of this specific bug; they are yet another kind of async shutdown timeout.

:desbest, do you have by chance the setting "Privacy & Security" -> "History" set to "Never remember history" or "Clear history when Nightly closes" ? Then I suspect you are affected by something similar to bug 1700584.

Flags: needinfo?(afaninthehouse)

It’s set to remember history.

(In reply to Jens Stutte [:jstutte] from comment #12)

We try to finalize statements in places.sqlite on the main thread while another thread (Thread 66, Name: mozStorage #2) is executing something on another SQLite DB.

How can you tell it's another db and not the same? It could be:

  1. thread 2 is executing some very long and expensive query on places.sqlite, and the main thread is just waiting for it to finish
  2. there is a deadlock between thread 2 and the main thread, both on places.sqlite, that will never be resolved even waiting forever
  3. thread 2 is executing Sync and it's taking too long

Firefox 87.0 Crash Report [@ AsyncShutdownTimeout | places.sqlite#1: waiting for clients | Transaction (114),places.sqlite#1: PlacesExpiration.jsm: expire (115) ]

This is an interesting signature; we should file a separate bug for it. Since it looks like many of these timeouts are related to removing orphan icons, there may be a common cause. I don't expect it to be related to QuotaManager or IndexedDB anyway.

Apparently this is hanging on "DELETE FROM moz_pages_w_icons WHERE page_url_hash NOT IN (SELECT url_hash FROM moz_places) OR id NOT IN (SELECT DISTINCT page_id FROM moz_icons_to_pages)"
There may be various reasons:

  1. there is some database corruption
  2. there is an unknown deadlock related to ATTACHed databases
  3. removing orphan icons is just very slow under certain circumstances

Did we go through the usual troubleshooting steps?
First, I'd suggest running the Places Database Integrity check from about:support and posting the resulting log here or in the newly filed bug.
Then check whether things improve after a restart.
Another interesting experiment would be to pretty much drop favicons: with Firefox closed, go to the profile folder and move favicons.sqlite (and favicons.sqlite-wal) elsewhere. Launching Firefox should generate a new one and potentially resolve any unexpected perf issue with it, but it will also lose all the favicons on bookmarks until you revisit them. You can always restore the moved file after the experiment. If that solves the problem, I may be interested in getting more statistics about that db.

But, as I said, I'd suggest moving the PlacesExpiration.jsm investigation to a separate bug. I don't expect these Places crashes to have any relation to DOM Storage or QuotaManager.

Flags: needinfo?(mak)

(In reply to Marco Bonardo [:mak] from comment #17)

But, as I said, I'd suggest moving the PlacesExpiration.jsm investigation to a separate bug. I don't expect these Places crashes to have any relation to DOM Storage or QuotaManager.

Spun off bug 1706298.

Flags: needinfo?(afaninthehouse)

Looking at one of the cache-related hangs, I cannot see any thread doing any work.

The annotation reads:

cache: Managers: 1 (https://aaa.aa: Closing,  )
Intermediate steps:
0.000000s: starting

IIUC, the shutdown sequence in QuotaManager::Shutdown is the following:

  1. StopIdleMaintenance(); (does nothing except for IDB, where it just sets an aborted flag)
  2. Call InitiateShutdownWorkThreads(); for each client type. This writes the "starting" message into the annotations.
    2.a In case of cache this calls synchronously the sequence:
    Client::InitiateShutdownWorkThreads()
    -> CacheQuotaClient::InitiateShutdown()
    -> Manager::Factory::ShutdownAll()
    -> (for each manager) Manager::Shutdown()
    -> (for each Context) Context::CancelAll()
    -> Context::ThreadsafeHandle::AllowToClose()

Here it gets interesting, as we might dispatch dom::cache::Context::ThreadsafeHandle::AllowToCloseOnOwningThread to the owning thread. Interestingly, this code path never seems to be hit in our test automation.

-> If all goes well we then get into Context::ThreadsafeHandle::AllowToCloseOnOwningThread() and nuke our Context reference, which should destroy the Context and, through its destructor, arrive at Factory::Remove(*this);.
-> If this was the last Context, Factory::Remove will destroy the factory singleton and thus unblock the shutdown. This also generates the "Manager removed" annotation.

    2.b It then checks via CacheQuotaClient::IsShutdownCompleted() whether we are already completely shut down.
  3. In case we still need to wait for someone, we enter the SpinEventLoopUntil.

It seems that even after the cache shutdown we still have a living sFactory that prevents the event loop from breaking.

So the doubt is: we dispatch dom::cache::Context::ThreadsafeHandle::AllowToCloseOnOwningThread with MOZ_ALWAYS_SUCCEEDS, but if this dispatch fails, we might end up with a livelock. From our automation it is not even clear whether we ever try to dispatch this message, though.
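The suspected failure mode can be sketched in a small self-contained model (hypothetical types and names; this is not Gecko code, only an illustration of how a silently failed dispatch would leave the shutdown spinner with no way to break):

```cpp
#include <deque>
#include <functional>

// Simplified model of an "owning thread" event queue and the shutdown
// spinner. Dispatch/SpinUntil only mirror the real APIs conceptually.
struct OwningThread {
  std::deque<std::function<void()>> mQueue;
  bool mAcceptingEvents = true;

  bool Dispatch(std::function<void()> aRunnable) {
    if (!mAcceptingEvents) {
      return false;  // late in shutdown, dispatch can fail
    }
    mQueue.push_back(std::move(aRunnable));
    return true;
  }

  void ProcessNextEvent() {
    if (mQueue.empty()) {
      return;
    }
    auto task = std::move(mQueue.front());
    mQueue.pop_front();
    task();
  }
};

// Returns true if shutdown completed; the iteration budget stands in for
// the shutdown-hang timeout.
bool SpinUntilFactoryGone(OwningThread& aThread, bool& aFactoryAlive,
                          int aMaxIterations) {
  for (int i = 0; i < aMaxIterations && aFactoryAlive; ++i) {
    aThread.ProcessNextEvent();
  }
  return !aFactoryAlive;
}

bool SimulateCacheShutdown(bool aDispatchFails) {
  OwningThread owning;
  bool factoryAlive = true;
  owning.mAcceptingEvents = !aDispatchFails;
  // AllowToClose(): dispatch the "AllowToCloseOnOwningThread" step. If this
  // dispatch fails (an assertion macro would only catch it in debug builds),
  // nothing ever clears the factory and the spin loop cannot break.
  owning.Dispatch([&factoryAlive]() { factoryAlive = false; });
  return SpinUntilFactoryGone(owning, factoryAlive, 1000);
}
```

With a successful dispatch the spinner exits immediately; with a failed one it burns the whole budget, which is the livelock shape suspected above.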

Flags: needinfo?(jvarga)

Another possibility could be that we do not free all Contexts, of course. In that sense I was asking myself whether aOldManager->mContext is ever nulled out if there is an aOldManager passed here.

We want to enforce the (assumed) invariant that after we have started QM shutdown we do not create new Contexts and associated threads for cache IO. We do this by generating a runtime error, whose handling might not yet be consistent in all (unexpected) cases. But this is certainly preferable over a shutdown hang crash.

Assignee: nobody → jstutte
Status: NEW → ASSIGNED

This patch is not really meant to fix the hangs.

Keywords: leave-open
Depends on: 1726598
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2ccec5420edf
Forbid the creation of new cache Context objects during quota manager shutdown. r=dom-storage-reviewers,asuth
Depends on: 1726711
Depends on: 1726714
Depends on: 1727497
Depends on: 1727526
Depends on: 1731968
Depends on: 1732274

Added a signature

Crash Signature: [@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ] [@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke ] [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell… → [@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke] [@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke] [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell::…
Depends on: 1734393
Depends on: 1733826

Signature cleanup:

[@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
Appears only on 85.

[@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke ]
Very active

[@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell::ProcessNextNativeEvent ]
rare but active

[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownTimedOut ]
<= 78

[@ mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownWorkThreads::<T>::__invoke ]
inactive

[@ mozilla::dom::`anonymous namespace'::QuotaClient::ShutdownTimedOut ]
inactive

[@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
inactive

[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke ]
inactive

[@ mozilla::dom::quota::QuotaManager::Shutdown::{lambda#1}::operator() const::{lambda#1}::_FUN ]
new but rare
Crash Signature: [@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke] [@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke] [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell::… → [@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke] [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell::ProcessNextNativeEvent] [@ mozilla::dom::quota::QuotaManager::Shutdown::{lambda#1}::operator() const…
Depends on: 1740893

This is still the #1 overall non-OOM topcrash on release.

Depends on: 1757144
Depends on: 1757186

This crash report shows a suspicious sequence:

Intermediate steps:
0.047000s: ClearRequestBase: Starting repeated directory removal #0
0.047000s: ClearRequestBase: Completed repeated directory removal #0
0.047000s: ClearRequestBase: Completed deleting files
0.047000s: ClearRequestBase: Starting deleting files
0.063000s: ClearRequestBase: Starting repeated directory removal #0
0.063000s: ClearRequestBase: Completed repeated directory removal #0
0.063000s: ClearRequestBase: Completed deleting files
... (continues until hang) ...

This one

ls: Datastores: 34 (DirectoryLock:1|Connection:1|QuotaObject:1|PrepareDatastoreOps:0|PreparedDatastores:0|Databases:0|ActiveDatabases:0|Origin:https://(removed)=D|PrivateBrowsingId:0|Closed:1, DirectoryLock:1|Connection:1|QuotaObject:1|PrepareDatastoreOps:0|PreparedDatastores:0|Databases:0|ActiveDatabases:0|Origin:https://(removed)=DD|PrivateBrowsingId:0|Closed:1, DirectoryLock:1|Connection:1|QuotaObject:1|PrepareDatastoreOps:0|PreparedDatastores:0|Databases:0|ActiveDatabases:0|Origin:moz-extension://(removed)|PrivateBrowsingId:0|Closed:1)

seems to show un-removed closed LS connections?

When a Datastore is closed we initiate closing of the connection and we release the directory lock after that. We unregister the Datastore object only when the connection is closed, otherwise a new Datastore could try to touch the same database on disk. I think the log from comment 30 falls into the group of "too much work during shutdown". If there are too many open LS connections, it can happen that we won't have a chance to close them all in time. I filed bug 1592593 for this in the past. I think that one could be a good "first project" bug.

If bug 1592593 doesn't help enough, we might want to investigate adding more threads for I/O work. There's only one at the moment.

Anyway, I just noticed that we need to update Datastore::Stringify to include mHasLivePrivateDatastore.
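The Datastore lifecycle described above can be modeled with a short sketch (all names here are hypothetical stand-ins, not the real LocalStorage classes): unregistration is deferred until the connection reports it is closed, so a second Datastore for the same origin cannot be created in between.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical registry modeling the invariant: a Datastore stays
// registered while its connection is still closing, so no new Datastore
// can touch the same database on disk in the meantime.
class DatastoreRegistry {
 public:
  // Returns false if a live Datastore already exists for this origin.
  bool Register(const std::string& aOrigin) {
    return mLive.insert({aOrigin, true}).second;
  }

  // Close() only *initiates* closing; the returned callback stands in for
  // the async "connection closed" notification, and only it unregisters.
  std::function<void()> InitiateClose(const std::string& aOrigin) {
    return [this, aOrigin]() { mLive.erase(aOrigin); };
  }

  bool IsRegistered(const std::string& aOrigin) const {
    return mLive.count(aOrigin) != 0;
  }

 private:
  std::map<std::string, bool> mLive;
};
```

During shutdown, many datastores can sit in the "closing but still registered" state at once, which matches the "too much work during shutdown" picture above.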

Flags: needinfo?(jvarga)

(In reply to Jens Stutte [:jstutte] from comment #29)

This crash report shows a suspicious sequence:

Intermediate steps:
0.047000s: ClearRequestBase: Starting repeated directory removal #0
0.047000s: ClearRequestBase: Completed repeated directory removal #0
0.047000s: ClearRequestBase: Completed deleting files
0.047000s: ClearRequestBase: Starting deleting files
0.063000s: ClearRequestBase: Starting repeated directory removal #0
0.063000s: ClearRequestBase: Completed repeated directory removal #0
0.063000s: ClearRequestBase: Completed deleting files
... (continues until hang) ...

Since there's no info about the origin in the message, we can't say whether this happens for the same origin directory each time or for different origin directories.
The latter would be OK.

The annotations continue to show that we have two main causes (together accounting for ~90% of the cases):

  1. cache: Managers: 1 (https://aaaaaaaaaaaaaa.aaaaaaaaaa.aaa: Closing, ) - bug 1727526 did not yield any conclusion, yet
  2. QM: 1 normal origin ops pending - this one merits more investigation

In order to investigate 2.: it seems there is no way to know which operation is pending here? It would probably make a huge difference whether it can be randomly any kind of operation or whether we see only very few types of operations here.

Depends on: 1759732

(In reply to Jens Stutte [:jstutte] from comment #33)

  1. QM: 1 normal origin ops pending - this one merits more investigation

So the first ever crash with the new annotation shows:

  • that the annotation works!
  • that we immediately hit a case with a very high number (35947) of pending ops
  • where all those pending ops are dom::quota::ClearOriginOp

Now it seems hard to believe (though not impossible) that we really want to delete ~36K distinct origins here.

I'm not that surprised that a shutdown hang is caused by trying to do too much I/O, but 36K is quite a lot.
Bug 1733107 should help a bit.
(The de-duplication is probably a good idea).

There's also the repeated directory removal in the log, so it seems origin directories can't be deleted during shutdown, and the number of origins then grows and grows.

Or some third-party app interferes with us and slows down the clearing.

(In reply to Jan Varga [:janv] from comment #35)

(The de-duplication is probably a good idea).

Only if we also reduce the verbosity of ClearRequestBase::DeleteFiles, which is what actually brings us into the retry loop (so it is good to have this verbosity!), where we seem to spend up to 10 times 200ms. That seems like a bad idea if all removals keep failing, as appears to be the case here: it is easy to spend a minute here and hit the timeout while the computer is actually idle.

This comment says it already.

What if we just dispatched the same ClearRequestBase op another time if we were not able to remove everything, and moved on? We would need a retry count parameter to be able to break out, though.

I think we shouldn't repeat the clearing during shutdown. We should mark origin directories as invalid by putting marker files into them and then during storage initialization, remove such origin directories. I think there's a patch somewhere for the markers made by Tom. We might want to revive it ...

So I think we need to change our logic here. Instead of blindly retrying (and blocking), we should track during deletion, based on the errors we get, the severity of the situation:

Errors that can be safely ignored (no retry):
NS_ERROR_FILE_NOT_FOUND (already gone?)

Errors that might merit a retry:
NS_ERROR_FILE_IS_LOCKED (could go away)
NS_ERROR_FILE_DIR_NOT_EMPTY (something has been added while we were operating?)
NS_ERROR_FILE_DEVICE_TEMPORARY_FAILURE (the name says it all)
NS_ERROR_MODULE_WIN32 (mostly because we dunno what's in there)

All other errors do not merit a retry:
NS_ERROR_FILE_ACCESS_DENIED (this might be borderline, but I don't think it is ever a temporary thing without user or third-party SW interaction)
NS_ERROR_FILE_READ_ONLY (again, will not change)
NS_ERROR_FILE_NAME_TOO_LONG (will never change)
NS_ERROR_FILE_NOT_DIRECTORY (will never change)
NS_ERROR_FILE_FS_CORRUPTED (we should stop torturing the FS)
NS_ERROR_FILE_DEVICE_FAILURE (we should stop torturing the FS)
NS_ERROR_FILE_INVALID_PATH (will never change)

Errors that are not supposed to happen in this context at all, and still do not merit a retry:
NS_ERROR_OUT_OF_MEMORY
NS_ERROR_FILE_NO_DEVICE_SPACE
NS_ERROR_FILE_ALREADY_EXISTS
NS_ERROR_FILE_INVALID_HANDLE

If then we really want to retry, we should do so by dispatching a new event rather than in a blocking loop with sleep.
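The triage above, combined with the re-dispatch idea, could look roughly like this (the FileError values are local stand-ins for the listed nsresult codes, and the event queue and removal callback are hypothetical, not Gecko APIs):

```cpp
#include <deque>
#include <functional>

// Local stand-ins for the nsresult codes listed above.
enum class FileError {
  Ok,
  NotFound,                                              // safe to ignore
  IsLocked, DirNotEmpty, DeviceTemporaryFailure, Win32,  // might clear up
  AccessDenied, ReadOnly, NameTooLong, NotDirectory,     // permanent
  FsCorrupted, DeviceFailure, InvalidPath,
  OutOfMemory, NoDeviceSpace, AlreadyExists, InvalidHandle,  // unexpected
};

enum class RemovalAction { Done, Retry, GiveUp };

RemovalAction ClassifyRemovalError(FileError aError) {
  switch (aError) {
    case FileError::Ok:
    case FileError::NotFound:  // the entry is already gone
      return RemovalAction::Done;
    case FileError::IsLocked:
    case FileError::DirNotEmpty:
    case FileError::DeviceTemporaryFailure:
    case FileError::Win32:  // unknown Windows error, assume transient
      return RemovalAction::Retry;
    default:  // permanent or unexpected: stop touching the filesystem
      return RemovalAction::GiveUp;
  }
}

using EventQueue = std::deque<std::function<void()>>;

// Re-dispatch a removal attempt instead of sleeping in a blocking loop,
// with a retry budget so we can always break out.
void ScheduleRemoval(EventQueue& aQueue, std::function<FileError()> aRemoveOnce,
                     int aRetriesLeft, RemovalAction& aOutcome) {
  aQueue.push_back([&aQueue, aRemoveOnce, aRetriesLeft, &aOutcome]() {
    const RemovalAction action = ClassifyRemovalError(aRemoveOnce());
    if (action == RemovalAction::Retry && aRetriesLeft > 0) {
      ScheduleRemoval(aQueue, aRemoveOnce, aRetriesLeft - 1, aOutcome);
      return;
    }
    // Budget exhausted on a transient error also means giving up.
    aOutcome =
        (action == RemovalAction::Retry) ? RemovalAction::GiveUp : action;
  });
}

// Drain helper to keep the sketch self-contained.
int RunQueue(EventQueue& aQueue) {
  int processed = 0;
  while (!aQueue.empty()) {
    auto task = std::move(aQueue.front());
    aQueue.pop_front();
    task();
    ++processed;
  }
  return processed;
}
```

Between attempts the thread stays free to process other events, which is the key difference from the current blocking sleep loop.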

(In reply to Jan Varga [:janv] from comment #40)

I think we shouldn't repeat the clearing during shutdown. We should mark origin directories as invalid by putting marker files into them and then during storage initialization, remove such origin directories. I think there's a patch somewhere for the markers made by Tom. We might want to revive it ...

OK, that is also an option, of course.

Depends on: 1760302

(In reply to Jens Stutte [:jstutte] from comment #38)

the retry loop (so it is good to have this verbosity!) where we seem to spend up to 10 times 200ms which seems a bad idea if all removals are always failing as seems to be the case here. So it is easy to spend some minute here and hit the timeout while the computer is actually idle.

After having touched the code in bug 1760302, I understood that in the crash from comment 34 we actually did not wait; we just printed superfluous messages because of how the loop was designed. I still think we should do bug 1760302 in general, but this case seemed to be a different one.

Just a heads up that the patch from bug 1682326 (together with D139795 from bug 1757186) will most likely change the signatures here.

Crash Signature: const::{lambda#1}::_FUN] → const::{lambda#1}::_FUN] [@ mozilla::dom::quota::QuotaManager::Shutdown::$::__invoke ] [@ mozilla::dom::quota::QuotaManager::Shutdown::<T>::__invoke ]

So looking at the most recent nightly crashes, I can see:

  1. 24 cases where we do not see any annotation whatsoever. Example: 25785003-d2e6-4727-83ec-0ca7f0220425
  2. 43 cases where we seem to be stuck at ClearRequestBase::DeleteFiles (causing gNormalOriginOps not being cleared).
  3. only 3 cases with other, client-specific reasons

The first seems to be just an error in the way we create the annotations: if we arrive there with gNormalOriginOps unset, we crash unexpectedly before we set the annotations.

Edit: 1. is a regression from bug 1757186.
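A minimal sketch of the annotation ordering problem in 1. (gNormalOriginOps here is a hypothetical stand-in for the real global, not Gecko code): dereferencing the op list before checking it was ever created would crash before any annotation is recorded, while guarding the access keeps a valid annotation in every case.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the lazily created list of pending origin ops.
static std::unique_ptr<std::vector<std::string>> gNormalOriginOps;

std::string AnnotatePendingOps() {
  if (!gNormalOriginOps) {
    // We reached shutdown before the op list was ever created; report
    // zero instead of crashing before the annotation is written.
    return "0 normal origin ops pending";
  }
  return std::to_string(gNormalOriginOps->size()) +
         " normal origin ops pending";
}
```

This mirrors why the 24 annotation-less reports carry no information: the crash preempts the annotation write itself.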

Depends on: 1770451
Depends on: 1770915

The bug is linked to topcrash signatures, which match the following criteria:

  • Top 5 desktop browser crashes on Mac on release
  • Top 20 desktop browser crashes on release
  • Top 20 desktop browser crashes on beta
  • Top 5 desktop browser crashes on Windows on beta
  • Top 5 desktop browser crashes on Windows on release

For more information, please visit auto_nag documentation.

Keywords: topcrash
Severity: normal → S3

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #48)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

For some reason, most of the relevant signatures moved to bug 1592336, but we want to continue the investigation here. So I'm moving them all over here and closing bug 1592336.

Crash Signature: [@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke] [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell::ProcessNextNativeEvent] [@ mozilla::dom::quota::QuotaManager::Shutdown::{lambda#1}::operator() const… → [@ mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator() ] [@ mozilla::dom::quota::QuotaManager::Shutdown::<T>::__invoke ] [@ mozilla::dom::quota::QuotaManager::Shutdown::$::operator() ] [@ mozilla::dom::quota::QuotaManager::Shutdown::$::__invok…
Duplicate of this bug: 1592336

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to topcrash signatures, which match the following criteria:

  • Top 5 desktop browser crashes on Mac on release
  • Top 20 desktop browser crashes on release
  • Top 20 desktop browser crashes on beta
  • Top 5 desktop browser crashes on Windows on beta
  • Top 5 desktop browser crashes on Windows on release

For more information, please visit auto_nag documentation.

Keywords: topcrash

The severity field for this bug is set to S3. However, the following bug duplicate has higher severity:

:jstutte, could you consider increasing the severity of this bug to S2?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)
Severity: S3 → S2
Flags: needinfo?(jstutte)
Priority: P3 → P2
Depends on: 1822498
Crash Signature: mozilla::dom::quota::QuotaManager::Shutdown::$::__invoke ] → mozilla::dom::quota::QuotaManager::Shutdown::$::__invoke ] [@ mozilla::dom::quota::QuotaManager::Observer::Observe ]
Crash Signature: mozilla::dom::quota::QuotaManager::Shutdown::$::__invoke ] [@ mozilla::dom::quota::QuotaManager::Observer::Observe ] → mozilla::dom::quota::QuotaManager::Shutdown::$::__invoke ] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::Observer::Observe ] [@ shutdownhang | kernelbase.dll | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota…

Until bug 1822498 is fixed we should track all signatures here.

Duplicate of this bug: 1542485
Alias: qm-shutdown-hangs
Depends on: 1820823

(In reply to Intermittent Failures Robot from comment #55)

For more details, see:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?bug=1588498&startday=2023-03-27&endday=2023-04-02&tree=all

This specific failure shows we are stuck flushing files from SQLite:

[task 2023-03-28T06:19:16.423Z] 06:19:16     INFO - Thread 34 QuotaManager IO
[task 2023-03-28T06:19:16.423Z] 06:19:16     INFO -  0  ntdll.dll!ZwFlushBuffersFile + 0xc
[task 2023-03-28T06:19:16.423Z] 06:19:16     INFO -      eip = 0x77616a1c    esp = 0x172fed14    ebp = 0x172feda8    ebx = 0x172fedb8
[task 2023-03-28T06:19:16.423Z] 06:19:16     INFO -      esi = 0x172fed7c    edi = 0x00000af4    eax = 0x00000000    ecx = 0x00000000
[task 2023-03-28T06:19:16.423Z] 06:19:16     INFO -      edx = 0x00000000 eflags = 0x00000206
[task 2023-03-28T06:19:16.423Z] 06:19:16     INFO -     Found by: given as instruction pointer in context
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -  1  xul.dll!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,long (*)(void *, _IO_STATUS_BLOCK *) __attribute__((stdcall))>::operator()(void*&, _IO_STATUS_BLOCK*&) const [nsWindowsDllInterceptor.h:aff0b64247011767675227facc902a7968668686 : 150]
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -     Found by: inlining
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -  2  xul.dll!(anonymous namespace)::InterposedNtFlushBuffersFile(void*, _IO_STATUS_BLOCK*) [PoisonIOInterposerWin.cpp:aff0b64247011767675227facc902a7968668686 : 399 + 0x7]
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -      eip = 0x6638080d    esp = 0x172fed18    ebp = 0x172feda8    ebx = 0x172fedb8
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -      esi = 0x172fed7c    edi = 0x00000af4
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -     Found by: call frame info
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -  3  KERNELBASE.dll!FlushFileBuffers + 0x4b
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -      eip = 0x766da95c    esp = 0x172fedb0    ebp = 0x172fedc0    ebx = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -      esi = 0x00000af4    edi = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -     Found by: call frame info
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -  4  nss3.dll!winSync(sqlite3_file*, int) [sqlite3.c:aff0b64247011767675227facc902a7968668686 : 48137 + 0x11]
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -      eip = 0x7185960a    esp = 0x172fedc8    ebp = 0x172fefd0    ebx = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -      esi = 0x00000af4    edi = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -     Found by: call frame info
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -  5  xul.dll!(anonymous namespace)::xSync(sqlite3_file*, int) [TelemetryVFS.cpp:aff0b64247011767675227facc902a7968668686 : 222 + 0x14]
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -      eip = 0x66ecb348    esp = 0x172fefd8    ebp = 0x172ff030    ebx = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -      esi = 0x0000024e    edi = 0x71d56d40
[task 2023-03-28T06:19:16.424Z] 06:19:16     INFO -     Found by: call frame info

I would expect this to become a problem only when there is extremely slow disk I/O, so it is a bit weird to see this in CI. Or am I overlooking something, and is there more to investigate here?

Blocks: 1832861

Jens, these crashes seem to spike in the 114 beta cycle compared to the 113 cycle. Is there anything that could explain this that we could fix? Thanks

Flags: needinfo?(jstutte)

Yeah, comparing the same beta versions from 113 and 114 looks quite bad:

Rank	Version	Count	Share
1	114.0b3	339	35.17 %
2	114.0b2	313	32.47 %
3	114.0b4	125	12.97 %
4	114.0b1	86	8.92 %
5	113.0b4	61	6.33 %
6	113.0b3	23	2.39 %
7	113.0b2	14	1.45 %
8	113.0b1	3	0.31 %

We need to take a closer look for sure.

Flags: needinfo?(jstutte)

Bug 1733107 should help with reducing shutdown hangs; it's currently our top priority to fix that bug and eventually uplift patches to beta.

Adding bug 1733107 to dependencies then. I am tracking bug 1733107 for a 114 beta uplift.

Depends on: 1733107
See Also: → 1833810
Depends on: 1833810
Depends on: 1746773
Whiteboard: [tbird crash]

The bug is linked to topcrash signatures, which match the following criteria:

  • Top 5 desktop browser crashes on Mac on release
  • Top 20 desktop browser crashes on release (startup)
  • Top 20 desktop browser crashes on beta
  • Top 10 desktop browser crashes on nightly
  • Top 5 desktop browser crashes on Windows on beta
  • Top 5 desktop browser crashes on Windows on release (startup)

For more information, please visit BugBot documentation.

There is a very weird upwards tendency for ESR crashes only, it seems. I cannot see any correlation to the corresponding normal versions, though. And even inside the 102.x series it is increasing over time, where I would not expect any functional changes landed that could influence here. And it happens also for 115.x, while the release channel seems to have a slow but steady downwards tendency all the time.

(In reply to Jens Stutte [:jstutte] from comment #63)

There is a very weird upwards tendency for ESR crashes only, it seems. I cannot see any correlation to the corresponding normal versions, though. And even inside the 102.x series it is increasing over time, where I would not expect any functional changes landed that could influence here. And it happens also for 115.x, while the release channel seems to have a slow but steady downwards tendency all the time.

That's bug 1849352, or rather, that bug would help to "mitigate" this. The upside is that there is no real increase affecting our users.

See Also: → 1849352

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

Depends on: 1857858
Depends on: 1858067
Depends on: 1858625
See Also: → 1872960
Depends on: 1872960
Duplicate of this bug: 1872467
See Also: 1872960
Depends on: 1875036
No longer depends on: 1875036
See Also: → 1875036
Depends on: 1887609