Bug 1722777 Comment 14 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

I got a chance to dig into this.  :nrishel reported that this reproed much more consistently with PGO builds, which was a clue.  Focusing on the case where the deletion fails due to `cert9.db` and `key4.db`, this is what is happening.  Due to challenges refcounting NSS objects, [NSS is shutdown as part of XPCOM shutdown] (https://searchfox.org/mozilla-central/rev/684aefab97202faa982dd898f1d99208a46f12b0/xpcom/build/XPCOMInit.cpp#761).  In the "fast shutdown" mode, that shutdown simply doesn't occur.  That means the SQLite DBs aren't closed, and (waves hands) that somehow leaves them blocking the temporary profile directory deletion.  Hence, this bug.  (Although, not the only cause of this bug.)

Here's what I propose for this ticket.  First, we need to fill the hole we've dug:

1.  We need to raise the high-water mark for the number of temporary profile directories immediately.
2.  We need an approach to clean up existing directories incrementally.
3.  If we can use a Windows-only feature like `MoveFileEx` to make sure these get cleaned up at shutdown time, great.  Perhaps let's only do that if we fail to delete the directory after N attempts (say 2 attempts).

Concurrently, we need to stop digging:

4.  Let's get a test establishing a background tasks profile contents highwater mark: Bug 1679443.  This is tricky because what the test task does actually matters in this case; profile contents are created on-demand.  But we can at least perform a network request to see the NSS files, say.
5.  Let's disable fast shutdown in background tasks mode, so that we are more likely to have an orderly shutdown.  It's not great that shutdowns could take longer and/or hang, but I think it's a risk worth taking: we don't have a human around to witness and correct problems.  (This doesn't have a ticket yet but should get one.)
5.  Let's turn of these NSS databases entirely: Bug 1757252.  We can easily keep this fixed with the tests added by Bug 1679443.
6.  Let's turn off recording Telemetry in background tasks mode (since it won't be submitted): Bug 1757229.  This should squash another large source of this particular problem.
7.  As a bonus, over the weekend I dug into not writing cookie databases to disk and it's pretty straightforward: Bug 1675829.  I'll get a patch up.
I got a chance to dig into this.  :nrishel reported that this reproed much more consistently with PGO builds, which was a clue.  Focusing on the case where the deletion fails due to `cert9.db` and `key4.db`, this is what is happening.  Due to challenges refcounting NSS objects, [NSS is shutdown as part of XPCOM shutdown](https://searchfox.org/mozilla-central/rev/684aefab97202faa982dd898f1d99208a46f12b0/xpcom/build/XPCOMInit.cpp#761).  In the "fast shutdown" mode, that shutdown simply doesn't occur.  That means the SQLite DBs aren't closed, and (waves hands) that somehow leaves them blocking the temporary profile directory deletion.  Hence, this bug.  (Although, not the only cause of this bug.)

Here's what I propose for this ticket.  First, we need to fill the hole we've dug:

1.  We need to raise the high-water mark for the number of temporary profile directories immediately.
2.  We need an approach to clean up existing directories incrementally.
3.  If we can use a Windows-only feature like `MoveFileEx` to make sure these get cleaned up at shutdown time, great.  Perhaps let's only do that if we fail to delete the directory after N attempts (say 2 attempts).

Concurrently, we need to stop digging:

4.  Let's get a test establishing a background tasks profile contents highwater mark: Bug 1679443.  This is tricky because what the test task does actually matters in this case; profile contents are created on-demand.  But we can at least perform a network request to see the NSS files, say.
5.  Let's disable fast shutdown in background tasks mode, so that we are more likely to have an orderly shutdown.  It's not great that shutdowns could take longer and/or hang, but I think it's a risk worth taking: we don't have a human around to witness and correct problems.  (This doesn't have a ticket yet but should get one.)
5.  Let's turn of these NSS databases entirely: Bug 1757252.  We can easily keep this fixed with the tests added by Bug 1679443.
6.  Let's turn off recording Telemetry in background tasks mode (since it won't be submitted): Bug 1757229.  This should squash another large source of this particular problem.
7.  As a bonus, over the weekend I dug into not writing cookie databases to disk and it's pretty straightforward: Bug 1675829.  I'll get a patch up.
I got a chance to dig into this.  :nrishel reported that this reproed much more consistently with PGO builds, which was a clue.  Focusing on the case where the deletion fails due to `cert9.db` and `key4.db`, this is what is happening.  Due to challenges refcounting NSS objects, [NSS is shutdown as part of XPCOM shutdown](https://searchfox.org/mozilla-central/rev/684aefab97202faa982dd898f1d99208a46f12b0/xpcom/build/XPCOMInit.cpp#761).  In the "fast shutdown" mode, that shutdown simply doesn't occur.    That means the SQLite DBs aren't closed, and (waves hands) that somehow leaves them blocking the temporary profile directory deletion.  Hence, this bug.  (Although, not the only cause of this bug.)

Fast shutdown mode is [only enabled when `!DEBUG` and `!MOZ_PROFILE_GENERATE`](https://searchfox.org/mozilla-central/rev/684aefab97202faa982dd898f1d99208a46f12b0/modules/libpref/init/StaticPrefList.yaml#11980), explaining why I don't see this locally.  If we were to test against a task that makes network requests in [test_backgroundtask_deletes_profile.js](https://searchfox.org/mozilla-central/rev/684aefab97202faa982dd898f1d99208a46f12b0/toolkit/components/backgroundtasks/tests/xpcshell/test_backgroundtask_deletes_profile.js#12), I think we'd witness this problem with `opt` builds.)

Here's what I propose for this ticket.  First, we need to fill the hole we've dug:

1.  We need to raise the high-water mark for the number of temporary profile directories immediately.
2.  We need an approach to clean up existing directories incrementally.
3.  If we can use a Windows-only feature like `MoveFileEx` to make sure these get cleaned up at shutdown time, great.  Perhaps let's only do that if we fail to delete the directory after N attempts (say 2 attempts).

Concurrently, we need to stop digging:

4.  Let's get a test establishing a background tasks profile contents highwater mark: Bug 1679443.  This is tricky because what the test task does actually matters in this case; profile contents are created on-demand.  But we can at least perform a network request to see the NSS files, say.
5.  Let's disable fast shutdown in background tasks mode, so that we are more likely to have an orderly shutdown.  It's not great that shutdowns could take longer and/or hang, but I think it's a risk worth taking: we don't have a human around to witness and correct problems.  (This doesn't have a ticket yet but should get one.)
5.  Let's turn of these NSS databases entirely: Bug 1757252.  We can easily keep this fixed with the tests added by Bug 1679443.
6.  Let's turn off recording Telemetry in background tasks mode (since it won't be submitted): Bug 1757229.  This should squash another large source of this particular problem.
7.  As a bonus, over the weekend I dug into not writing cookie databases to disk and it's pretty straightforward: Bug 1675829.  I'll get a patch up.

Back to Bug 1722777 Comment 14