1687685 - Result of vacuuming is not checked in CreateOrMigrateSchema

Reporter

Description

•

5 years ago

The result of executing the VACUUM operation is not checked at https://searchfox.org/mozilla-release/rev/4585f47ffd5244d276b33420fc30cd6647a0a37b/dom/cache/DBSchema.cpp#563-566. It can't simply be added, since then dom/cache/test/xpcshell/test_migration.js actually fails. It must be checked if this is indeed a bad test case, or if there is a general problem with schema migration.

Simon Giesecke [:sg] [he/him]

Reporter

Comment 1

•

5 years ago

Tom, can you give this a look? I am not sure how critical that is.

Flags: needinfo?(ttung)

Tom Tung [:tt, :ttung]

Assignee

Comment 2

•

5 years ago

•

Edited

Attached file log for test_migration.js — Details

From [this comment](https://searchfox.org/mozilla-release/rev/435433ec58bdce688e2af89f808db4e38383d026/dom/cache/DBSchema.cpp#480-483) and the general purpose of `VACUUM`, it's used to rebuild the database file and to repack it into a minimum of disk space. After the code is only run when the schema is mismatched and we haven't introduced a new cache database upgraded since FF 57. So, I think it's bad that this wasn't caught but it's not critical. The attachment is the log when I enabled the "mozStorage:5" and ran the test. I didn't find something suspicious based on "A VACUUM will fail if there is an open transaction on the database connection that is attempting to run the VACUUM. Unfinalized SQL statements typically hold a read transaction open, so the VACUUM might fail if there are unfinalized SQL statements on the same connection. VACUUM (but not VACUUM INTO) is a write operation and so if another database connection is holding a lock that prevents writes, then the VACUUM will fail. " in (https://www.sqlite.org/lang_vacuum.html) I will find another time to dig into this more.

Tom Tung [:tt, :ttung]

Assignee

Updated

•

5 years ago

Assignee: nobody → ttung

Flags: needinfo?(ttung)

Simon Giesecke [:sg] [he/him]

Reporter

Comment 3

•

5 years ago

This probably brings up the question again (I think we discussed it for IndexedDB at some point, but the case for IndexedDB might be a bit different) if we need to support migrating such old profiles at all, or if we could just wipe the cache when we really encounter an old profile. That seems to be better than corrupting the profile...

Andrew Sutherland [:asuth] (he/him)

Comment 4

•

5 years ago

The currently oldest supported version of Firefox per the top row of the "Past Branch Dates" in https://wiki.mozilla.org/Release_Management/Calendar is ESR78. But note that this is if we're talking about a user downloading a fresh install of Firefox. The auto-update mechanism has a concept of Watershed releases which are updates the user must pass through for migration purposes. That is, if you have a very old version of Firefox, the auto-updater won't directly give you the current release version, it will go [watershed 43, watershed 47, watershed 57, watershed 72]. The telemetry on this which is also where I'm finding the set of watershed versions is https://telemetry.mozilla.org/update-orphaning/.

It seems extremely reasonable for us to only support upgrades from, say, ESR68 from here on out since that would also cover the v72 watershed.

And as mentioned somewhat on slack, we want to be clearing the QuotaManager origin, not just Cache API storage. This is both something that's notionally mandated by spec (and in theory would include LocalStorage and cookies, but I think we'd explicitly want to avoid cookies) and something that is likely important to avoid breakage as sites will both in practice and due to spec assume that origin storage will be cleared coherently for these more advanced forms of storage.

Andrew Sutherland [:asuth] (he/him)

Comment 5

•

5 years ago

And in terms of the implication of that, reviewing https://wiki.mozilla.org/DOM/QuotaManager/SchemaChanges (which unfortunately has ugly formatting resulting from the migration from etherpad not going great), this might suggest that if we see a QuotaManager storage.sqlite version that's not v2.1, that we delete the entirety of storage.sqlite, PROFILE/storage/ and any legacy PROFILE/indexeddb/ directory.

There's a minor edge-case there that we had the v3/v2.1 downgrade hack. It could make sense to treat v3 as explicitly broken/to be cleared since anyone past v57 shouldn't have a v3 QM.

Simon Giesecke [:sg] [he/him]

Reporter

Comment 6

•

5 years ago

Thanks for providing the reference to the watershed versions. I wasn't aware of that, but that's very useful and indeed seems to imply that we cannot hit the upgrade paths from pre-72/ESR68 versions. Which for the cache means that no upgrade paths are needed at all. Rather, as you suggested, we should just clear the QM origin in such a case. Probably a separate bug should be filed for that, and this can then be closed as WONTFIX.

Other quota clients and the QM scheme itself should be covered separately, I think.

Jan Varga [:janv]

Comment 7

•

5 years ago

•

Edited

Ok, but why we are still seeing very old QM storage upgrades for FF 84 ?
https://sql.telemetry.mozilla.org/queries/72503/source?p_version=Release%2084

Jan Varga [:janv]

Comment 8

•

5 years ago

If some users restore their profiles from an old backup, we should at least notify them (via a deprecation process) that such profiles are no longer supported.

Simon Giesecke [:sg] [he/him]

Reporter

Comment 9

•

5 years ago

(In reply to Jan Varga [:janv] from comment #7)

Ok, but why we are still seeing very old QM storage upgrades for FF 84 ?
https://sql.telemetry.mozilla.org/queries/72503/source?p_version=Release%2084

Maybe that's happening in some automation setting, which involves restoration of an ancient profile as you mentiong in comment 8?

Jan Varga [:janv]

Comment 10

•

5 years ago

Ok, so first of all, we need to know if automation could be involved. All our packaged profiles which are used for testing of the upgrades are used by xpcshell tests. I would be really surprised if we learned that xpcshell tests send the telemetry.

Jan Varga [:janv]

Comment 11

•

5 years ago

Oh, you probably didn't mean our automation (treeherder).

Simon Giesecke [:sg] [he/him]

Reporter

Comment 12

•

5 years ago

Back to the question of the result of vacuuming: in IndexedDB we are ignoring errors during vacuuming at https://searchfox.org/mozilla-central/source/dom/indexedDB/ActorsParent.cpp#14219, but we propagate errors at https://searchfox.org/mozilla-central/source/dom/indexedDB/ActorsParent.cpp#1029

Simon Giesecke [:sg] [he/him]

Reporter

Comment 13

•

5 years ago

(In reply to Jan Varga [:janv] from comment #11)

Oh, you probably didn't mean our automation (treeherder).

Right, I wasn't thinking of our automation, but someone else's, using a build from one of our channels. I hope that doesn't submit anything to telemetry, that would be very confusing. If it did, it would need to be an extra try/autoland channel.

Andrew Sutherland [:asuth] (he/him)

Comment 14

•

5 years ago

(In reply to Jan Varga [:janv] from comment #8)

If some users restore their profiles from an old backup, we should at least notify them (via a deprecation process) that such profiles are no longer supported.

The existing product decision for critical data in a user's profile as established by Refresh Firefox does not include QuotaManager-managed data or LocalStorage. (This is also something we push via URL bar interventions and other mechanisms.) Which is to say, the profiles are supported and it's already policy that QuotaManager-managed storage is acceptable data to triage.

I'm not sure exactly what you mean by deprecation process. If you're proposing some kind of UI that's exposed to users, then it seems unlikely to get product and UX sign-off, especially if we expect that most of these profiles involve some kind of test automation relying on historical snapshotted profiles that are restored either in their entirety or partially. If you're talking about proposing a release note (via the "relnote-firefox" flag and generally following the docs in https://wiki.mozilla.org/Release_Management/Release_Notes), then that does make sense to me. If you're proposing intentionally supporting QM releases that pre-date Firefox ESR68 (which means v2.2) or watershed 72 then I don't think that makes sense without an explicit user story explicitly signed off on by product management that recognizes the complexity trade-offs.

Andrew Sutherland [:asuth] (he/him)

Comment 15

•

5 years ago

•

Edited

So I understand the general situation with VACUUM to be this:

For the Cache API, this failure is currently expected due to how we handle schema migration for existing databases which I believe ends up being buggy.
- In migrations we use ALTER TABLE to add the column and set it to be NOT NULL with a DEFAULT value. But this explicitly does not modify the contents of the table and I think we were assuming it did.
  - We need to be running an UPDATE command here to actually be making the DEFAULT value the one we expect.
  - Per SQLite docs on DEFAULT is only evaluated at insertion time.
- When rewriteSchema is set to true, after we perform the migrations, we clobber the schema to no longer have the DEFAULT via RewriteEntriesSchema. This is what will cause VACUUM to fail as we've now enacted constraints that will not hold.
In all other cases, VACUUM potentially will be detecting corruption of the database because it ends up looking at every page and therefore effectively doing an integrity check both for file format errors as well and constraint errors. We should be handling this corruption and deleting the database.
- We probably want to ensure we have an exception for use of sqlite3_interrupt as we are planning to use that to speed up shutdown, I believe.
- It likely also makes sense to specially detect the NS_ERROR_STORAGE_CONSTRAINT here to ensure it's surfaced to telemetry as an indication of a potential application logic error. It might make sense to not delete the database unless we see the storage constraints at runtime from non-VACUUM purposes.
  - In this case we might want to run a PRAGMA integrity_check which produces a bounded set of string rows that describe the constraint errors and which could be returned via telemetry events.

Jan Varga [:janv]

Comment 16

•

5 years ago

I meant a release note.

especially if we expect that most of these profiles involve some kind of test automation relying on historical snapshotted profiles that are restored either in their entirety or partially.

Well, that's only our expectation. I already mentioned this on slack:
FF 84 has 100% storage initialization success rate and 99.70% temporary storage initialization success rate. There are 534,714 successful UpgradeStorageFrom0_0To1_0 (which is quite old upgrade). If the upgrade is successful it will no longer be invoked during next session. This doesn't necessary mean that there are 0.5M affected users. In theory, every session could start with a restored old profile. So we probably need to write a new query that involves client_id.
As I said, I'm not opposed to the idea of getting rid of some old upgrades, but we need to understand the data better and make the decision once we are confident we understand it well.

Jan Varga [:janv]

Comment 17

•

5 years ago

We could implement an intermediate step regarding QM storage upgrades and improving the success rate even more. If the storage initialization fails in one of the upgrades for storage version < 2.2, then entire storage/ directory and storage.sqlite would be deleted.

Andrew Sutherland [:asuth] (he/him)

Comment 18

•

5 years ago

Note that we will generate a new client id if it's not possible to load the id from PROFILE/datareporting/state.json and that logic seems to operate independently from the cached id in a preference although I haven't dug into the whole control flow there.

Here's the selenium webdriver Firefox profile logic in particular this logic.

Andrew Sutherland [:asuth] (he/him)

Comment 19

•

5 years ago

•

Edited

(In reply to Jan Varga [:janv] from comment #17)

We could implement an intermediate step regarding QM storage upgrades and improving the success rate even more. If the storage initialization fails in one of the upgrades for storage version < 2.2, then entire storage/ directory and storage.sqlite would be deleted.

The v2.3 schema was introduced in Firefox 70 in bug 1563023. I think it'd be appropriate to wipe storage in any all-or-nothing version upgrade failure given that v70 pre-dates v72 for non-ESR and we would expect ESR68 to have already upgraded to ESR78. It would then be a subsequent enhancement to remove support for older upgrade versions and have them move to using the same storage{.sqlite} clearing mechanism.

In the end state, we'd end up with the startup check deciding on versions sorta like:

Too Old: The storage is just too old, clear and start from scratch.
Best Effort: All or Nothing: The storage is old but we'll still try and update the data, but if we can't, we'll clear and start from scratch.
Current Migration: Clear Broken Origins: For all new upgrades from here on out on Beta and Release, we'll clear origins that fail to upgrade.
Nightly Active Development: Retain Broken Origins: For recently landed upgrades on nightly perhaps we initially mark the upgrade as one that will not clear the origin, instead letting us ask for help investigating from people. This could perhaps trigger the tab to open up about:quotamanager at most once (ex: guarded by a pref that gets set to the new version upgrade that broke). The idea here is that a newly landed upgrade shouldn't clear everyone's data, instead giving the opportunity to back things out and/or address the problems. This then advances to the next situation:
Nightly Stabilized: Clear Broken Origins: After we think we've successfully landed the upgrade on nightly or we're down to a long tail of errors, we land a minor change to switch to clearing the broken origins, but we still have the detailed _TRY telemetry logging so the problems are known.

I understand Jan's proposal above to be that we keep everything in "Best Effort" for now and basically nothing in "Too Old", but I think my general argument is that it's appropriate for binary size, general complexity, and general sanity reasons to start moving things into "Too Old" once we think the total clearing logic is sound.

Tom Tung [:tt, :ttung]

Assignee

Comment 20

•

4 years ago

Attached file Bug 1687685 - Use an empty string as default value for request_integrity rather than NULL; — Details

Tom Tung [:tt, :ttung]

Assignee

Comment 21

•

4 years ago

Attached file Bug 1687685 - Collect the number of contraint errors when executing vaccum per reason; — Details

Depends on D107237

Phabricator Automation

Updated

•

4 years ago

Attachment #9206948 - Attachment description: Bug 1687685 - Use empty string as default value for request_integrity rather than NULL; → Bug 1687685 - Use an empty string as default value for request_integrity rather than NULL;

Tom Tung [:tt, :ttung]

Assignee

Updated

•

4 years ago

Blocks: 1697730

Tom Tung [:tt, :ttung]

Assignee

Comment 22

•

4 years ago

Attached file bug-1687685.md — Details

Attachment #9209052 - Flags: data-review?(chutten)

Chris H-C :chutten

Comment 23

•

4 years ago

Comment on attachment 9209052 [details]
bug-1687685.md

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

No. This collection will expire in Firefox 96.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

Yes. The DOM Necko-Worker-Storage team is responsible for renewing or removing the collection before it expires in Firefox 96.

Result: datareview+

Attachment #9209052 - Flags: data-review?(chutten) → data-review+

Tom Tung [:tt, :ttung]

Assignee

Comment 24

•

4 years ago

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cc3dcfb9d2f0c60654b9eb7c03c1e7f6e88f5ea3

Pulsebot

Comment 25

•

4 years ago

Pushed by archaeopteryx@coole-files.de: https://hg.mozilla.org/integration/autoland/rev/bdfa69ff97f2 Use an empty string as default value for request_integrity rather than NULL; r=dom-storage-reviewers,asuth https://hg.mozilla.org/integration/autoland/rev/5a464a241fe6 Collect the number of contraint errors when executing vaccum per reason; r=dom-storage-reviewers,asuth

Cristian Brindusan [:cbrindusan]

Comment 26

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/bdfa69ff97f2
https://hg.mozilla.org/mozilla-central/rev/5a464a241fe6

Status: NEW → RESOLVED

Closed: 4 years ago

status-firefox88: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 88 Branch

telemetry-probes

Updated

•

4 years ago

log for test_migration.js 5 years ago Tom Tung [:tt, :ttung] 23.12 KB, text/plain		Details
Bug 1687685 - Use an empty string as default value for request_integrity rather than NULL; 4 years ago Tom Tung [:tt, :ttung] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1687685 - Collect the number of contraint errors when executing vaccum per reason; 4 years ago Tom Tung [:tt, :ttung] 48 bytes, text/x-phabricator-request		Details \| Review
bug-1687685.md 4 years ago Tom Tung [:tt, :ttung] 2.90 KB, text/plain	chutten : data-review+	Details