Closed Bug 1716291 Opened 3 years ago Closed 3 years ago

xulstore should release its file locks before profile deletion

Categories

(Toolkit :: Storage, defect)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1546838

People

(Reporter: jstutte, Assigned: myk)

References

Details

Attachments

(1 file, 1 obsolete file)

During profile deletion xulstore seems to hold active locks on lock.mdb and data.mdb, see the discussion started on bug 1714573 comment 6.

It seems that we shutdown xulstore only during real FF shutdown at a very late stage.

Additionally we have the singleton Manager = rkv::Manager<SafeModeEnvironment>; which might contribute to rkv staying alive and keeping files locked longer than expected, too.

We want probably a clean shutdown of xulstore during profile-before-change (or at the latest profile-before-change-qm, though this is usally reserved for quota managed storage).

We suspect this to potentially cause the incomplete deletion of profiles which might contribute to frequent warnings we encounter during quota manager's storage initialization about files that are recorded in our DB (left probably on disk) but miss as physical files (because deleted before we hit the xulstore lock).

Hi Myk, moving your ni here, thanks!

Flags: needinfo?(myk)

Ok, I've taken a closer look, and it's strange that xulstore (and extension-store, according to bug 1714573, comment 5) is holding locks on lock.mdb and data.mdb, given that xulstore at least is using the "safe-mode" environment of rkv, which shouldn't be creating those files.

(Those files are created by the "lmdb" environment, which xulstore is no longer using.)

I tried calling Manager.try_close() on the environment at "profile-before-change" anyway, in case there's some hidden dependency on the "lmdb" environment files, but it didn't make a difference.

Then I tried commenting out the calls to Migrator::easy_migrate_lmdb_to_safe_mode() in xulstore/src/statics.rs and kvstore/src/task.rs, and that worked around the issue.

See the attached patch for both sets of changes (a "profile-before-change" handler and commented out calls to the migrator function.)

It isn't clear why, from my reading of the migrator code, but when the migrator tries to open the lmdb environment (in order to migrate it to the safe-mode environment), perhaps that creates/locks those files and then doesn't unlock them.

In any case, if the migrator is the culprit, then this is an issue in rkv (or lmdb) rather than in xulstore/kvstore or the other rkv consumers in Gecko. So a ticket should probably be filed on rkv for further investigation.

I also see that Glean (in glean-core/src/database/mod.rs ) uses Migrator::open_and_migrate_lmdb_to_safe_mode() instead of the "easy" variant of the migration function, and then it explicitly deletes the LMDB files. I wonder if its approach would work around this issue.

Flags: needinfo?(myk)

Then I tried commenting out the calls to Migrator::easy_migrate_lmdb_to_safe_mode() in xulstore/src/statics.rs and kvstore/src/task.rs, and that worked around the issue.

To clarify: by "worked around the issue", I mean that when the two calls to Migrator::easy_migrate_lmdb_to_safe_mode() in xulstore/src/statics.rs and kvstore/src/task.rs are commented out on the changeset that was backed out in bug 1714573 (eaa1eb96adbe), then the browser/components/migration/tests/marionette/test_refresh_firefox.py test stops failing on Windows.

Thanks for the analysis and suggestions how to solve this! Let me see if we can get someone to have a look at rkv here.

Flags: needinfo?(jstutte)

Another "workaround" could be to just entirely remove migration support starting from version 94+ (when support for ESR 78 has ended). We switched to safe mode in version 81. People on the ESR channel would keep getting it until EOL ESR 91.x. People on the normal channel for whom the upgrade did not yet happen were not using their browser since almost 1.5 years and probably profit from a clean start. The only thing that might really break are old profiles stored on purpose and never updated within some CI environment.

WDYT?

Flags: needinfo?(jstutte) → needinfo?(myk)

That seems reasonable to me, although it's more of a product question than a technical one.

Flags: needinfo?(myk)

Hi Mike, can you suggest someone to help with this decision, see comment 6? Thanks!

Flags: needinfo?(mconca)

(In reply to Jens Stutte [:jstutte] from comment #6)

Another "workaround" could be to just entirely remove migration support starting from version 94+ (when support for ESR 78 has ended). We switched to safe mode in version 81. People on the ESR channel would keep getting it until EOL ESR 91.x. People on the normal channel for whom the upgrade did not yet happen were not using their browser since almost 1.5 years and probably profit from a clean start. The only thing that might really break are old profiles stored on purpose and never updated within some CI environment.

WDYT?

From a product standpoint, this makes sense to me.

Flags: needinfo?(mconca)

(In reply to Myk Melez [:myk] [@mykmelez] from comment #7)

That seems reasonable to me, although it's more of a product question than a technical one.

Hi Myk, would you be able to come to this then? Thanks!

Flags: needinfo?(myk)

Hi Jens, I'm not sure exactly what you're asking. Can you elaborate?

Hi Myk, sure, I wanted to ask if it would be up your alley to permanently remove xulstore's support for migration from lmdb to safe mode entirely in one of the next versions. IIUC, this way we would mitigate the problems with lmdb file locks without the need to dive deep into rkv itself right now. See also your comment 3. Thanks!

(In reply to Myk Melez [:myk] [@mykmelez] from comment #3)

Then I tried commenting out the calls to Migrator::easy_migrate_lmdb_to_safe_mode() in xulstore/src/statics.rs and kvstore/src/task.rs, and that worked around the issue.

Thanks for clarifying, Jens. Yes, I think I can contribute the change to xulstore, which should be as simple as commenting out the call to Migrator::easy_migrate_lmdb_to_safe_mode() (perhaps with a comment to explain why we aren't doing it anymore). Do you also want to do the same for kvstore, which produced the same failure in https://bugzilla.mozilla.org/show_bug.cgi?id=1714573#c5 for the extension-store directory?

Flags: needinfo?(myk) → needinfo?(jstutte)

Oh, yes, sure, when I answered I did not realize it were two different things actually. Thanks for pointing it out!

Flags: needinfo?(jstutte) → needinfo?(myk)
Assignee: nobody → myk
Status: NEW → ASSIGNED

Jens, ok, I've gone ahead and taken this bug and submitted a patch that contains the fix as I understand it.

Flags: needinfo?(myk)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b3dc7d77241f
stop migrating xulstore/kvstore from LMDB- to safe mode backing stores; r=jstutte

Backed out for causing bustage on statics.rs

[task 2021-09-10T20:33:45.957Z] 20:33:45     INFO -  error: could not compile `kvstore` due to previous error
[task 2021-09-10T20:33:45.957Z] 20:33:45     INFO -  Caused by:
[task 2021-09-10T20:33:45.960Z] 20:33:45     INFO -    process didn't exit successfully: `CARGO=/builds/worker/fetches/rustc/bin/cargo CARGO_CRATE_NAME=kvstore CARGO_MANIFEST_DIR=/builds/worker/checkouts/gecko/toolkit/components/kvstore CARGO_PKG_AUTHORS='Myk Melez <myk@mykzilla.org>' CARGO_PKG_DESCRIPTION='' CARGO_PKG_HOMEPAGE='' CARGO_PKG_LICENSE='' CARGO_PKG_LICENSE_FILE='' CARGO_PKG_NAME=kvstore CARGO_PKG_REPOSITORY='' CARGO_PKG_VERSION=0.1.0 CARGO_PKG_VERSION_MAJOR=0 CARGO_PKG_VERSION_MINOR=1 CARGO_PKG_VERSION_PATCH=0 CARGO_PKG_VERSION_PRE='' LD_LIBRARY_PATH='/builds/worker/workspace/obj-build/release/deps:/builds/worker/fetches/rustc/lib:/builds/worker/fetches/clang/lib' /builds/worker/fetches/sccache/sccache /builds/worker/fetches/rustc/bin/rustc --crate-name kvstore toolkit/components/kvstore/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C opt-level=2 -C panic=abort -C embed-bitcode=no -C metadata=1fe7876e5ecc6d80 -C extra-filename=-1fe7876e5ecc6d80 --out-dir /builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps --target x86_64-apple-darwin -C linker=/builds/worker/checkouts/gecko/build/cargo-linker -L dependency=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps -L dependency=/builds/worker/workspace/obj-build/release/deps --extern atomic_refcell=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/libatomic_refcell-e8e5a941fdc9f7ad.rmeta --extern crossbeam_utils=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/libcrossbeam_utils-7837ea6727699d6d.rmeta --extern cstr=/builds/worker/workspace/obj-build/release/deps/libcstr-efb78321bac38fe8.so --extern lazy_static=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/liblazy_static-3239a1b5ebf3a821.rmeta --extern libc=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/liblibc-0a0153e43dbc9007.rmeta --extern log=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/liblog-6e7f16c593076ec2.rmeta --extern moz_task=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/libmoz_task-889df964a01a5786.rmeta --extern nserror=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/libnserror-e79caaf3f5a1b22b.rmeta --extern nsstring=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/libnsstring-8c70a59d6dbb24da.rmeta --extern rkv=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/librkv-174ecb35bc54d169.rmeta --extern storage_variant=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/libstorage_variant-f459df77be84476a.rmeta --extern tempfile=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/libtempfile-7ad77e6ec5046783.rmeta --extern thin_vec=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/libthin_vec-957fd7efdb8cb1d4.rmeta --extern thiserror=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/libthiserror-13be0ebb0d96adcb.rmeta --extern xpcom=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/deps/libxpcom-0962a2b0990f511f.rmeta -C debuginfo=2 -C force-frame-pointers=yes -Dwarnings -Aproc-macro-back-compat -Cembed-bitcode=yes -C codegen-units=1 -L native=/builds/worker/workspace/obj-build/x86_64-apple-darwin/release/build/lmdb-rkv-sys-8b7841f2e0211e8c/out` (exit status: 1)
[task 2021-09-10T20:33:45.960Z] 20:33:45     INFO -  warning: build failed, waiting for other jobs to finish...
[task 2021-09-10T20:33:45.960Z] 20:33:45     INFO -  error: unused import: `Migrator`
[task 2021-09-10T20:33:45.960Z] 20:33:45     INFO -    --> toolkit/components/xulstore/src/statics.rs:14:11
[task 2021-09-10T20:33:45.960Z] 20:33:45     INFO -     |
[task 2021-09-10T20:33:45.960Z] 20:33:45     INFO -  14 | use rkv::{Migrator, StoreOptions, Value};
[task 2021-09-10T20:33:45.960Z] 20:33:45     INFO -     |           ^^^^^^^^
[task 2021-09-10T20:33:45.960Z] 20:33:45     INFO -     |
[task 2021-09-10T20:33:45.960Z] 20:33:45     INFO -     = note: `-D unused-imports` implied by `-D warnings`
[task 2021-09-10T20:33:45.961Z] 20:33:45     INFO -  error: build failed
[task 2021-09-10T20:33:45.961Z] 20:33:45    ERROR -  gmake[4]: *** [/builds/worker/checkouts/gecko/config/makefiles/rust.mk:404: force-cargo-library-build] Error 101
Flags: needinfo?(myk)

I'd forgotten to remove the unused Migrator imports from the two files that no longer call Migrator::easy_migrate_lmdb_to_safe_mode(). I've updated the patch to remove those now. It should be ready to land again.

Flags: needinfo?(myk)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9cbf4fe3f852
stop migrating xulstore/kvstore from LMDB- to safe mode backing stores; r=jstutte
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 94 Branch
Status: RESOLVED → REOPENED
Flags: needinfo?(myk)
Resolution: FIXED → ---
Target Milestone: 94 Branch → ---

I think the failing test is testing the wrong thing for a while but we just removed the file now entirely.

Ok, so this bug in the end is a duplicate of bug 1546838 which contains a reviewed patch that contains the exact same changes we propose here plus some cleanup we were missing. I think we should just rebase and land that other patch that was meant to land after some time but then got forgotten?

See Also: → 1546838
Attachment #9239978 - Attachment is obsolete: true
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → DUPLICATE

Thank you for your efforts here, Myk! It helped a lot to bring us on the right track.

Flags: needinfo?(myk)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: