Closed Bug 1805427 Opened 2 years ago Closed 2 years ago

Update to Glean SDK 52.0.0, glean_parser 6.4.0

Categories

(Toolkit :: Telemetry, task, P2)

task

Tracking

()

RESOLVED FIXED
111 Branch
Tracking Status
firefox111 --- fixed

People

(Reporter: chutten, Assigned: chutten)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

Following the directions (and glean_parser's, too) update to Glean SDK v52.0.0 and glean_parser v6.4.0

This ought to help unblock bug 1790872 by using CowString for labels. Might have interactions with bug 1804359 which has a Glean SDK update in it. Requires updates to t/c/g/build_scripts to handle CowStrings properly and to have register_pings appropriately register only the pings associated with the app_id being init (see bug 1804915). And probably some other stuff too, we'll see : )

This also unblocks bug 1798919

See Also: → 1798919

Depends on D164603

FOG gets to learn which app ids register which pings, allowing it to tell the
Glean SDK to trim all data not associated with those pings.

This will keep other app_ids' custom pings' events from growing without bound.

Alas, this does not come with tests as the trim behaviour is very internal to
the SDK. As such we'll have to rely on the SDK's tests.

Depends on D164604

Duplicate of this bug: 1804359
Blocks: 1793784
Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/39a8c4bff42f Update glean_parser to 6.4.0 r=janerik https://hg.mozilla.org/integration/autoland/rev/b1c44a7274c5 Update Glean to 52.0.0, uuid to 1.0 r=glandium,janerik,webdriver-reviewers,supply-chain-reviewers https://hg.mozilla.org/integration/autoland/rev/9308ede8e265 Turn on Glean's trim_data_to_registered_pings r=janerik https://hg.mozilla.org/integration/autoland/rev/9231b82f6a8b Take advantage of new Cow<str> labels in Glean r=janerik

Backed out for causing xpc failures in toolkit/components/extensions/test/xpcshell/test_ext_scripting_persistAcrossSessions.js

Backout link: https://hg.mozilla.org/integration/autoland/rev/5955c9f254723ec95c130c4cea747fbf230f5c1f

Push with failures

Failure log

Flags: needinfo?(chutten)

That's an odd place for this to be causing problems. What're the chances this'll reproduce locally on Linux...

Flags: needinfo?(chutten)

Naturally, it doesn't reproduce locally on my Linux.

[task 2022-12-15T16:21:39.696Z] 16:21:39 WARNING - TEST-UNEXPECTED-FAIL | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_scripting_persistAcrossSessions.js | test_persisted_scripts_cleared_on_addon_updates - [test_persisted_scripts_cleared_on_addon_updates : 266] A promise chain failed to handle a rejection: store error: IoError(Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." }) - stack: (No stack available.)

"No stack available." isn't particularly helpful here. Let's see about test_ext_scripting_persistAcrossSessions.js:266... that's a weird line to have an error on. It's just a param in a dict. What's going on there?

Now that I'm on a different tip of tree, let's see what try makes of it: https://treeherder.mozilla.org/jobs?repo=try&revision=73429d174c8451817da6617e2a76a2e01a42e29d

Wait a minute, the test it's referencing (test_persisted_scripts_cleared_on_addon_updates) isn't on line 266 at all. It's at the bottom of the file.

Even in the version of that file that was on autoland at the time that's the wrong line number for that test.

Weird. I'm very confused.

Let's take a different angle. What's the error. store error: IoError(Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." }). Looks Rust-ish. Ah it's from rkv (or kvstore, or xulstore). I suppose then it could be Glean-related: we do use rkv. Seems a little unlikely, but who knows.... oh wait, if it is rkv then this is a migration error. To my knowledge Glean doesn't migrate on Firefox Desktop: we've been off LMDB since November 2020.

This is suggesting to me it isn't Glean, so maybe it's a different rkv consumer or it's kvstore or xulstore related?

I'm running low on threads to pull. Hopefully try gets back to me soonish.

Is there some data that might be in the profile? I'm not familiar with this particular test but I encountered errors in the past with tests that pulled the profile from under the rest of Firefox machinery, especially asynchronous machinery.

It's Glean, everything involves data (And the SDK does store stuff in the profile) : )

A read of the test suggests it isn't hitting disk anywhere near where the Glean SDK is operating, though. Far more likely to my mind is the possibility that the ExtensionScriptStore under test's backing kvstore is the thing returning the error than there is anything Glean's doing.

try confirms it's borked on Windows. As such, I've been spending my morning trying to get a Windows build to compile (gosh, there sure are a lot of warnings, and misuses of , operators inside asserts mean I might not be allowed to compile debug). Fingers crossed this actually compiles for me soon. I'd like to figure this out and get this landed in as early in this Fx110 cycle as possible.

I have confirmed this is kvstore by logging which rkv wrapper is throwing. I've also managed to make this fail exactly once on my Linux machine, so I don't think this is Windows-specific (just Windows-likely).

Looking at the suspects in the code changes, I've landed on a possibility: The rkv update contains this change which writes to a temporary file then renames it (for atomicity). It could be that the rename is trying to happen on a file that, despite being created the line before, has been removed.

I'm not sure by what means which file may be being removed, though. If we can rely on the log order, the error properly happens just before this line of the test and sometime after this line

With a guess about where this is happening can I figure out what might be happening... (an hour later) No. No I cannot. The code gets too complex for an easy investigation by an outsider.

Luca, can you help me out here? There is a problem of (I think) multiple simultaneous accesses to the ExtensionScriptingStore kvstore in test_ext_scripting_persistAcrossSessions.js (perhaps between this line and this line). It mostly manifests on Windows, but I've gotten it to happen on Linux, too. Any ideas about whether this test is capable of simultaneous commits to kvstore?

Flags: needinfo?(lgreco)

If all we care about is a fix or confirming the suspicion, I suppose I could try adding a salt to the tmp file's name and seeing if that passes the test. We wouldn't want to ship it (failing to rename the file would result in filling up the data dir with unlimited numbers of temp files), but if works we'll at least know it's simultaneous access.

(In reply to Chris H-C :chutten from comment #16)

If all we care about is a fix or confirming the suspicion, I suppose I could try adding a salt to the tmp file's name and seeing if that passes the test. We wouldn't want to ship it (failing to rename the file would result in filling up the data dir with unlimited numbers of temp files), but if works we'll at least know it's simultaneous access.

Ugh. Simultaneous access would mean it is probably broken in some very hidden way in the old way as well (overriding data in some cases).
(The "safe mode" is far less safe than I would like)

(In reply to Chris H-C :chutten from comment #16)

If all we care about is a fix or confirming the suspicion, I suppose I could try adding a salt to the tmp file's name and seeing if that passes the test. We wouldn't want to ship it (failing to rename the file would result in filling up the data dir with unlimited numbers of temp files), but if works we'll at least know it's simultaneous access.

I think that this part has been already figure out, but 266 is the line from PromiseTestUtils.sys.mjs where the uncaught rejection has been originated (and not the line from the test_ext_scripting_presistAcrossSession.js test file):

[task 2022-12-15T16:21:39.697Z] 16:21:39     INFO -  resource://testing-common/PromiseTestUtils.sys.mjs:assertNoUncaughtRejections:266

About the simultaneous access to the rkv backend used by the ExtensionScriptingStore.jsm:

  • ExtensionScriptingStore does use a single rkv backend for all extensions and so it may be that a task scheduled for an rkv data update related to one of the test task to be still in flight while we are running another test task, but all of them should be triggered by the same process (the parent process) and from the same thread (the parent process main thread) and I'm not sure if that would actually count as simultaneous access from an rkv perspective

  • but, xpcshell test can be executed in parallel by default, and so I wonder if there may be multiple xpcshell processes running separate xpcshell test file in parallel using the same Gecko profile and that to be triggering the unexpected simultaneous access

I think it may be possible to double-check if that may be the case by forcing that test to not be running in parallel using run-sequentially in the xpcshell ini config for that test file.

(In reply to Chris H-C :chutten from comment #16)

If all we care about is a fix or confirming the suspicion, I suppose I could try adding a salt to the tmp file's name and seeing if that passes the test. We wouldn't want to ship it (failing to rename the file would result in filling up the data dir with unlimited numbers of temp files), but if works we'll at least know it's simultaneous access.

Yeah, looking to https://github.com/mozilla/rkv/pull/231/files made me wonder the same, it seems worth giving that a try as an alternative way to try to better pinpoint what is actually triggering the issue (I don't think it is unlikely to be what is triggering the failure when paired with multiple xpcshell test file running in parallel over the same Gecko profile directory).

Let me know how that goes and if I can help further on investigating this issue.

Flags: needinfo?(lgreco)

but, xpcshell test can be executed in parallel by default, and so I wonder if there may be multiple xpcshell processes running separate xpcshell test file in parallel using the same Gecko profile and that to be triggering the unexpected simultaneous access

xpcshell test failures when running simultaneously are retried sequentially, and only count as a "proper" failure if they fail again. This is one of those cases (see here), so we're not in the case where we're worrying about simultaneity between multiple xpcshell processes. So at least there's that.

But that leaves us trying to figure out how a single test running "in isolation" (maybe there's carry-over state?) is triggering this. I can try adding a whackload of logging in addition to the salt and see whether we can detect the simultaneous access somehow. (Or maybe it'll refuse to reproduce with logging providing synchronization. Either way, maybe we learn something)

We have evidence of simultaneous access! I wrapped each call of writer.commit()?; in toolkit/components/kvstore/task.rs with logs, and obtained the following log of a test failure on my Windows laptop:

 0:21.22 INFO "Verify stale persisted scripts cleared on re-install"
 0:21.21 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] WriteMany commit start!
 0:21.21 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] WriteMany commit end!
 0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539514    addons.repository       DEBUG   cacheAddons: enabled false IDs ["extension1@mochi.test"]"
 0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539515    addons.xpi      DEBUG   Starting install of extension1@mochi.test from file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/temp/generated-extension-14.xpi"
 0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539515    addons.xpi      DEBUG   Addon extension1@mochi.test will be installed as a packed xpi"
 0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518    addons.xpi-utils        DEBUG   Make addon app-profile:extension1@mochi.test visible"
 0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518    addons.xpi      DEBUG   XPIStates adding add-on extension1@mochi.test in {"addons":{},"staged":{},"path":"C:\\\\Users\\\\chutten\\\\gecko\\\\obj-x86_64-pc-mingw32\\\\temp\\\\xpc-profile-9163tqx2\\\\extensions"}: C:\\Users\\chutten\\gecko\\obj-x86_64-pc-mingw32\\temp\\xpc-profile-9163tqx2\\extensions\\extension1@mochi.test.xpi"
 0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518    addons.xpi      DEBUG   Updating XPIState for {"id":"extension1@mochi.test","syncGUID":"{b19b456f-048a-4af2-8ee2-f5980a717510}","version":"1.0","type":"extension","loader":null,"updateURL":null,"installOrigins":null,"manifestVersion":2,"optionsURL":null,"optionsType":null,"optionsBrowserStyle":true,"aboutURL":null,"defaultLocale":{"name":"Generated extension","creator":null,"developers":null,"translators":null,"contributors":null},"visible":true,"active":true,"userDisabled":false,"appDisabled":false,"embedderDisabled":false,"installDate":1671461539514,"updateDate":1671461539514,"applyBackgroundUpdates":1,"path":"C:\\\\Users\\\\chutten\\\\gecko\\\\obj-x86_64-pc-mingw32\\\\temp\\\\xpc-profile-9163tqx2\\\\extensions\\\\extension1@mochi.test.xpi","skinnable":false,"sourceURI":"file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/temp/generated-extension-14.xpi","releaseNotesURI":null,"softDisabled":false,"foreignInstall":false,"strictCompatibility":true,"locales":[],"targetApplications":[{"id":"toolkit@mozilla.org","minVersion":null,"maxVersion":null}],"targetPlatforms":[],"signedState":4,"signedDate":null,"seen":true,"dependencies":[],"incognito":"spanning","userPermissions":{"permissions":["scripting"],"origins":[]},"optionalPermissions":{"permissions":[],"origins":[]},"icons":{},"iconURL":null,"blocklistState":0,"blocklistURL":null,"startupData":null,"hidden":false,"installTelemetryInfo":null,"recommendationState":null,"rootURI":"jar:file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/extensions/extension1@mochi.test.xpi!/","location":"app-profile"}"
 0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518    addons.xpi      DEBUG   Install of file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/temp/generated-extension-14.xpi completed."
 0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518    addons.xpi      DEBUG   Loading bootstrap scope from jar:file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/extensions/extension1@mochi.test.xpi!/"
 0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518    addons.xpi      DEBUG   Calling bootstrap method install on extension1@mochi.test version 1.0"
 0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518    addons.xpi      DEBUG   Calling bootstrap method startup on extension1@mochi.test version 1.0"
 0:21.24 INFO "Extension attached"
 0:21.24 INFO "Extension attached"
 0:21.23 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] PutTask commit start!
 0:21.23 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] DeleteTask commit start!
 0:21.23 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] PutTask commit end!
 0:21.23 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] PutTask commit start!
 0:21.23 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] DeleteTask commit end!
 0:21.25 pid:5256 JavaScript error: c:\\users\\chutten\\gecko\\testing\\xpcshell\\head.js, line 238: uncaught exception: store error: IoError(Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." })
 0:21.25 INFO "CONSOLE_MESSAGE: (error) [JavaScript Error: "uncaught exception: store error: IoError(Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." })" {file: "c:\\users\\chutten\\gecko\\testing\\xpcshell\\head.js" line: 238}]"
 0:21.26 INFO "CONSOLE_MESSAGE: (info) 1671461539544    addons.xpi      DEBUG   removeTemporaryFile: file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/temp/generated-extension-14.xpi does not own temp file"

As you can see, there are three tasks (two Put, one Delete) interleaving their commits just before the IoError manifests. This is bad for a couple reasons:

  1. This means there were simultaneous commits happening on this test before this change. Which probably isn't great and may have lead to some interesting racy behaviour.
  2. Unless this is shown to be test-only, this could mean we're having racy behaviour right now.
  3. We don't know where this assumed thread safety is breaking down (Is kvstore supposed to provide it via nsISerialEventTarget? Is rkv supposed to via the writer() API?)

:rpl, if you can confirm that this is something test-specific we can file that investigation as a follow-up, temporarily disable this test, and land this bug. If that's not something we can guarantee, then... I guess we need Jan-Erik to help us understand whether rkv is supposed to offer any thread safety guarantees.

Flags: needinfo?(lgreco)
Flags: needinfo?(jrediger)

Hi chutten thanks for the additional details, I recorded a run of that test (as it is in mozilla-central right now, and so without the changes attached to Bug 1805427 and without your additional debug logs) using rr so that I could dig a bit more into which threads are running the <rkv::backend::impl_safe::transaction::RwTransactionImpl as rkv::backend::traits::BackendRwTransaction>::commit method (which I think may be the one where you added the debug logs you mentioned in your last comment on bugzilla).

I've been able to observe that:

  • while all the kvstore tasks are being handled from the thread 1 and queued into the background queue named "KeyValueDatabase"
  • then they are picked up from a "Background Pool #N" thread where the commit method is actually executed and the changes being written on disk,
  • but that Background Pool thread isn't always the same for the entire test (it changed a bunch of times in the recorded run I was inspecting), to be fair the one used earlier was already gone at that point (but I'm not sure if that would always be the case or if there may be something that would make two or more of those "Background Pool #N" threads to be running at the same time).

I wonder if Nika may help us to better understand either how we may be triggering the simultaneous access (and if there is anything we are doing wrong that is triggering it, either on the rkv or kvstore side or on the WebExtensions internals or test side).

About the test file itself, it doesn't seem we are doing anything particularly special in it:
it runs 2 test extensions and then it is installing a third one on top of the first test extension, and I wouldn't expect that to be having any special side-effect of what you observed in your build with the additional debug logs (the commit method being executed from different threads simultaneously).

And so with the details about the issue we have got so far, I wouldn't assume this to be a test only issue, at the moment it seems that if we manage to hit that in our tests there is at least a chance we may have been hitting it already and just didn't notice because rkv wasn't using a tmp file that was failing to be moved.

Flags: needinfo?(lgreco)

Yeah, sounds like it could indeed be a Nika question: kvstore is using a Background Task Queue which appears to be using multiple threads at different times (and perhaps simultaneously at least during tests!). Any ideas what might be going on?

Flags: needinfo?(nika)

Yesterday I looked into the safe-mode implementation of rkv and that one indeed has no safety against multi-threaded interleaving writes really.
My recent change with the temp file makes this now more visible, but I argue that there were already potential cases where it could lead to lost updates when some rkv transactions overlap.

Flags: needinfo?(jrediger)

I am not able to help here reading the rust code, but

that Background Pool thread isn't always the same for the entire test (it changed a bunch of times in the recorded run I was inspecting)

sounds like the normal behavior of a task queue that runs on top of a thread pool. The task queue should guarantee, that the events are run one after the other, but there is no guarantee on which thread of the thread pool they will be executed - it just should not happen in parallel. I would assume/hope that this is true also for task queues/thread pools used from rust. You might be able to confirm this if you add some long sleep inside the executed tasks and see if they still overlap (I am not sure how precise our logging order is when the logging happens fast enough from different threads). Or you might be able to pinpoint the beginning and ending of the single events in rr/pernosco to see if they really overlap.

And it might be unrelated to what we're experiencing, as some deeper dives I've been doing have shown that it's actually ExtensionScriptingStore and ExtensionPermissions which are acting simultaneously. They each hold their own separate kvstores atop the same rkv, so (afaict) there's nothing keeping them from stomping on each other. (The former is awaiting in deleteAll while the latter is in put. Each's task queue is running commit on the same store simultaneously)

Given the timing wrt the end-of-year and the Fx cycle, I'm gonna retarget landing this update early in the Fx111 cycle (so, Jan 16-ish?) with option to uplift to Fx110. Apologies to the See Also and Blocks lists.

Next steps: ...I'm really not sure here. It'd be nice to get some clear indication either from the writers or consumers of these APIs about their expectations of safety. It's pretty clear that there's overwrite possibilities between ExtensionScriptingStore and ExtensionPermissions in real, live Firefoxes, so :rpl it'd be nice to get your opinion on what you'd like done with that (it's own bug?).

What is clear is that I can't land this code as-is because I don't know what throwing an error during these overwrite events might do, when presently this has been a silent failure.

Flags: needinfo?(lgreco)
Priority: P1 → P2

(In reply to Chris H-C :chutten from comment #25)

And it might be unrelated to what we're experiencing, as some deeper dives I've been doing have shown that it's actually ExtensionScriptingStore and ExtensionPermissions which are acting simultaneously. They each hold their own separate kvstores atop the same rkv, so (afaict) there's nothing keeping them from stomping on each other. (The former is awaiting in deleteAll while the latter is in put. Each's task queue is running commit on the same store simultaneously)

Given the timing wrt the end-of-year and the Fx cycle, I'm gonna retarget landing this update early in the Fx111 cycle (so, Jan 16-ish?) with option to uplift to Fx110. Apologies to the See Also and Blocks lists.

Next steps: ...I'm really not sure here. It'd be nice to get some clear indication either from the writers or consumers of these APIs about their expectations of safety. It's pretty clear that there's overwrite possibilities between ExtensionScriptingStore and ExtensionPermissions in real, live Firefoxes, so :rpl it'd be nice to get your opinion on what you'd like done with that (it's own bug?).

What is clear is that I can't land this code as-is because I don't know what throwing an error during these overwrite events might do, when presently this has been a silent failure.

ouch, thanks Chris, if I'm reading your comment correctly then I think that I may see when this actually became an issue:

  • when we have introduced ExtensionScriptingStore as a backend for the scripting API content scripts details persisted on disk, we should have either used a different store path or shared the same KeyValueDatabase instance if we really wanted to share the same store path

ExtensionPermissions is currently using the rkv backend only in Nightly at the moment (see createStore function defined in ExtensionPermissions.jsm and the legacy JSON file on all other channel, and so at least in Beta and Release ExtensionScriptingStore would be the only one of those two to be using the rkv backend and there wouldn't be any concurrent access that could be triggering the same kind of failure we are getting by the ExtensionScriptingStore.jsm on Nightly.

Chris, if you do have still the builds where you where digging into this issue, would you mind to confirm that changing useRkv to false in the createStore function prevents the unexpected failure as we are assuming it should? (unless you have already did that).

As an additional thoughts about this:

  • if this kind of usage is introducing this kind of issues, then I'm wondering if we should tweak the comment in nsIKeyValue.idl where the following description is (see here in nsIKeyValue.idl) to make sure it is not worded in a way that would make it sound using nsIKeyValue.getOrCreate as done in ExtensionPermissions.jsm and ExtensionScriptingStore.jsm as fine if it is actually a footgun:
     * The service supports multiple named databases at the same path
     * (i.e. within the same storage file), so you can call this method
     * multiple times with the same path and different names to retrieve
     * multiple databases stored in the same location on disk.
  • given that even the comment would be easy to miss, and a method named getOrCreate may make it sound like the KeyValueDatabase returned by two different calls to nsIKeyValue meant to be a single instance of the same underlying KeyValueDatabase, I'm wondering it may be reasonable (and doable) to guard it against wrong usage patterns (e.g. an explicit error raised when multiple instances pointing to the same store file are being created and alive at the same time, or a diagnostic assertion or something similar)

wdyt?

Next steps: ...I'm really not sure here. It'd be nice to get some clear indication either from the writers or consumers of these APIs about their expectations of safety. It's pretty clear that there's overwrite possibilities between ExtensionScriptingStore and ExtensionPermissions in real, live Firefoxes, so :rpl it'd be nice to get your opinion on what you'd like done with that (it's own bug?).

yeah, if (as it sounds) we have enough confirmation that ExtensionScriptingStore and/or ExtensionPermission should be changed to prevent issues due to concurrent changes to the same store file, I think it would be worth filing its own bug.

Let me know if you prefer that I file the bug tracking the WebExtensions related parts.

Flags: needinfo?(lgreco)

(In reply to Luca Greco [:rpl] [:luca] [:lgreco] from comment #26)

Chris, if you do have still the builds where you where digging into this issue, would you mind to confirm that changing useRkv to false in the createStore function prevents the unexpected failure as we are assuming it should? (unless you have already did that).

I do have the builds and have now made the change to your instructions and rerun the test several times to confirm we're not hitting the issue. The system works!

As an additional thoughts about this:
<snip>
wdyt?

For sure, though kvstore isn't in my realm of influence so it might be best to also ask for opinions from Nika (or Lina. I hear she's back!) before going ahead.

Let me know if you prefer that I file the bug tracking the WebExtensions related parts.

I definitely prefer you file the bug. I'm on the Cc list of too many as it is : D

Two KeyValueDatabase objects do not share a single task queue, so have no ordering relative to one-another, so if you have two KeyValueDatabase instances which both act on the same file, you could definitely see a race (https://searchfox.org/mozilla-central/rev/57527d50ef5d3df412caa5d99536f0709399be6f/toolkit/components/kvstore/src/lib.rs#135).

I don't know a ton about the rkv setup and how it works under the hood, so redirecting to :lina who might have some more insight into what the callers are doing wrong or whether something should be cleaned up on the kvstore side.

Flags: needinfo?(nika) → needinfo?(lina)
No longer duplicate of this bug: 1804359
No longer blocks: 1793784

(In reply to Chris H-C :chutten from comment #27)

(In reply to Luca Greco [:rpl] [:luca] [:lgreco] from comment #26)
...

Let me know if you prefer that I file the bug tracking the WebExtensions related parts.

I definitely prefer you file the bug. I'm on the Cc list of too many as it is : D

Filed as Bug 1807010 (and added it to the blockers on this bug and Bug 1646182).

Blocks: 1807010
No longer blocks: 1646182
No longer blocks: 1807010
Depends on: 1807010
Blocks: 1798919
See Also: 1798919

If bug 1807010 removes the only simultaneous user of rkv, we'll be good to land this update after it. Lemme do a quick audit of rkv and kvstore users in tree...

In-tree crates using/wrapping rkv:

  • cert_storage
    • cert_storage uses a nsISerialEventTarget to serialize the Tasks by which it accesses its db. It is a service (nsICertStorage) and so is unlikely to have more than one instance accessing it simultaneously.
  • kvstore
    • kvstore is a utility upon which a few things have built their storage.
    • KeyValueService is the JS service fronting it which we now know has issues if anything's getOrCreate'd it with the same path. (thus bug 1807010). NotificationDB uses it on path ProfD/notificationstore, SyncHistory uses it on path ProfD/settings, ExtensionPermissions uses it on ProfD/extensions-store, and ExtensionScriptingStore uses it on ProfD/extensions-store. (So, after bug 1807010, it looks as though for JS consumers of kvstore we're in the clear).
    • I don't see any Rust or C++ consumers of kvstore.
  • xulstore
    • xulstore has been turned off in Fx110 and is slated for removal (per email to firefox-dev), so I'll count it as safe.
  • glean-core
    • In glean-core, access to storage requires using the dispatcher which runs its tasks sequentially. Except in tests, where we force sequential operation via utilities like lock_test.

To this marginally-trained eye, it looks as though we're in the clear. I'm presently rebasing and rebuilding and retesting the patchstack (expect an update to it soonish, reviewers) so we'll be ready to land this once bug 1807010 is clear and Fx111 cycle begins fresh. This may be eligible for uplift to 110 in early beta, but I'm not risking sheriff ire by trying to land this just before soft freeze even if the timing works out.

I am placing the decision of "where does rkv and kvstore wish to place their thread safety guarantee in the light of lmdb removal" firmly outside of scope of my operations in this bug.

Attachment #9308123 - Attachment description: Bug 1805427 - Update Glean to 52.0.0, uuid to 1.0 r?glandium!,janerik! → Bug 1805427 - Update Glean to 52.0.0 r?glandium!,janerik!

Hey Chris,
would you mind to double-check that with Bug 1807010 patch applied locally you are not able to trigger that test failure anymore as we are expecting to be the case?

The idea is that if we didn't got any better option, we may land Bug 1807010 at the start of the 111 nightly cycle, that way we should be able to land this one earlier in 111 nightly cycle and then request an uplift to 110 beta early enough (and on beta and release Bug 1807010 would not be necessary and so we should be able to uplift this one on its own, and let it bake on beta for most of the 110 beta cycle).

Btw, thanks again for the huge help provided to pin point what was the actual underlying issue, your help has been absolutely invaluable!!!

Flags: needinfo?(chutten)

I've rebuilt afresh on Windows and confirmed that the test still fails (fairly) reliably with the (latest revision of the) Glean 52 patchset applied. Then I applied bug 1807010's patch, rebuilt, and reran the test 5 times without failure.

The evidence is consistent with the hypothesis that bug 1807010's patch will solve things.

I don't think we'll get a better option than your proposed changes in bug 1807010. I don't expect changes to kvstore or KeyValueService to be as straightforward or rapid... and that's assuming that kvstore/KeyValueService wishes to supply thread safety and doesn't opt to leave it at documenting the current threading behaviour. So I'd really like it if you followed your proposed land+uplift plan. Otherwise I'll need to come up with a v52.0.1 or something that downgrades rkv, and removes the protection for buggy fs drivers it introduced.

Flags: needinfo?(chutten)

Had a chat with :rpl today and figured out that bug 1807010 could be made to not be blocking Gleanv52 if we patch kvstore to exhibit its pre-rkv-version-0.18.4 behaviour.

So the plan now becomes filing and fixing a bug blocking this bug that will patch kvstore's use of rkv. It'll explicitly capture and ignore IOErrors from commit() that are of kind: NotFound (ie, the specific errors thrown by simultaneous access). (We should also file a follow-up kvstore bug for removing this when/if its strategy around thread safety is locked in.) This keeps kvstore's behaviour from changing with the rkv update (ie, simultaneous access will silently overwrite updates as it has been since rkv switched to safe mode), which means we can leave bug 1807010 to be scheduled without worrying it might slow down the Glean update.

I'll go file that bug.

Depends on: 1810205
See Also: → 1807048
Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/116b727ef127 Update glean_parser to 6.4.0 r=janerik https://hg.mozilla.org/integration/autoland/rev/4bd136319a75 Update Glean to 52.0.0 r=glandium,janerik,webdriver-reviewers,supply-chain-reviewers https://hg.mozilla.org/integration/autoland/rev/7ddfeb2d877e Turn on Glean's trim_data_to_registered_pings r=janerik https://hg.mozilla.org/integration/autoland/rev/33e4898adb8a Take advantage of new Cow<str> labels in Glean r=janerik
Blocks: 1810802

(Since this is fixed, I replied in bug 1807010, comment 2. 😊)

Flags: needinfo?(lina)
Regressions: 1811872
See Also: → 1821056
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: