Update to Glean SDK 52.0.0, glean_parser 6.4.0
Categories
(Toolkit :: Telemetry, task, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox111 | --- | fixed |
People
(Reporter: chutten, Assigned: chutten)
References
(Blocks 1 open bug)
Details
Attachments
(4 files)
Following the directions (and glean_parser
's, too) update to Glean SDK v52.0.0 and glean_parser v6.4.0
This ought to help unblock bug 1790872 by using CowString for labels. Might have interactions with bug 1804359 which has a Glean SDK update in it. Requires updates to t/c/g/build_scripts
to handle CowStrings properly and to have register_pings
appropriately register only the pings associated with the app_id being init (see bug 1804915). And probably some other stuff too, we'll see : )
Comment 1•2 years ago
|
||
This also unblocks bug 1798919
Assignee | ||
Comment 2•2 years ago
|
||
Assignee | ||
Comment 3•2 years ago
|
||
Depends on D164603
Assignee | ||
Comment 4•2 years ago
|
||
FOG gets to learn which app ids register which pings, allowing it to tell the
Glean SDK to trim all data not associated with those pings.
This will keep other app_ids' custom pings' events from growing without bound.
Alas, this does not come with tests as the trim behaviour is very internal to
the SDK. As such we'll have to rely on the SDK's tests.
Depends on D164604
Assignee | ||
Comment 5•2 years ago
|
||
Depends on D164605
Comment 8•2 years ago
|
||
Backed out for causing xpc failures in toolkit/components/extensions/test/xpcshell/test_ext_scripting_persistAcrossSessions.js
Backout link: https://hg.mozilla.org/integration/autoland/rev/5955c9f254723ec95c130c4cea747fbf230f5c1f
Assignee | ||
Comment 9•2 years ago
|
||
That's an odd place for this to be causing problems. What're the chances this'll reproduce locally on Linux...
Assignee | ||
Comment 10•2 years ago
|
||
Naturally, it doesn't reproduce locally on my Linux.
[task 2022-12-15T16:21:39.696Z] 16:21:39 WARNING - TEST-UNEXPECTED-FAIL | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_scripting_persistAcrossSessions.js | test_persisted_scripts_cleared_on_addon_updates - [test_persisted_scripts_cleared_on_addon_updates : 266] A promise chain failed to handle a rejection: store error: IoError(Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." }) - stack: (No stack available.)
"No stack available." isn't particularly helpful here. Let's see about test_ext_scripting_persistAcrossSessions.js:266
... that's a weird line to have an error on. It's just a param in a dict. What's going on there?
Now that I'm on a different tip of tree, let's see what try
makes of it: https://treeherder.mozilla.org/jobs?repo=try&revision=73429d174c8451817da6617e2a76a2e01a42e29d
Assignee | ||
Comment 11•2 years ago
|
||
Wait a minute, the test it's referencing (test_persisted_scripts_cleared_on_addon_updates
) isn't on line 266
at all. It's at the bottom of the file.
Even in the version of that file that was on autoland
at the time that's the wrong line number for that test.
Weird. I'm very confused.
Let's take a different angle. What's the error. store error: IoError(Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." })
. Looks Rust-ish. Ah it's from rkv (or kvstore, or xulstore). I suppose then it could be Glean-related: we do use rkv. Seems a little unlikely, but who knows.... oh wait, if it is rkv
then this is a migration error. To my knowledge Glean doesn't migrate on Firefox Desktop: we've been off LMDB since November 2020.
This is suggesting to me it isn't Glean, so maybe it's a different rkv
consumer or it's kvstore
or xulstore
related?
I'm running low on threads to pull. Hopefully try
gets back to me soonish.
Comment 12•2 years ago
|
||
Is there some data that might be in the profile? I'm not familiar with this particular test but I encountered errors in the past with tests that pulled the profile from under the rest of Firefox machinery, especially asynchronous machinery.
Assignee | ||
Comment 13•2 years ago
|
||
It's Glean, everything involves data (And the SDK does store stuff in the profile) : )
A read of the test suggests it isn't hitting disk anywhere near where the Glean SDK is operating, though. Far more likely to my mind is the possibility that the ExtensionScriptStore
under test's backing kvstore
is the thing returning the error than there is anything Glean's doing.
Assignee | ||
Comment 14•2 years ago
|
||
try
confirms it's borked on Windows. As such, I've been spending my morning trying to get a Windows build to compile (gosh, there sure are a lot of warnings, and misuses of ,
operators inside asserts mean I might not be allowed to compile debug). Fingers crossed this actually compiles for me soon. I'd like to figure this out and get this landed in as early in this Fx110 cycle as possible.
Assignee | ||
Comment 15•2 years ago
|
||
I have confirmed this is kvstore by logging which rkv
wrapper is throwing. I've also managed to make this fail exactly once on my Linux machine, so I don't think this is Windows-specific (just Windows-likely).
Looking at the suspects in the code changes, I've landed on a possibility: The rkv
update contains this change which writes to a temporary file then renames it (for atomicity). It could be that the rename
is trying to happen on a file that, despite being created the line before, has been removed.
I'm not sure by what means which file may be being removed, though. If we can rely on the log order, the error properly happens just before this line of the test and sometime after this line
With a guess about where this is happening can I figure out what might be happening... (an hour later) No. No I cannot. The code gets too complex for an easy investigation by an outsider.
Luca, can you help me out here? There is a problem of (I think) multiple simultaneous accesses to the ExtensionScriptingStore
kvstore
in test_ext_scripting_persistAcrossSessions.js
(perhaps between this line and this line). It mostly manifests on Windows, but I've gotten it to happen on Linux, too. Any ideas about whether this test is capable of simultaneous commits to kvstore
?
Assignee | ||
Comment 16•2 years ago
|
||
If all we care about is a fix or confirming the suspicion, I suppose I could try adding a salt to the tmp file's name and seeing if that passes the test. We wouldn't want to ship it (failing to rename the file would result in filling up the data dir with unlimited numbers of temp files), but if works we'll at least know it's simultaneous access.
(In reply to Chris H-C :chutten from comment #16)
If all we care about is a fix or confirming the suspicion, I suppose I could try adding a salt to the tmp file's name and seeing if that passes the test. We wouldn't want to ship it (failing to rename the file would result in filling up the data dir with unlimited numbers of temp files), but if works we'll at least know it's simultaneous access.
Ugh. Simultaneous access would mean it is probably broken in some very hidden way in the old way as well (overriding data in some cases).
(The "safe mode" is far less safe than I would like)
Comment 18•2 years ago
|
||
(In reply to Chris H-C :chutten from comment #16)
If all we care about is a fix or confirming the suspicion, I suppose I could try adding a salt to the tmp file's name and seeing if that passes the test. We wouldn't want to ship it (failing to rename the file would result in filling up the data dir with unlimited numbers of temp files), but if works we'll at least know it's simultaneous access.
I think that this part has been already figure out, but 266 is the line from PromiseTestUtils.sys.mjs where the uncaught rejection has been originated (and not the line from the test_ext_scripting_presistAcrossSession.js test file):
[task 2022-12-15T16:21:39.697Z] 16:21:39 INFO - resource://testing-common/PromiseTestUtils.sys.mjs:assertNoUncaughtRejections:266
About the simultaneous access to the rkv backend used by the ExtensionScriptingStore.jsm:
-
ExtensionScriptingStore does use a single rkv backend for all extensions and so it may be that a task scheduled for an rkv data update related to one of the test task to be still in flight while we are running another test task, but all of them should be triggered by the same process (the parent process) and from the same thread (the parent process main thread) and I'm not sure if that would actually count as simultaneous access from an rkv perspective
-
but, xpcshell test can be executed in parallel by default, and so I wonder if there may be multiple xpcshell processes running separate xpcshell test file in parallel using the same Gecko profile and that to be triggering the unexpected simultaneous access
I think it may be possible to double-check if that may be the case by forcing that test to not be running in parallel using run-sequentially
in the xpcshell ini config for that test file.
(In reply to Chris H-C :chutten from comment #16)
If all we care about is a fix or confirming the suspicion, I suppose I could try adding a salt to the tmp file's name and seeing if that passes the test. We wouldn't want to ship it (failing to rename the file would result in filling up the data dir with unlimited numbers of temp files), but if works we'll at least know it's simultaneous access.
Yeah, looking to https://github.com/mozilla/rkv/pull/231/files made me wonder the same, it seems worth giving that a try as an alternative way to try to better pinpoint what is actually triggering the issue (I don't think it is unlikely to be what is triggering the failure when paired with multiple xpcshell test file running in parallel over the same Gecko profile directory).
Let me know how that goes and if I can help further on investigating this issue.
Assignee | ||
Comment 19•2 years ago
|
||
but, xpcshell test can be executed in parallel by default, and so I wonder if there may be multiple xpcshell processes running separate xpcshell test file in parallel using the same Gecko profile and that to be triggering the unexpected simultaneous access
xpcshell test failures when running simultaneously are retried sequentially, and only count as a "proper" failure if they fail again. This is one of those cases (see here), so we're not in the case where we're worrying about simultaneity between multiple xpcshell processes. So at least there's that.
But that leaves us trying to figure out how a single test running "in isolation" (maybe there's carry-over state?) is triggering this. I can try adding a whackload of logging in addition to the salt and see whether we can detect the simultaneous access somehow. (Or maybe it'll refuse to reproduce with logging providing synchronization. Either way, maybe we learn something)
Assignee | ||
Comment 20•2 years ago
•
|
||
We have evidence of simultaneous access! I wrapped each call of writer.commit()?;
in toolkit/components/kvstore/task.rs
with logs, and obtained the following log of a test failure on my Windows laptop:
0:21.22 INFO "Verify stale persisted scripts cleared on re-install"
0:21.21 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] WriteMany commit start!
0:21.21 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] WriteMany commit end!
0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539514 addons.repository DEBUG cacheAddons: enabled false IDs ["extension1@mochi.test"]"
0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539515 addons.xpi DEBUG Starting install of extension1@mochi.test from file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/temp/generated-extension-14.xpi"
0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539515 addons.xpi DEBUG Addon extension1@mochi.test will be installed as a packed xpi"
0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518 addons.xpi-utils DEBUG Make addon app-profile:extension1@mochi.test visible"
0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518 addons.xpi DEBUG XPIStates adding add-on extension1@mochi.test in {"addons":{},"staged":{},"path":"C:\\\\Users\\\\chutten\\\\gecko\\\\obj-x86_64-pc-mingw32\\\\temp\\\\xpc-profile-9163tqx2\\\\extensions"}: C:\\Users\\chutten\\gecko\\obj-x86_64-pc-mingw32\\temp\\xpc-profile-9163tqx2\\extensions\\extension1@mochi.test.xpi"
0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518 addons.xpi DEBUG Updating XPIState for {"id":"extension1@mochi.test","syncGUID":"{b19b456f-048a-4af2-8ee2-f5980a717510}","version":"1.0","type":"extension","loader":null,"updateURL":null,"installOrigins":null,"manifestVersion":2,"optionsURL":null,"optionsType":null,"optionsBrowserStyle":true,"aboutURL":null,"defaultLocale":{"name":"Generated extension","creator":null,"developers":null,"translators":null,"contributors":null},"visible":true,"active":true,"userDisabled":false,"appDisabled":false,"embedderDisabled":false,"installDate":1671461539514,"updateDate":1671461539514,"applyBackgroundUpdates":1,"path":"C:\\\\Users\\\\chutten\\\\gecko\\\\obj-x86_64-pc-mingw32\\\\temp\\\\xpc-profile-9163tqx2\\\\extensions\\\\extension1@mochi.test.xpi","skinnable":false,"sourceURI":"file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/temp/generated-extension-14.xpi","releaseNotesURI":null,"softDisabled":false,"foreignInstall":false,"strictCompatibility":true,"locales":[],"targetApplications":[{"id":"toolkit@mozilla.org","minVersion":null,"maxVersion":null}],"targetPlatforms":[],"signedState":4,"signedDate":null,"seen":true,"dependencies":[],"incognito":"spanning","userPermissions":{"permissions":["scripting"],"origins":[]},"optionalPermissions":{"permissions":[],"origins":[]},"icons":{},"iconURL":null,"blocklistState":0,"blocklistURL":null,"startupData":null,"hidden":false,"installTelemetryInfo":null,"recommendationState":null,"rootURI":"jar:file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/extensions/extension1@mochi.test.xpi!/","location":"app-profile"}"
0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518 addons.xpi DEBUG Install of file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/temp/generated-extension-14.xpi completed."
0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518 addons.xpi DEBUG Loading bootstrap scope from jar:file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/extensions/extension1@mochi.test.xpi!/"
0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518 addons.xpi DEBUG Calling bootstrap method install on extension1@mochi.test version 1.0"
0:21.23 INFO "CONSOLE_MESSAGE: (info) 1671461539518 addons.xpi DEBUG Calling bootstrap method startup on extension1@mochi.test version 1.0"
0:21.24 INFO "Extension attached"
0:21.24 INFO "Extension attached"
0:21.23 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] PutTask commit start!
0:21.23 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] DeleteTask commit start!
0:21.23 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] PutTask commit end!
0:21.23 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] PutTask commit start!
0:21.23 pid:5256 [2022-12-19T14:52:19Z ERROR kvstore::task] DeleteTask commit end!
0:21.25 pid:5256 JavaScript error: c:\\users\\chutten\\gecko\\testing\\xpcshell\\head.js, line 238: uncaught exception: store error: IoError(Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." })
0:21.25 INFO "CONSOLE_MESSAGE: (error) [JavaScript Error: "uncaught exception: store error: IoError(Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." })" {file: "c:\\users\\chutten\\gecko\\testing\\xpcshell\\head.js" line: 238}]"
0:21.26 INFO "CONSOLE_MESSAGE: (info) 1671461539544 addons.xpi DEBUG removeTemporaryFile: file:///C:/Users/chutten/gecko/obj-x86_64-pc-mingw32/temp/xpc-profile-9163tqx2/temp/generated-extension-14.xpi does not own temp file"
As you can see, there are three tasks (two Put
, one Delete
) interleaving their commits just before the IoError
manifests. This is bad for a couple reasons:
- This means there were simultaneous commits happening on this test before this change. Which probably isn't great and may have lead to some interesting racy behaviour.
- Unless this is shown to be test-only, this could mean we're having racy behaviour right now.
- We don't know where this assumed thread safety is breaking down (Is
kvstore
supposed to provide it viansISerialEventTarget
? Isrkv
supposed to via thewriter()
API?)
:rpl, if you can confirm that this is something test-specific we can file that investigation as a follow-up, temporarily disable this test, and land this bug. If that's not something we can guarantee, then... I guess we need Jan-Erik to help us understand whether rkv
is supposed to offer any thread safety guarantees.
Comment 21•2 years ago
|
||
Hi chutten thanks for the additional details, I recorded a run of that test (as it is in mozilla-central right now, and so without the changes attached to Bug 1805427 and without your additional debug logs) using rr so that I could dig a bit more into which threads are running the <rkv::backend::impl_safe::transaction::RwTransactionImpl as rkv::backend::traits::BackendRwTransaction>::commit
method (which I think may be the one where you added the debug logs you mentioned in your last comment on bugzilla).
I've been able to observe that:
- while all the kvstore tasks are being handled from the thread 1 and queued into the background queue named "KeyValueDatabase"
- then they are picked up from a "Background Pool #N" thread where the
commit
method is actually executed and the changes being written on disk, - but that Background Pool thread isn't always the same for the entire test (it changed a bunch of times in the recorded run I was inspecting), to be fair the one used earlier was already gone at that point (but I'm not sure if that would always be the case or if there may be something that would make two or more of those "Background Pool #N" threads to be running at the same time).
I wonder if Nika may help us to better understand either how we may be triggering the simultaneous access (and if there is anything we are doing wrong that is triggering it, either on the rkv or kvstore side or on the WebExtensions internals or test side).
About the test file itself, it doesn't seem we are doing anything particularly special in it:
it runs 2 test extensions and then it is installing a third one on top of the first test extension, and I wouldn't expect that to be having any special side-effect of what you observed in your build with the additional debug logs (the commit
method being executed from different threads simultaneously).
And so with the details about the issue we have got so far, I wouldn't assume this to be a test only issue, at the moment it seems that if we manage to hit that in our tests there is at least a chance we may have been hitting it already and just didn't notice because rkv wasn't using a tmp file that was failing to be moved.
Assignee | ||
Comment 22•2 years ago
|
||
Yeah, sounds like it could indeed be a Nika question: kvstore
is using a Background Task Queue which appears to be using multiple threads at different times (and perhaps simultaneously at least during tests!). Any ideas what might be going on?
Yesterday I looked into the safe-mode
implementation of rkv and that one indeed has no safety against multi-threaded interleaving writes really.
My recent change with the temp file makes this now more visible, but I argue that there were already potential cases where it could lead to lost updates when some rkv transactions overlap.
Comment 24•2 years ago
|
||
I am not able to help here reading the rust code, but
that Background Pool thread isn't always the same for the entire test (it changed a bunch of times in the recorded run I was inspecting)
sounds like the normal behavior of a task queue that runs on top of a thread pool. The task queue should guarantee, that the events are run one after the other, but there is no guarantee on which thread of the thread pool they will be executed - it just should not happen in parallel. I would assume/hope that this is true also for task queues/thread pools used from rust. You might be able to confirm this if you add some long sleep inside the executed tasks and see if they still overlap (I am not sure how precise our logging order is when the logging happens fast enough from different threads). Or you might be able to pinpoint the beginning and ending of the single events in rr/pernosco to see if they really overlap.
Assignee | ||
Comment 25•2 years ago
|
||
And it might be unrelated to what we're experiencing, as some deeper dives I've been doing have shown that it's actually ExtensionScriptingStore
and ExtensionPermissions
which are acting simultaneously. They each hold their own separate kvstore
s atop the same rkv
, so (afaict) there's nothing keeping them from stomping on each other. (The former is awaiting in deleteAll
while the latter is in put
. Each's task queue is running commit on the same store simultaneously)
Given the timing wrt the end-of-year and the Fx cycle, I'm gonna retarget landing this update early in the Fx111 cycle (so, Jan 16-ish?) with option to uplift to Fx110. Apologies to the See Also and Blocks lists.
Next steps: ...I'm really not sure here. It'd be nice to get some clear indication either from the writers or consumers of these APIs about their expectations of safety. It's pretty clear that there's overwrite possibilities between ExtensionScriptingStore
and ExtensionPermissions
in real, live Firefoxes, so :rpl it'd be nice to get your opinion on what you'd like done with that (it's own bug?).
What is clear is that I can't land this code as-is because I don't know what throwing an error during these overwrite events might do, when presently this has been a silent failure.
Assignee | ||
Updated•2 years ago
|
Comment 26•2 years ago
|
||
(In reply to Chris H-C :chutten from comment #25)
And it might be unrelated to what we're experiencing, as some deeper dives I've been doing have shown that it's actually
ExtensionScriptingStore
andExtensionPermissions
which are acting simultaneously. They each hold their own separatekvstore
s atop the samerkv
, so (afaict) there's nothing keeping them from stomping on each other. (The former is awaiting indeleteAll
while the latter is input
. Each's task queue is running commit on the same store simultaneously)Given the timing wrt the end-of-year and the Fx cycle, I'm gonna retarget landing this update early in the Fx111 cycle (so, Jan 16-ish?) with option to uplift to Fx110. Apologies to the See Also and Blocks lists.
Next steps: ...I'm really not sure here. It'd be nice to get some clear indication either from the writers or consumers of these APIs about their expectations of safety. It's pretty clear that there's overwrite possibilities between
ExtensionScriptingStore
andExtensionPermissions
in real, live Firefoxes, so :rpl it'd be nice to get your opinion on what you'd like done with that (it's own bug?).What is clear is that I can't land this code as-is because I don't know what throwing an error during these overwrite events might do, when presently this has been a silent failure.
ouch, thanks Chris, if I'm reading your comment correctly then I think that I may see when this actually became an issue:
- when we have introduced
ExtensionScriptingStore
as a backend for thescripting
API content scripts details persisted on disk, we should have either used a different store path or shared the sameKeyValueDatabase
instance if we really wanted to share the same store path
ExtensionPermissions is currently using the rkv backend only in Nightly at the moment (see createStore
function defined in ExtensionPermissions.jsm and the legacy JSON file on all other channel, and so at least in Beta and Release ExtensionScriptingStore
would be the only one of those two to be using the rkv backend and there wouldn't be any concurrent access that could be triggering the same kind of failure we are getting by the ExtensionScriptingStore.jsm on Nightly.
Chris, if you do have still the builds where you where digging into this issue, would you mind to confirm that changing useRkv
to false
in the createStore function prevents the unexpected failure as we are assuming it should? (unless you have already did that).
As an additional thoughts about this:
- if this kind of usage is introducing this kind of issues, then I'm wondering if we should tweak the comment in nsIKeyValue.idl where the following description is (see here in nsIKeyValue.idl) to make sure it is not worded in a way that would make it sound using
nsIKeyValue.getOrCreate
as done in ExtensionPermissions.jsm and ExtensionScriptingStore.jsm as fine if it is actually a footgun:
* The service supports multiple named databases at the same path
* (i.e. within the same storage file), so you can call this method
* multiple times with the same path and different names to retrieve
* multiple databases stored in the same location on disk.
- given that even the comment would be easy to miss, and a method named
getOrCreate
may make it sound like the KeyValueDatabase returned by two different calls to nsIKeyValue meant to be a single instance of the same underlyingKeyValueDatabase
, I'm wondering it may be reasonable (and doable) to guard it against wrong usage patterns (e.g. an explicit error raised when multiple instances pointing to the same store file are being created and alive at the same time, or a diagnostic assertion or something similar)
wdyt?
Next steps: ...I'm really not sure here. It'd be nice to get some clear indication either from the writers or consumers of these APIs about their expectations of safety. It's pretty clear that there's overwrite possibilities between
ExtensionScriptingStore
andExtensionPermissions
in real, live Firefoxes, so :rpl it'd be nice to get your opinion on what you'd like done with that (it's own bug?).
yeah, if (as it sounds) we have enough confirmation that ExtensionScriptingStore
and/or ExtensionPermission
should be changed to prevent issues due to concurrent changes to the same store file, I think it would be worth filing its own bug.
Let me know if you prefer that I file the bug tracking the WebExtensions related parts.
Assignee | ||
Comment 27•2 years ago
|
||
(In reply to Luca Greco [:rpl] [:luca] [:lgreco] from comment #26)
Chris, if you do have still the builds where you where digging into this issue, would you mind to confirm that changing
useRkv
tofalse
in the createStore function prevents the unexpected failure as we are assuming it should? (unless you have already did that).
I do have the builds and have now made the change to your instructions and rerun the test several times to confirm we're not hitting the issue. The system works!
As an additional thoughts about this:
<snip>
wdyt?
For sure, though kvstore
isn't in my realm of influence so it might be best to also ask for opinions from Nika (or Lina. I hear she's back!) before going ahead.
Let me know if you prefer that I file the bug tracking the WebExtensions related parts.
I definitely prefer you file the bug. I'm on the Cc list of too many as it is : D
Comment 28•2 years ago
|
||
Two KeyValueDatabase
objects do not share a single task queue, so have no ordering relative to one-another, so if you have two KeyValueDatabase
instances which both act on the same file, you could definitely see a race (https://searchfox.org/mozilla-central/rev/57527d50ef5d3df412caa5d99536f0709399be6f/toolkit/components/kvstore/src/lib.rs#135).
I don't know a ton about the rkv setup and how it works under the hood, so redirecting to :lina who might have some more insight into what the callers are doing wrong or whether something should be cleaned up on the kvstore side.
Comment 29•2 years ago
|
||
(In reply to Chris H-C :chutten from comment #27)
(In reply to Luca Greco [:rpl] [:luca] [:lgreco] from comment #26)
...Let me know if you prefer that I file the bug tracking the WebExtensions related parts.
I definitely prefer you file the bug. I'm on the Cc list of too many as it is : D
Filed as Bug 1807010 (and added it to the blockers on this bug and Bug 1646182).
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 30•2 years ago
|
||
If bug 1807010 removes the only simultaneous user of rkv
, we'll be good to land this update after it. Lemme do a quick audit of rkv
and kvstore
users in tree...
In-tree crates using/wrapping rkv
:
- cert_storage
cert_storage
uses ansISerialEventTarget
to serialize theTask
s by which it accesses its db. It is a service (nsICertStorage
) and so is unlikely to have more than one instance accessing it simultaneously.
- kvstore
kvstore
is a utility upon which a few things have built their storage.KeyValueService
is the JS service fronting it which we now know has issues if anything'sgetOrCreate
'd it with the same path. (thus bug 1807010).NotificationDB
uses it on pathProfD/notificationstore
,SyncHistory
uses it on pathProfD/settings
,ExtensionPermissions
uses it onProfD/extensions-store
, andExtensionScriptingStore
uses it onProfD/extensions-store
. (So, after bug 1807010, it looks as though for JS consumers of kvstore we're in the clear).- I don't see any Rust or C++ consumers of kvstore.
- xulstore
xulstore
has been turned off in Fx110 and is slated for removal (per email to firefox-dev), so I'll count it as safe.
- glean-core
- In
glean-core
, access to storage requires using the dispatcher which runs its tasks sequentially. Except in tests, where we force sequential operation via utilities likelock_test
.
- In
To this marginally-trained eye, it looks as though we're in the clear. I'm presently rebasing and rebuilding and retesting the patchstack (expect an update to it soonish, reviewers) so we'll be ready to land this once bug 1807010 is clear and Fx111 cycle begins fresh. This may be eligible for uplift to 110 in early beta, but I'm not risking sheriff ire by trying to land this just before soft freeze even if the timing works out.
I am placing the decision of "where does rkv
and kvstore
wish to place their thread safety guarantee in the light of lmdb
removal" firmly outside of scope of my operations in this bug.
Updated•2 years ago
|
Comment 31•2 years ago
|
||
Hey Chris,
would you mind to double-check that with Bug 1807010 patch applied locally you are not able to trigger that test failure anymore as we are expecting to be the case?
The idea is that if we didn't got any better option, we may land Bug 1807010 at the start of the 111 nightly cycle, that way we should be able to land this one earlier in 111 nightly cycle and then request an uplift to 110 beta early enough (and on beta and release Bug 1807010 would not be necessary and so we should be able to uplift this one on its own, and let it bake on beta for most of the 110 beta cycle).
Btw, thanks again for the huge help provided to pin point what was the actual underlying issue, your help has been absolutely invaluable!!!
Assignee | ||
Comment 32•2 years ago
|
||
I've rebuilt afresh on Windows and confirmed that the test still fails (fairly) reliably with the (latest revision of the) Glean 52 patchset applied. Then I applied bug 1807010's patch, rebuilt, and reran the test 5 times without failure.
The evidence is consistent with the hypothesis that bug 1807010's patch will solve things.
I don't think we'll get a better option than your proposed changes in bug 1807010. I don't expect changes to kvstore
or KeyValueService
to be as straightforward or rapid... and that's assuming that kvstore/KeyValueService wishes to supply thread safety and doesn't opt to leave it at documenting the current threading behaviour. So I'd really like it if you followed your proposed land+uplift plan. Otherwise I'll need to come up with a v52.0.1 or something that downgrades rkv
, and removes the protection for buggy fs drivers it introduced.
Assignee | ||
Comment 33•2 years ago
|
||
Had a chat with :rpl today and figured out that bug 1807010 could be made to not be blocking Gleanv52 if we patch kvstore
to exhibit its pre-rkv
-version-0.18.4 behaviour.
So the plan now becomes filing and fixing a bug blocking this bug that will patch kvstore
's use of rkv
. It'll explicitly capture and ignore IOError
s from commit()
that are of kind: NotFound
(ie, the specific errors thrown by simultaneous access). (We should also file a follow-up kvstore
bug for removing this when/if its strategy around thread safety is locked in.) This keeps kvstore
's behaviour from changing with the rkv
update (ie, simultaneous access will silently overwrite updates as it has been since rkv
switched to safe mode), which means we can leave bug 1807010 to be scheduled without worrying it might slow down the Glean update.
I'll go file that bug.
Comment 34•2 years ago
|
||
Comment 35•2 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/116b727ef127
https://hg.mozilla.org/mozilla-central/rev/4bd136319a75
https://hg.mozilla.org/mozilla-central/rev/7ddfeb2d877e
https://hg.mozilla.org/mozilla-central/rev/33e4898adb8a
Comment 36•2 years ago
|
||
(Since this is fixed, I replied in bug 1807010, comment 2. 😊)
Description
•