Closed Bug 1807010 Opened 2 years ago Closed 2 years ago

Prevent ExtensionPermissions and ExtensionScriptingStore kvstore based backends to hit race conditions issues

Categories

(WebExtensions :: Storage, defect, P2)

defect

Tracking

(firefox112 fixed)

RESOLVED FIXED
112 Branch
Tracking Status
firefox112 --- fixed

People

(Reporter: rpl, Assigned: rpl)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [addons-jira])

Attachments

(1 file)

As part of investigating the test_ext_scripting_persistAcrossSessions.js failures that are currently blocking Bug 1805427 we have determined that in Nightly (where ExtensionPermissions kvstore backend is enabled) ExtensionPermissions and ExtensionScriptingStore kvstore instances are racing with each other on updating the same rkv safemode data file from potentially two separate and independent Background Pool threads (See Bug 1805427 comments starting from Bug 1805427 comment 20, and in particular Bug 1805427 comment 28).

This race can only be currently hit in Nightly, because ExtensionPermissions doesn't use the kvstore backend in the non-Nightly channels including release, and so this bug is meant to block Bug 1646182 and track followup work to make sure to prevent this race.

It seems also reasonable to wait for Lina's perspective in Bug 1805427 to assess if the fix for this kind of issue should happen on the rkv / kvstore side or if we should consider changes ExtensionPermissions or ExtensionScriptingStore kvstore backends to prevent it.

Depends on: 1805427
Whiteboard: [addons-jira]
Assignee: nobody → lgreco
Status: NEW → ASSIGNED
Blocks: 1805427
No longer depends on: 1805427
Severity: -- → S4
Priority: -- → P2
Blocks: 1810212
See Also: → 1807048

It seems also reasonable to wait for Lina's perspective in Bug 1805427 to assess if the fix for this kind of issue should happen on the rkv / kvstore side or if we should consider changes ExtensionPermissions or ExtensionScriptingStore kvstore backends to prevent it.

Hi Luca! Since bug 1805427 is fixed now, I'll drop some comments in here!

It's been a really long time since I've looked at rkv and kvstore, but, if I'm understanding right, there are a few things going on here: LMDB thinks about databases one way; rkv abstracts over that and also provides a "safe mode" that's backed by flat files instead of LMDB; and kvstore wraps rkv's API with a new set of semantics:

  • In the LMDB world, the environment mediates access to all its containing databases. A database is a named bucket of key-value pairs, and is accessed via a transaction. Transactions are scoped to the entire environment. An environment can have multiple active read-only transactions, and only one active read-write transaction. Readers and writers don't block each other.
  • In rkv's safe mode, the environment maintains an in-memory copy of all its containing databases, which is persisted to a flat file. Database contents are loaded into memory when the file is opened, and flushed to disk after each write transaction.
  • In kvstore, nsIKeyValueService roughly corresponds to an rkv Manager, and the nsIKeyValueDatabase references that it hands out correspond to an rkv SingleStore. Each nsIKeyValueDatabase also gets its own separate task queue (nsISerialEventTarget), and all operations on the database are executed in a separate transaction on that queue.

I did a bit of digging, and came away with this assessment. Jan-Erik or Nan, could you check me on this, please? 😊

  1. rkv's safe mode tries to hand out references to the same database (per https://github.com/mozilla/rkv/pull/160). So, if you open the same database multiple times, those rkv *Store objects should all point to the same database object, right? But kvstore gives each nsIKeyValueDatabase its own task queue (bug 1805427, comment 28), so any operations run on those *Store*s can interleave.
  2. kvstore wraps each operation on the underlying database in its own transaction. So when deleteAll in ExtensionScriptingStore calls nsIKeyValueDatabase.delete in a loop (https://searchfox.org/mozilla-central/rev/5bcbe6ae54ee5b152f6cef29dc5627ec7c0e1f1e/toolkit/components/extensions/ExtensionScriptingStore.jsm#102-107), and ExtensionPermissions calls nsIKeyValueDatabase.put on the same underlying database, it's expected that the put can execute between the two deletes (bug 1805427, comment 20), because they're running on two different task queues in different transactions.
  3. rkv's safe mode protects the in-memory database contents with an RwLock<EnvironmentDbs> (https://github.com/mozilla/rkv/blob/c5a7594c830903f59eb28c7e32daa8200744323e/src/backend/impl_safe/environment.rs#L234-L236), but that lock is released in https://github.com/mozilla/rkv/blob/c5a7594c830903f59eb28c7e32daa8200744323e/src/backend/impl_safe/transaction.rs#L192, before flushing the contents to disk. So, even if (2) weren't a factor, multiple calls to EnvironmentImpl.write_to_disk() from different task queues could still race with each other.

Here are some ideas for fixing this in rkv / kvstore, so that multiple callers can use the same named database, like Extension Storage does now:

  1. Wait to drop the RwLock<EnvironmentDbs> until write_to_disk() is done. This isn't good, because it'll block all readers until the write is done, which doesn't match LMDB's semantics (readers and writers don't block each other). The file can also still change on disk, out from under rkv.
  2. Do something with file locks in rkv's safe mode backend, which fixes the "file changing on disk" issue from (1), in exchange for lots more cross-platform complexity...and would still block on the writer.
  3. Accept that this can happen in rkv, and fix the problem at the kvstore layer by caching nsIKeyValueDatabase objects, so that the same database always gets the same task queue. This would bring the semantics of nsIKeyValueService.getOrCreate in line with what we expect, but needs extra logic to manage the cache, and doesn't fix the issue for non-Firefox consumers. I wonder if this is just kicking the can further down the road...

...So, after all that, I feel like your approach is the best one. 😅

  1. Use separate named databases for ExtensionScriptingStore and ExtensionPermissions.
  2. In a follow-up, disallow opening multiple handles to the same database, like you suggested in bug 1805427, comment 26. This would diverge from LMDB, but it doesn't sound like we're planning to ever turn on rkv's LMDB backend, anyway? It's also the simplest option, and leaves debugging breadcrumbs for other consumers.

WDYT?

Flags: needinfo?(najiang)
Flags: needinfo?(jrediger)

Hi Lina,
thank you so much for gather back some context about the historical reasons why we got to the current state.

Last week I was briefly discussing this issue with William, and we got an idea that we would like to pass by you, as well as Jan-Erik or Nan:

  • we did notice that the path name we give to kvstore is actually used as a directory where the actual data file is being stored and so we were wondering what if we would use a separate data file for each of the databases? would that be doable as a way to make the usage of kvstore for multiple database stored in the same store path on the rkv or kvstore side?
Flags: needinfo?(lina)

we did notice that the path name we give to kvstore is actually used as a directory where the actual data file is being stored and so we were wondering what if we would use a separate data file for each of the databases? would that be doable as a way to make the usage of kvstore for multiple database stored in the same store path on the rkv or kvstore side?

Ooh, good question! I think only Extension Storage and Glean take advantage of multiple databases in the same environment, and neither uses cross-database transactions. That said, I'm wary of using separate files, because:

  • It's a departure from LMDB's semantics, where a single transaction is atomic across all databases in an environment.
  • We'd need to change the transaction implementation to keep track of which databases were modified, so that we know which files we need to save.
  • For transactions that affect multiple databases, we'd need to do extra work to avoid partial writes. (Or we could fail transactions that try to update multiple databases—but we'd still need to add code to track that, and then it's a backward-incompatible change for other consumers).
  • We'd need to decide if we should load the contents of all the files in memory when the environment is opened (that is, at the Rkv::new call) , or just when the database is opened (at the open_single call)?
  • What happens if file corruption only affects some of the database files in the directory? Do we move the entire directory aside, or do we only surface the corruption error when the caller tries to access the corrupt database? I'm not sure the latter is backward-compatible; Glean (and bug 1645907 😊) currently recover from FileInvalid when the environment is created, but not when an individual database in that environment fails to open.
  • Files and filesystems are surprisingly hard, even in the happy paths! These two articles by Dan Luu definitely made an impression on me: https://danluu.com/deconstruct-files/, https://danluu.com/file-consistency/

I like Dan Luu's conclusion in his first post:

I'm not saying you should never use files. There is a tradeoff here. But if you have an application where you'd like to reduce the rate of data corruption, considering using a database to store data instead of using files.

...Which, to bring it back to the beginning, is why we considered LMDB at first—history doesn't repeat, but it does rhyme! 😅 I think we considered having an SQLite backend for rkv at one time, too, but punted on it because we wanted to use LMDB in the future, and safe mode wasn't meant to be permanent. It's a bit outside the scope of this particular bug, but I wonder if it's worth revisiting that decision now, as an alternative to expanding safe mode...

Flags: needinfo?(lina)

I think Lina has covered most of the details & quirks about rkv (such a good memory she has ;) ).

To add a little more context about rkv/kvstore:

  • rkv used to use LMDB as the only backend, which not only supported multi-read/write accesses within the same process, it also supported multi-process write/read using mmap and MVCC with a lock file to serialize multiple writers (writers don't block readers though)
  • Unfortunately, we encountered various crash reports of LMDB in the past in the prerelease channels, which prevented us from enabling the LDMB backend of rkv in the release. We couldn't address all those crashes that's why the "safe-mode" backend was added to rkv by Victor Porof a few years ago
  • the safe-mode backend differs from the LMDB counterpart in its support of MVCC. Like Lina mentioned above, the memory view of the environment (i.e. the flat file) is protected solely by a RwLock, meaning it no longer supports multi-process read/write transactions. For single-process transactions, it still might run into race conditions as Lina described above

I think there are several inherent flaws (both in design and implementation) with rkv's safe-mode. Addressing all those might take some effort. Giving the LMDB backend (there have been 6 new releases since we last updated it for rkv) another shot might be worth trying as it supports all of our use cases out of the box.

Flags: needinfo?(najiang)

(In reply to Lina Butler [:lina] from comment #2)

It's been a really long time since I've looked at rkv and kvstore, but, if I'm understanding right, there are a few things going on here: LMDB thinks about databases one way; rkv abstracts over that and also provides a "safe mode" that's backed by flat files instead of LMDB; and kvstore wraps rkv's API with a new set of semantics:

First of thanks, Lina, for that summary.
I did plan to re-review rkv's code base and do a similar thing in the coming weeks. This saved me some trouble for sure.

  1. Use separate named databases for ExtensionScriptingStore and ExtensionPermissions.
  2. In a follow-up, disallow opening multiple handles to the same database, like you suggested in bug 1805427, comment 26. This would diverge from LMDB, but it doesn't sound like we're planning to ever turn on rkv's LMDB backend, anyway? It's also the simplest option, and leaves debugging breadcrumbs for other consumers.

WDYT?

Fully agree with 1. It's what we can do now, rather cheap and rather quick and would solve the main issue we saw.
2 sounds doable, yes.
We should also call out these safe-mode differences more clearly. I'm not confident we should encourage new consumers using rkv at the moment.

Which, to bring it back to the beginning, is why we considered LMDB at first—history doesn't repeat, but it does rhyme! 😅 I think we considered having an SQLite backend for rkv at one time, too, but punted on it because we wanted to use LMDB in the future, and safe mode wasn't meant to be permanent. It's a bit outside the scope of this particular bug, but I wonder if it's worth revisiting that decision now, as an alternative to expanding safe mode...

I experimented around with sqlite for Glean a while ago.
If we would ditch safe-mode it would be easier for us to go straight to SQLite rather than using rkv as an abstraction above it though.
That's not to say rkv wouldn't benefit from an SQLite mode for those that are still better served by the key-value semantics.

(In reply to Nan Jiang [:nanj] from comment #5)

[...]
I think there are several inherent flaws (both in design and implementation) with rkv's safe-mode. Addressing all those might take some effort. Giving the LMDB backend (there have been 6 new releases since we last updated it for rkv) another shot might be worth trying as it supports all of our use cases out of the box.

I agree here. safe-mode is not providing the guarantess one would expect.
And yet, at least for Glean, it is the trusted data store.
That is also the reason why I did take up maintenance on rkv (really just the bare minimum: access and minimal fixes).
I'm nowhere near to being able to maintain a full database implementation. "Just" using a trusted datastore would therefore always be my choice.

Bug 1780370 is actually about removing LMDB fully with rkv removing LMDB support as well.
The work on the Glean side has been prepared but not deployed.
The work on the rkv side hasn't started to my knowledge.

Glean is not using LMDB anymore in recent releases and we long migrated the majority of users to safe-mode.
Any change to rkv would mean careful work to ensure we're not losing data.

Flags: needinfo?(jrediger)
Attachment #9310808 - Attachment description: Bug 1807010 - Migrate ExtensionPermissions kvstore to a separate file path. r?willdurand! → Bug 1807010 - Migrate ExtensionPermissions kvstore to a separate file path. r?robwu!
Pushed by luca.greco@alcacoop.it: https://hg.mozilla.org/integration/autoland/rev/fe953289d892 Migrate ExtensionPermissions kvstore to a separate file path. r=willdurand,robwu
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch
See Also: → 1821056
See Also: → 1919530
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: