Closed Bug 1869060 Opened 5 months ago Closed 3 months ago

Consider exposing SQLite's Online Backup API through mozIStorageAsyncConnection / Sqlite.sys.mjs

Categories

(Toolkit :: Storage, enhancement)


Tracking

()

RESOLVED FIXED
124 Branch
Tracking Status
firefox124 --- fixed

People

(Reporter: mconley, Assigned: mconley)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

I spent a good chunk of this year investigating and building things to make it easier for users to migrate from their current device to a newer device. We've been calling this the "device migration" project. This year we mainly focused on getting users to learn about Sync and using a Mozilla account as a device migration pathway.

That worked to some extent, but we hope to go a bit broader now.

One thing I've been tasked with recently is to look at how we might be able to let users opt-in to having Firefox back up parts of the user profile (the Places database, favicons, logins, form history, cookies, etc) to a single file somewhere on the disk, and make it possible for Firefox to detect and recover from that file during new profile onboarding. The theory is that if we put this file in a place that's likely to survive device migration (the user's Documents folder on Windows, for example), then this will make it easier for them to have their data be ready for them once they get their new device set up and get Firefox installed.

One thing I experimented with was re-using the Sync engines to sync various resources to a file rather than to the Sync server. This works to some extent, but has the drawback of needing to query for all of the user's bookmarks in order to write the backup. As the Sync engines are in JS outside of Workers, even though the query is off the main thread, handling the results would be on the main thread - even if it's just serializing those results to some other thread to write out the backup archive.

Having a conversation with markh and jrmuizel yesterday, I realized that it might be possible to copy a live-running SQLite database like Places without requiring much in the way of main-thread processing or query blocking. Specifically, it seems like SQLite has a built-in capability for exactly this kind of thing: https://www.sqlite.org/backup.html

I've got a strawman patch here that adds that capability to mozIStorageAsyncConnection and Sqlite.sys.mjs, and from my limited testing, it does appear to successfully copy a Places database while Firefox is still attached to it.

The patch as-written might not be acceptable, but I'm curious to know if exposing this backup API would be something the storage folk would support. Or, alternatively, if there are other potential solutions for doing backup-to-file that we might want to consider instead.

Hey mak,

I know there are a few things immediately wrong with this strawman patch, but I wanted to build it anyway just to see if it'd work.

Building this, I was able to create a new database called test-backup.sqlite in the profile directory using the following commands from the Browser Console:

let file = await IOUtils.getFile(PathUtils.profileDir, "test-backup.sqlite")
let db = await PlacesUtils.promiseDBConnection();
await db.backupDatabaseFileAsync(file);

So here's what I immediately notice about the patch:

  1. It uses a Promise as a return value from mozIStorageAsyncConnection, whereas the rest of that interface tends to use callbacks - so it might make sense to follow that convention there, and have Sqlite.sys.mjs handle the callback/Promise glue.
  2. The created backup file does appear to contain things like bookmarks and history from the originating database - however, it's also significantly smaller in file size than the original. I'm only copying the "main" database. To get a proper backup, would I need to copy any other databases within the file? Like, do we tend to stuff more than 1 database in an SQLite file?
  3. I'm doing the work on a low-priority background thread rather than the mozIStorageAsyncConnection async execution thread. I chose that approach because the Online Backup API asks callers to yield for a window of time after copying some number of pages so that queries can continue to be made on the database. The example I cribbed from uses sqlite3_sleep at 250ms for every 5 pages copied (see the sketch after this list). I figured if I did this on the async execution thread, I'd actually be blocking any other queries from being made, since the thread would either be doing the backing up or sleeping. Is it a bad idea to work with the SQLite db handle on something like a low-priority background thread like this?
  4. Reading through the documentation at https://www.sqlite.org/backup.html and https://www.sqlite.org/c3ref/backup_finish.html, it seems like we're configured in such a way that the backup would eventually finish and never be stuck in a situation where it'd need to continually restart due to writes. I think this is because each connection is sharing the same handle... is that true, and am I interpreting that correctly?
  5. Any other gotchas that I'm not considering here?
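
For reference, here's roughly the loop I'm describing in point 3, adapted from the example in the SQLite documentation. It's just a sketch, not the strawman patch itself - error handling is minimal, and the 5 pages / 250ms numbers are simply the ones from the docs:

// Incremental Online Backup loop, modeled on https://www.sqlite.org/backup.html.
// Sketch only; not the actual patch.
#include <sqlite3.h>

int BackupDatabase(sqlite3* aSource, const char* aDestPath) {
  sqlite3* dest = nullptr;
  int rc = sqlite3_open(aDestPath, &dest);
  if (rc == SQLITE_OK) {
    // Copy the "main" database of the source connection into the "main"
    // database of the destination connection.
    sqlite3_backup* backup = sqlite3_backup_init(dest, "main", aSource, "main");
    if (backup) {
      do {
        // Copy up to 5 pages, then yield for 250ms so other connections
        // aren't starved while the backup runs.
        rc = sqlite3_backup_step(backup, 5);
        if (rc == SQLITE_OK || rc == SQLITE_BUSY || rc == SQLITE_LOCKED) {
          sqlite3_sleep(250);
        }
      } while (rc == SQLITE_OK || rc == SQLITE_BUSY || rc == SQLITE_LOCKED);
      // Releases the backup object; any error from the stepping is reflected
      // in the destination connection's error code below.
      sqlite3_backup_finish(backup);
    }
    rc = sqlite3_errcode(dest);
  }
  sqlite3_close(dest);
  return rc;
}
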
Flags: needinfo?(mak)

I don't have any problem with exposing the API, but to somewhat summarize/recap my thoughts from the Workers & Storage channel on Thurs, Dec 7th (for anyone looking for scrollback), I think the higher-level use case might be better met by a mechanism that provides some level of filtering and churn reduction. That said, it would likely make sense to expose the API for prototyping/experimentation, which would guide any subsequent performance/scaling improvements if the prototype seems viable.

I may need more time to think about the technicalities. But let me throw out some thoughts.

(In reply to Mike Conley (:mconley) (:⚙️) from comment #2)

  1. It uses a Promise as a return value from mozIStorageAsyncConnection, whereas the rest of that interface tends to use callbacks

The main reason is that callbacks used to be easier to handle on the cpp side, and Storage has quite a few cpp consumers.
I'm not sure how ergonomic cpp Promises are today; I think they need a JS context and some boilerplate. Sqlite.sys.mjs is pretty much our promise-happy wrapper around Storage, so the API remains easy to use from both languages. What's the reason for the API to return a promise, and how does that help us?

  2. The created backup file does appear to contain things like bookmarks and history from the originating database - however, it's also significantly smaller in file size than the original.

This may or may not be a problem. If you do a data comparison are contents identical?
The size difference may just be due to Chunked Growth (we artificially pump the size of the db to avoid disk fragmentation of future insertions).
Try setting places.database.growthIncrementKiB to 0 and see if you get a db of the same size.

  3. [...] I figured if I did this on the async execution thread, I'd actually be blocking any other queries from being made, since the thread would either be doing the backing up or sleeping. Is it a bad idea to work with the SQLite db handle on something like a low-priority background thread like this?

I don't know off-hand and must think more about it, but I think it would be better to open a new handle. Reusing the same handle on a third thread makes me a bit nervous, since some connections, like Places, already use two.

  4. Reading through the documentation at https://www.sqlite.org/backup.html and https://www.sqlite.org/c3ref/backup_finish.html, it seems like we're configured in such a way that the backup would eventually finish and never be stuck in a situation where it'd need to continually restart

I think it's sufficient for the backup handle to be in the same process as the writing handle.
Places, due to history, is written often so it's a good candidate to test.

  5. Any other gotchas that I'm not considering here?
  1. How do we plan to handle corruption if it's detected during a backup operation? (I assume the backup API would return SQLITE_CORRUPT, so detection should be possible.) Would we just annotate the db and give up trying to back it up? (There is a recover API now, but let's not add too much here.)
  2. How do we ensure a backup has enough time to complete in a meaningful timeframe? I'd like to see some example measurements of the time necessary to back up a large database (let's say 100MiB, to be realistic) on a slow machine with a mechanical disk. Will users with mostly short sessions have sufficient time to do it? Do we have telemetry around users having mostly short sessions (how many users, average session time)?
  3. I couldn't find (yet) how SQLite handles partial backups... if the backup is interrupted, I mean, do they use a tmp folder or a tmp file name?
  4. We surely want telemetry for time to backup, success/failure/interruption, per db telemetryName.

(In reply to Marco Bonardo [:mak] from comment #4)

What's the reason for the API to return a promise, and how does that help us?

Mainly to simplify Sqlite.sys.mjs so that it just does state checking and a passthrough, without having to convert callbacks into Promises. Also, I'd learned how to do C++ Promise stuff in bug 1529276. It is possible to handle DOM Promises from C++, but if there are C++ consumers that would want to use this proposed API, staying consistent with callbacks probably makes sense.

  2. The created backup file does appear to contain things like bookmarks and history from the originating database - however, it's also significantly smaller in file size than the original.

This may or may not be a problem. If you do a data comparison are contents identical?

At first glance, the contents appear identical, but that's just from hand-comparing using DB Browser for SQLite on a simple database. I'll see what sqldiff spits out and report back.

The size difference may just be due to Chunked Growth (we artificially pump the size of the db to avoid disk fragmentation of future insertions).
Try setting places.database.growthIncrementKiB to 0 and see if you get a db of the same size.

I'll try that too!

  3. [...] I figured if I did this on the async execution thread, I'd actually be blocking any other queries from being made, since the thread would either be doing the backing up or sleeping. Is it a bad idea to work with the SQLite db handle on something like a low-priority background thread like this?

I don't know off-hand, must think more about it, but I think it would be better to open a new handle. Reusing the same handle on a third thread, as some connections like Places already use two, makes me a bit nervous.

The only drawback to using a new handle is that any writes to the database that occur on a separate handle will cause the backup process to restart itself on the next "step". Given many frequent writes, we might end up in a case where a backup is never completed. That worries me, as I figure Places is written to during basically every navigation.

  4. Reading through the documentation at https://www.sqlite.org/backup.html and https://www.sqlite.org/c3ref/backup_finish.html, it seems like we're configured in such a way that the backup would eventually finish and never be stuck in a situation where it'd need to continually restart

I think it's sufficient for the backup handle to be in the same process as the writing handle.
Places, due to history, is written often so it's a good candidate to test.

The documentation here reads:

If the source database is modified by an external process or via a database connection other than the one being used by the backup operation, then the backup will be automatically restarted by the next call to sqlite3_backup_step(). If the source database is modified using the same database connection as is used by the backup operation, then the backup database is automatically updated at the same time.

So, you're right, if we're in the same process (but different thread), we're okay - but if we use a different database connection (which I assume is synonymous with handle?) then we're in the situation where the backup will restart after every write.

  1. How do we plan to handle corruption if it's detected during a backup operation? (I assume the backup API would return SQLITE_CORRUPT, so detection should be possible.) Would we just annotate the db and give up trying to back it up? (There is a recover API now, but let's not add too much here.)

If the database is corrupt, yes, it doesn't make much sense to back it up - unless we have ways of attempting to recover from corrupt databases?

  2. How do we ensure a backup has enough time to complete in a meaningful timeframe? I'd like to see some example measurements of the time necessary to back up a large database (let's say 100MiB, to be realistic) on a slow machine with a mechanical disk. Will users with mostly short sessions have sufficient time to do it? Do we have telemetry around users having mostly short sessions (how many users, average session time)?

Good question, and I think this depends on whether we use the same handle that we use for writing to Places or not; if we don't, the backup process will restart at every write until it's completed. We certainly have Telemetry on session length. We also get to control how many pages are backed up at each step, so we can tune between DB responsiveness and the probability of completing a backup.

  3. I couldn't find (yet) how SQLite handles partial backups... if the backup is interrupted, I mean, do they use a tmp folder or a tmp file name?

It looks to me like the passed-in file is created right away, as soon as the backup is initiated. It looks like .wal files are also created. If a backup is halfway done or aborted, the files remain, and I think it's the responsibility of the caller of the API to clean them up.

  4. We surely want telemetry for time to backup, success/failure/interruption, per db telemetryName.

Yes, absolutely.

This may or may not be a problem. If you do a data comparison are contents identical?

sqldiff reports that there's no difference between the source database and the backup, so I suspect the size difference is down to the chunked growth, as you mentioned.

(In reply to Mike Conley (:mconley) (:⚙️) from comment #5)

If the source database is modified by an external process or via a database connection other than the one being used by the backup operation

I see, I had visually skipped the "database connection other than" part. Then we likely don't want to use a different handle because we'd just restart too often.
At this point we should better evaluate whether there may be concerns with using the handle on a third thread; there may be some special handling to do on shutdown (hold the shutdown mutex and ensure we don't continue the backup after closing the connection). Or evaluate whether the cost of doing it on the helper thread is acceptable.

It looks to me like the passed-in file is created right away, as soon as the backup is initiated. It looks like .wal files are also created. If a backup is halfway done or aborted, the files remain, and I think it's the responsibility of the caller of the API to clean them up.

Then how does a consumer know if that backup is complete or not, supposing the caller crashes in the middle of the backup? It may be worth using a custom file extension (like for downloads) and renaming once the backup is complete.
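
Roughly something like this (a sketch only; the ".part" extension is arbitrary, and BackupDatabase is the illustrative helper from the sketch in comment 2):

// Back up into a ".part" file and only rename it into place once the backup
// has completed successfully, so an interrupted backup never looks finished.
// Sketch only; the names and the BackupDatabase helper are illustrative.
#include <filesystem>
#include <system_error>
#include <sqlite3.h>

int BackupDatabase(sqlite3* aSource, const char* aDestPath);  // from the sketch in comment 2

bool BackupWithAtomicRename(sqlite3* aSource,
                            const std::filesystem::path& aFinalPath) {
  std::filesystem::path partial = aFinalPath;
  partial += ".part";
  // BackupDatabase closes its destination connection before returning, so on
  // success the .part file is a complete standalone database.
  if (BackupDatabase(aSource, partial.string().c_str()) != SQLITE_OK) {
    std::error_code ec;
    std::filesystem::remove(partial, ec);  // Clean up the fragment.
    return false;
  }
  std::error_code ec;
  std::filesystem::rename(partial, aFinalPath, ec);  // Reveal the finished backup.
  return !ec;
}
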

Another concern that I was just thinking about is privacy: if we're just making copies of profile databases somewhere else on the system, then when I'm removing data from my profile, those copies will still have it until a new backup is started. Though recognizing which operations should or should not discard them may not be trivial.

(In reply to Marco Bonardo [:mak] from comment #7)

At this point we should better evaluate whether there may be concerns with using the handle on a third thread; there may be some special handling to do on shutdown (hold the shutdown mutex and ensure we don't continue the backup after closing the connection). Or evaluate whether the cost of doing it on the helper thread is acceptable.

Alright, I'll start reading through the SQLite documentation to see about sharing handles across threads, and what dangers might lurk there. I'm also attempting to test the performance of a backup of a large Places database for a profile hosted on an SD card to simulate the slow disk experience - I figure the card will give us the extreme case.

Then how does a consumer know if that backup is complete or not, supposing the caller crashes in the middle of the backup? It may be worth using a custom file extension (like for downloads) and renaming once the backup is complete.

Yeah, I think it's really up to the caller to manage things, as you describe. What I envision is that we create our backup database in a temporary folder on the system, and upon successful completion, we then compress each of the backed-up databases into an archive file with a manifest.

Another concern that I was just thinking about is privacy: if we're just making copies of profile databases somewhere else on the system, then when I'm removing data from my profile, those copies will still have it until a new backup is started. Though recognizing which operations should or should not discard them may not be trivial.

Yep, asuth brought up the same thing. A very valid concern. The naive approach is to blow away any known backups during any kind of sanitization operation, and then create a new backup immediately after. This may or may not be the user's expectation, and might require us to alter our sanitization settings somewhat. I agree that going through existing backups to try to figure out which ones to destroy is probably a bridge too far. I suspect instead that this might be a user education / UX challenge.

(In reply to Mike Conley (:mconley) (:⚙️) from comment #8)

Alright, I'll start reading through the SQLite documentation to see about sharing handles across threads, and what dangers might lurk there. I'm also attempting to test the performance of a backup of a large Places database for a profile hosted on an SD card to simulate the slow disk experience - I figure the card will give us the extreme case.

Reading https://www.sqlite.org/threadsafe.html and going through searchfox, it seems that the SQLite instance used by mozIStorageService doesn't declare a threading mode at build, run or connection time. The documentation says that in that case, we're falling back to serialized mode, and that means we can use handles between threads (access through those shared handles is serialized using mutexes).

You're right that we'll need to handle the shutdown case. It seems that sqlite3_backup_remaining and sqlite3_backup_pagecount can be used to get a sense of progress through a backup, so perhaps we could make some kind of runtime judgement call if we're shutting down during a backup. If we're past a particular threshold, perhaps we let the backup complete or fail before shutting down; otherwise we cancel the backup and blow away any fragments that were created, in the hope that we can back up next time.
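
Something like this is what I have in mind (sketch only; the 90% threshold is arbitrary, and these counters are only meaningful once at least one sqlite3_backup_step has run):

// Decide at shutdown whether to let an in-progress backup run to completion
// or abandon it and clean up. Sketch only; the threshold is arbitrary.
#include <sqlite3.h>

bool ShouldFinishBackupOnShutdown(sqlite3_backup* aBackup) {
  int totalPages = sqlite3_backup_pagecount(aBackup);
  int remainingPages = sqlite3_backup_remaining(aBackup);
  if (totalPages <= 0) {
    // No step has run yet; abandon this attempt and retry next session.
    return false;
  }
  double progress = double(totalPages - remainingPages) / double(totalPages);
  // Nearly done: let it finish (or fail) before shutdown. Otherwise cancel
  // the backup and delete any partial files.
  return progress >= 0.9;
}
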

The alternative to using a background thread is to continue using the execution thread of the mozIStorageAsyncConnection, but after each step of the backup, queue a runnable to do the next step after some number of milliseconds. This might actually be better than using the background thread pool, as I believe the execution threads would be blocked attempting to access the database during the backup steps anyhow. We can probably avoid thread contention by just having the connection execution thread do it. From what I remember, I believe we can use nsITimer to do this.
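
A very rough sketch of what I mean (the class and its methods are invented for illustration; the real patch would hang this off the connection's async thread and mozStorage's existing machinery):

// Hypothetical helper that steps an online backup on the connection's async
// execution thread, yielding between steps via a one-shot timer instead of
// sleeping on the thread. Sketch only.
#include "nsCOMPtr.h"
#include "nsComponentManagerUtils.h"  // do_CreateInstance
#include "nsDebug.h"                  // NS_ENSURE_SUCCESS
#include "nsIEventTarget.h"
#include "nsITimer.h"
#include <sqlite3.h>

class BackupStepper final : public nsITimerCallback {
 public:
  NS_DECL_THREADSAFE_ISUPPORTS

  BackupStepper(sqlite3_backup* aBackup, nsIEventTarget* aExecutionThread)
      : mBackup(aBackup), mTarget(aExecutionThread) {}

  nsresult Start() {
    nsresult rv;
    mTimer = do_CreateInstance("@mozilla.org/timer;1", &rv);
    NS_ENSURE_SUCCESS(rv, rv);
    // Fire on the connection's execution thread so all SQLite access for this
    // handle stays on a thread mozStorage already knows about.
    mTimer->SetTarget(mTarget);
    return ScheduleNextStep();
  }

  // Called by the timer on the execution thread.
  NS_IMETHOD Notify(nsITimer*) override {
    int rc = sqlite3_backup_step(mBackup, 5);  // Copy a handful of pages.
    if (rc == SQLITE_OK || rc == SQLITE_BUSY || rc == SQLITE_LOCKED) {
      return ScheduleNextStep();  // More pages remain; come back later.
    }
    sqlite3_backup_finish(mBackup);  // SQLITE_DONE or a hard error.
    mBackup = nullptr;
    return NS_OK;
  }

 private:
  ~BackupStepper() = default;

  nsresult ScheduleNextStep() {
    return mTimer->InitWithCallback(this, 250, nsITimer::TYPE_ONE_SHOT);
  }

  sqlite3_backup* mBackup;
  nsCOMPtr<nsIEventTarget> mTarget;
  nsCOMPtr<nsITimer> mTimer;
};

NS_IMPL_ISUPPORTS(BackupStepper, nsITimerCallback)
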

I updated the strawman patch to use the execution thread of the connection, and use an nsITimer to kick off each step. I also switched to using the mozIStorageCompletionCallback pattern rather than a Promise, as this is more consistent with the rest of the storage code.
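
For C++ consumers, the shape would be roughly this (a sketch; the BackupCompleteCallback class and the backup method name are illustrative, and only mozIStorageCompletionCallback is the existing Storage interface):

// Sketch of the completion-callback pattern on the consumer side. The class
// and the backup method name below are illustrative, not the actual patch.
#include "mozIStorageCompletionCallback.h"
#include "nsCOMPtr.h"

class BackupCompleteCallback final : public mozIStorageCompletionCallback {
 public:
  NS_DECL_ISUPPORTS

  // Invoked once the backup has finished or failed.
  NS_IMETHOD Complete(nsresult aStatus, nsISupports* aValue) override {
    if (NS_FAILED(aStatus)) {
      // e.g. clean up partial files and schedule a retry for a later session.
    }
    return NS_OK;
  }

 private:
  ~BackupCompleteCallback() = default;
};

NS_IMPL_ISUPPORTS(BackupCompleteCallback, mozIStorageCompletionCallback)

// Usage (hypothetical method name on mozIStorageAsyncConnection):
//   nsCOMPtr<mozIStorageCompletionCallback> cb = new BackupCompleteCallback();
//   aConnection->BackupToFileAsync(aDestFile, cb);
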

(In reply to Mike Conley (:mconley) (:⚙️) from comment #9)

(In reply to Mike Conley (:mconley) (:⚙️) from comment #8)
Reading https://www.sqlite.org/threadsafe.html and going through searchfox, it seems that the SQLite instance used by mozIStorageService doesn't declare a threading mode at build, run or connection time. The documentation says that in that case, we're falling back to serialized mode, and that means we can use handles between threads (access through those shared handles is serialized using mutexes).

Yes, we use serialized threading, unlike Rusqlite, which uses the multi-thread mode. The reason is that we use SQLite objects from the same connection across multiple threads.
Long term it would be better if we switched to the Rusqlite approach, which I think is more performant. Though, that requires everything using pure sync or pure async connections with a single thread (which means Places must complete its async transition), and it also means it would be better not to introduce another thread here.
I was mostly thinking about our (mozStorage) internal threading checks anyway, rather than SQLite's, as we manage shutdown, not them.

Thank you for the updates. I agree that we could start exposing "something", and then hammer out the gory details with first use cases.

Flags: needinfo?(mak)
Assignee: nobody → mconley
Status: NEW → ASSIGNED

The following patch is waiting for review from an inactive reviewer:

ID Title Author Reviewer Status
D195934 WIP: Bug 1869060 - Strawman SQLite Online Backup API support via mozIStorageAsyncConnection. Do not land. mconley mak: Back Jan 7, 2024

:mconley, could you please find another reviewer?

For more information, please visit BugBot documentation.

Flags: needinfo?(mconley)

Clearing needinfo - the WIP patch is posted for feedback from mak, who I suspect will get to this once he's back from holiday PTO.

Flags: needinfo?(mconley)

Hey mak - what are your thoughts on exposing this API? If you're amenable to it, is this more or less the right approach for it?

As an added wrinkle, this is something we might want to consider backporting to ESR 115 to help with device migration. Do you foresee any technical hurdles with that requirement?

Flags: needinfo?(mak)

Sorry, I'm handling some backlog. I think it's OK; I left some comments on Phab.
I don't see problems with backporting, provided other topics like privacy are handled appropriately.

Flags: needinfo?(mak)

This utility copies an SQLite database, but does so just by performing a file
copy on the database itself. It assumes that there are no open connections on
the underlying database file.

Given that this is only used by Places, and that a later patch in this stack
adds a database backup utility that works even if there are open connections,
we can move this old utility out from mozIStorageService and into
a dedicated Places helper instead.

Attachment #9367728 - Attachment description: WIP: Bug 1869060 - Strawman SQLite Online Backup API support via mozIStorageAsyncConnection. Do not land. → Bug 1869060 - Add SQLite Online Backup API support via mozIStorageAsyncConnection. r?mak!
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c701c7c32260
Move old backupDatabaseFile utility out from mozIStorageService to a Places helper. r=mak,places-reviewers
https://hg.mozilla.org/integration/autoland/rev/43a95c9c6fd4
Add SQLite Online Backup API support via mozIStorageAsyncConnection. r=mak
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 124 Branch
Blocks: 1881542
Blocks: 1883052
Blocks: 1891141