Closed Bug 609286 Opened 14 years ago Closed 14 years ago

Detect a corrupt Places.sqlite and replace the database on next startup

Categories

(Toolkit :: Places, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla2.0b10
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: mak, Assigned: mak)

References

Details

(Keywords: regression, Whiteboard: [fixed-in-places][softblocker])

Attachments

(4 files, 2 obsolete files)

Lately we got a good amount of reports of corrupt databases, in some cases these were corrupt from a long time, a user sent me a database that was corrupt from 6 months. This happened because newer SQLite uses the indices more and thus is more picky regarding their corruption. We don't notify the user about the issue, nor we try to replace the database. Beltzner said he would be ok with adding some late string changes to provide some sort of UI to notify the user his database is corrupt and that we can replace it with a new working one. We need to define the interaction though. This is what we can do: - on idle (frequency to be defined) we can check the database - if the database is corrupt we could fire a notification, nsBrowserGlue could catch it and show some sort of UI notification to the user. - what we can do today is offer the user to replace the database with a new one on next application start (force a immediate restart too?). - notice we will restore bookmarks from a JSON backup but we don't have a history backup, thus history gets lost. We could try to do an emergency backup but it could fail. So we should be clear that all or a part of history will be lost. Today we have a red bar that notifies the user if the database is locked (See https://bug414715.bugzilla.mozilla.org/attachment.cgi?id=352841), not sure if we want to follow the same UI design here.
Whiteboard: [needs UX directions]
cc-ing philiKON too, I'm curious about how Sync handles the fact a user replaces his database with a new one (without history). Is history synced from network or viceversa?
(In reply to comment #1) > cc-ing philiKON too, I'm curious about how Sync handles the fact a user > replaces his database with a new one (without history). Is history synced from > network or viceversa? I guess replacing places.sqlite will not trigger any notifications, and Sync only ever syncs deltas, so Sync will happily apply new history items and bookmarks, but it won't sync the old ones down. To do that the user will have to manually choose "Reset Sync" from the Sync prefs. It will also not sync the effective history wipe up. We want Sync to wipe the server + and other clients in the future whenever the user manually wipes their history locally (bug 578694). Replacing places.sqlite is an effective local wipe, but my guess is it's hard to detect. Or is it?
(In reply to comment #2) > It will also not sync the effective history wipe up. We want Sync to wipe the > server + and other clients in the future whenever the user manually wipes their > history locally (bug 578694). Replacing places.sqlite is an effective local > wipe, but my guess is it's hard to detect. Or is it? Well there is a way to detect it, history.databaseStatus tells you if the db has been created or replaced in that session (could be fancy to detect it when we start firefox.exe twice in some startup path, I guess this still happens sometimes but I don't recall when). What we care about in this case is that user has some easy way to recollect history from Sync (can we present them a button that will do that?) and that history is not wiped if he did not do that explicitly, but we did it for him because the db was corrupt. Also, if we figure out a way to create an emergency history backup you will most likely get duped history entries when we reinsert them, but I guess that's not an issue?
Do we have any sense of what percent of Firefox users are impacted by this issue? Initial thoughts are that we should attempt to create the history backup without the user's involvement, and if that actually works then we should continue to move forward and replace the database without surfacing that anything went wrong. In the event that the history backup did not work, then the user is going to have to make a decision, although I'm not really sure how we are supposed to frame it. What is the impact of having a corrupted database? Is there any situation that the user would logically choose to keep their history and continue to have a corrupt database?
(In reply to comment #4) > Do we have any sense of what percent of Firefox users are impacted by this > issue? in the last 2 weeks I've heard of about 10 cases. But I have no idea how to multiply that by a good ratio. They are far more than what I heard in the previous 2 months though. > Initial thoughts are that we should attempt to create the history backup > without the user's involvement, and if that actually works then we should > continue to move forward and replace the database without surfacing that > anything went wrong. We could try, if the database is corrupt though, the possibility that history is corrupt is pretty high, since it's the largest table. That's why I'm not much optimistic. Ideally we should backup history before the bad happening, but doing so involves privacy. A history backup is static, while history changes pretty much in short times. Invalidating or syncing the backup would be hard and not really performant and we would not want to store those information for more than 1 day for privacy concerns. > In the event that the history backup did not work, then the user is going to > have to make a decision, although I'm not really sure how we are supposed to > frame it. What is the impact of having a corrupted database? In most of the last cases I've heard about, history menu comes up empty, bookmarks could or could not work. locationbar usually does not work. Is there any > situation that the user would logically choose to keep their history and > continue to have a corrupt database? At first glance I'd say not. There could be some edge case, where a part of history works, but if we try to read all history we fail. Our issue is that we can't tell which entries are good and which are not, thus we will just try to read all the table row-by-row. A query like "last 10 visits" could hit a working part of the table. These are edge cases though, I don't think we want to design around them, most likely no user could work with a corrupt history file.
We could even evaluate to save crypted backups, that would require us to ask another "master password" to the user. But there is already one for Sync and one to crypt passwords and both are optional functionality. That would most likely be seen as an annoyance. Would be better if we'd ask to create a single key on install and use that one for all the crypto needs around browser.
>most likely no user could work with a corrupt history file. Then there isn't really any value in bringing this up as an optional action with a question and details. It sounds like we should just begin to correct the problem. How long will the recovery process take once we start it?
(In reply to comment #7) > >most likely no user could work with a corrupt history file. > > Then there isn't really any value in bringing this up as an optional action > with a question and details. It sounds like we should just begin to correct > the problem. How long will the recovery process take once we start it? the time is driven by the bookmarks.json restore, that depends on size of it (thus number of bookmarks we have to restore). If we add a tentative history backup the time increases. Could even be some (up to 5 I'd say in worst cases) minutes for large sets. I'm mostly worried by the hated "a script timed out" popup (that I still don't understand why we don't disable for chrome code in releases, how we did in 3.5). We could probably temporarily flip the pref while doing the work. So, I think we have to notify that history could be lost and that the process could take up to some minutes (And this is probably where the user will get annoyed). Since we can just do this on startup, we could give the possibility to restart the browser immediately or later to allow finishing what he was trying to do on the web.
Here's my current thoughts based on information above: Since users can't work with a corrupt history file, this action shouldn't be optional. It also isn't in our interest to draw attention to the fact that we need to correct an error if we don't need to. So we could silently begin rebuilding the database on close. If the user launches Firefox again (they are going to think they are opening it, not knowing that we still have a process runnning), then in that case we need to display a dialog explaining that we are performing maintenance, and give them the option to cancel it and launch immediately. If the user shuts down their computer before we are able to silently finish, then we'll need to clean up the attempt the next time we launch and try again the next time they close. If that happens a certain number of times (the user only closes Firefox directly before shutting their computer down), we may want to then decide to display the window showing maintenance progress on close as an indication that they might want to wait before powering down.
We're over-engineering and designing for problems that seem to be an edge case - both the corruption itself, and an import taking minutes. Also, the proposed solutions require a bunch of changes to code that would be very risky to take at this point (old profile manager code that's untested, for example). The immediate approach should fix the common case, allowing the user to get about their business: If we detect a corrupt places.sqlite at startup, we delete it, create a new one, import the bookmarks backup and move on. Done. While lossy, this is silent and visibly effective - the user's bookmarks are there on startup. Worst that happens is the user is annoyed at a mildly longer startup (in the common case). We should file followup bugs for things like: * backup/restore of history * detecting when an import takes too long, and notifying the user * detecting recurring corruption and notifying the user * moving it to shutdown (but doing things at shutdown bites us everytime, is a known dead-end IMO)
>If we detect a corrupt places.sqlite at startup, we >delete it, create a new one, import the bookmarks backup and move on. Done. > >While lossy, this is silent and visibly effective yep, works for me
Assignee: nobody → mak77
Status: NEW → ASSIGNED
Keywords: late-l10n, uiwanted
Whiteboard: [needs UX directions]
Attached patch part 1: revise PlacesDBUtils (obsolete) — Splinter Review
This makes PlacesDBUtils being able to run various operations, while preserving backwards compatibility. It will also try to run async and in transactions, as far as possible (stats is still sync due to a tables loop, and I'm not sure I want to write complicate code to make it async, also because it's not called automatically). I still have to add nice javadocs. If the database is found corrupt, a pref is setup, so that Places can replace the database on next startup. Next parts should do the database replacement and try to implement a way to backup and restore history, as far as possible. As it is, this runs on idle-daily, but I think running it once a week could be enough, so I could change that, thoughts?
Asking blocking final, even if we never handled these cases correctly, SQLite 3.7.x and newer versions uses indices more than before to optimize queries, thus we are seeing an increased number of reported corruptions after the upgrade from 3.6.x to 3.7.x. We will most likely see even more cases with the final release, giving user a working browser sounds like pretty much important at this point.
blocking2.0: --- → ?
blocking2.0: ? → betaN+
Keywords: regression
I have had corrupted places.sqlite twice now (SeaMonkey). To say the least, it is disconcerting. To me, the loss of history, locationbar, ... is a big loss. (places.sqlite, since it contains so much, really needs automated backups. Though since it is so large ... Ah the good 'ol days of html & dat files.) I've been using the same Profiles, jumping between Branch releases & Trunk lately. (Trunk has been extremely crash prone for me of late.) Twice I have "resurrected" corrupted places.sqlite (or at least I haven't noticed problems my "fixed" copies). Thankfully places.sqlite.corrupt was created (backed up I suppose) rather then just outright deleted/recreated. I happened to have about a month old backup of (each) places.sqlite. Comparing the files, I notice the file "headers" were different (& obviously other content too, but the content looked to be there, just inaccessible). Having no clue, figured I'd just hack it, & see what happens. What I did was to copy the first $C0$ bytes from the backups, replacing the same in the .corrupt. ($C0$ just looked like a good chuck of bytes to work with.) Screenshot: http://i56.tinypic.com/2mgt44w.jpg
(In reply to comment #15) > (places.sqlite, since it contains so much, really needs automated backups. > Though since it is so large ... Ah the good 'ol days of html & dat files.) bookmarks have backup, the problem with history is not just backups (we could do them), but mostly privacy. if you remove stuff from history it could persist in your backup, and you can then say goodbye to any privacy. > I've been using the same Profiles, jumping between Branch releases & Trunk > lately. (Trunk has been extremely crash prone for me of late.) > > Twice I have "resurrected" corrupted places.sqlite (or at least I haven't > noticed problems my "fixed" copies). there is a known issue where if you downgrade to nightlies older than a couple weeks ago it will mark the db as corrupt even if it's not. This is known due to recent changes.
(In reply to comment #16) > there is a known issue where if you downgrade to nightlies older than a couple > weeks ago it will mark the db as corrupt even if it's not. This is known due to > recent changes. He's downgrading to Gecko 1.9.1 which according to sdwilsh is incapable of opening the 2.0 places.sqlite file anyway.
ugh, yes, 1.9.1 is unsopported due to it not being upgraded to sqlite 3.7.x
ok, first part is ready for review. I had to change most of the logic in the module, but all the external interface is backwards compatible (indeed I did not have to do big changes to tests). The new logic allows to run a single maintenance task, or a list of tasks. Unfortunately we are going to kill the Error console, but this can easily be used by a restartless add-on (indeed we should do one). I also changed so that we run maintenance once a week rather than daily, most of the un-coherences from the past were fixed, so there is no need to be so much picky today. We can eventually tweak this interval. In case of detected (and unfixable) corruption, this will set a places.database.replaceOnStartup bool pref. Next part will check this pref on startup and consider the database corrupt if it's set. A third part (or most likely a separate bug) should handle a try to recover history, this part imo doesn't block, but would be nice to have, thus I suggest having it in a follow-up to be approved if ready for the release.
Attachment #495240 - Attachment is obsolete: true
Attachment #499505 - Flags: review?(dietrich)
Whiteboard: [has patch][needs review dietrich]
Summary: Notify the user about corrupt Places.sqlite and replace the database → Detect a corrupt Places.sqlite and replace the database on next startup
Comment on attachment 499505 [details] [diff] [review] part 1: revise PlacesDBUtils logic r=me, just a few nits: >+ /** >+ * Forces a full refres of Places views. refresh >+/** >+ * LIFO tasks queue. >+ * >+ * @param aTasks >+ * Array of tasks or another Tasks object. >+ */ >+function Tasks(aTasks) >+{ >+ if (Array.isArray(aTasks)) { >+ this.list = aTasks.slice(0, aTasks.length); >+ } >+ else if ("list" in aTasks) { >+ this.list = aTasks.list.slice(0, aTasks.length); >+ this.callback = aTasks.callback; >+ this.scope = aTasks.scope; >+ } throw on a final else? >+ /** >+ * Returns array of log messages. >+ */ >+ getLog: function T_getLog() >+ { >+ let cpLog = this._log.slice(0, this._log.length); >+ return cpLog; >+ } why not return directly? >+} >diff --git a/toolkit/components/places/tests/unit/test_preventive_maintenance.js b/toolkit/components/places/tests/unit/test_preventive_maintenance.js >--- a/toolkit/components/places/tests/unit/test_preventive_maintenance.js >+++ b/toolkit/components/places/tests/unit/test_preventive_maintenance.js >@@ -47,25 +47,13 @@ Components.utils.import("resource://gre/ > > const FINISHED_MAINTANANCE_NOTIFICATION_TOPIC = "places-maintenance-finished"; nit: maintenance
Attachment #499505 - Flags: review?(dietrich) → review+
(In reply to comment #20) > >+function Tasks(aTasks) > >+{ > >+ if (Array.isArray(aTasks)) { > >+ this.list = aTasks.slice(0, aTasks.length); > >+ } > >+ else if ("list" in aTasks) { > >+ this.list = aTasks.list.slice(0, aTasks.length); > >+ this.callback = aTasks.callback; > >+ this.scope = aTasks.scope; > >+ } > > throw on a final else? ehr, actually aTasks is optional, I'll fix the javadoc to reflect it. > >+ /** > >+ * Returns array of log messages. > >+ */ > >+ getLog: function T_getLog() > >+ { > >+ let cpLog = this._log.slice(0, this._log.length); > >+ return cpLog; > >+ } > > why not return directly? I just don't want the caller to be able to play with my internal logs array. Ideally this can easily be used by a extension that can add whatever it wants before pushing out messages.
Fixed comments, I also simplified a bit the interface of the Tasks object, to make it more like a stack (using pop and push names) since it's clearer this way.
Attachment #499505 - Attachment is obsolete: true
And this is part 2, if the replaceOnStartup pref is set, we move the db to places.sqlite.corrupt and create a new valid one. nsBrowserGlue restores bookmarks, but history is lost, this is basically automating the process to fix a corrupt db. There is no difference regarding history. The advantage is that we can detect and try to fix corruption earlier rather than when it's too late (most cases can be fixed by reindex, and this can prevent worse corruptions sometimes). As previously stated, even if restoring history is a must-have, I'd not block on it, thus I'll file a follow-up for that part.
Attachment #501622 - Flags: review?(dietrich)
Blocks: 623489
Comment on attachment 501622 [details] [diff] [review] part 2: replace the database on next startup looks fine, r=me
Attachment #501622 - Flags: review?(dietrich) → review+
Whiteboard: [has patch][needs review dietrich] → [has patch][has review]
Flags: in-testsuite+
Whiteboard: [has patch][has review] → [fixed-in-places]
When will this merge to trunk?
Whiteboard: [fixed-in-places] → [fixed-in-places][soft blocker]
(In reply to comment #26) > When will this merge to trunk? I intend to merge later this week.
Whiteboard: [fixed-in-places][soft blocker] → [fixed-in-places][softblocker]
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.0b10
If I run the test contributed in Part 2 more than once on trunk it succeeds first time and fails on subsequent runs. Will attach nspr-logs from runs #1 and #2.
Attached file Log from first testrun
A second run immediately after the first one using make SOLO_FILE="test_database_replaceOnStartup.js" -C toolkit/components/places/tests check-one If I do hg upd -C default I can run the test successfully once again, then it fails on subsequent runs again.
file a separate bug please.
(Just thought I should mention this since I stumbled across it when debugging something else. My apologies if it is intended behaviour... )
Depends on: 649211
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: