Closed Bug 1394845 Opened 7 years ago Closed 6 years ago

places.sqlite gets corrupt in Firefox v55

Categories

(Toolkit :: Places, defect)

55 Branch
defect
Not set
normal

Tracking

()

VERIFIED INCOMPLETE
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- wontfix
firefox56 + fix-optional
firefox57 + fix-optional

People

(Reporter: andrei.faber, Unassigned)

Details

(Keywords: dataloss)

User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0
Build ID: 20170824053622

Steps to reproduce:

Upgraded to v55


Actual results:

My places.sqlite file was marked as corrupt and all website history disappeared. (The file itself doesn't have errors when I checked it in mysql; there is only a single warning "*** in database main *** Page 24718 is never used"


Expected results:

Nothing.
Component: Untriaged → Places
Keywords: dataloss
Product: Firefox → Toolkit
Did you use Firefox Refresh after you upgraded? Would you be willing to share your corrupt database with one of our engineers to figure out what is wrong with it?
Flags: needinfo?(andrei.faber)
(In reply to Panos Astithas [:past] (56 Regression Engineering Owner) (please ni?) from comment #2)
> Did you use Firefox Refresh after you upgraded? Would you be willing to
> share your corrupt database with one of our engineers to figure out what is
> wrong with it?

1. No
2. No, it's private

I can confirm that the same places.sqlite works perfectly with Firefox v54
Flags: needinfo?(andrei.faber)
fandrei, is it 55.0 or 55.0.3 ?
[Tracking Requested - why for this release]: A dataloss issue in Firefox 55, still needs investigating though.
(In reply to Sylvestre Ledru [:sylvestre] from comment #4)
> fandrei, is it 55.0 or 55.0.3 ?

It's 55.0.3
Were your bookmarks restored? did you only lose history?
Did you ever execute any manual SQL query on this database?
Did you have add-ons that could have done changes to bookmarks and history?

When we migrate the schema, there can be data incoherencies that are unmanageable. Those could even stay asymptomatic for many versions, until a migration touches exactly that data.
That's what I suspect is happening here.

Unfortunately without having the db I can't debug the problem. Unfortunately if your db can't be upgraded and you don't want to lose history, you'll also be stuck on v54, because it's hard to fix something we can't reproduce, and we still don't have an history backup system in place.
Off-hand I don't think this is related to the profile reset problem. It's likely that on every update we have some users whose db is found to be corrupt and replaced, as well as it's likely only a minority of users will file a bug here.

We have an opt-out probe on that tells us how many corruptions we saw from 55 on, and I indeed see it being high on 55 (almost 140k pings), though we don't have a previous value to compare with, nor I can tell how many of those are repeated tries from the same profile. 56 looks much better stats-wise.
> Were your bookmarks restored? did you only lose history?

After the first start of Firefox after upgrade, the bookmarks were empty. However, they appeared after restarting Firefox; they only lost all favicons.

> Did you ever execute any manual SQL query on this database?

No.

> Did you have add-ons that could have done changes to bookmarks and history?

No.
PS I can perform any checks on my places.sqlite file if you have ideas where to look.
Tomorrow I could probably build some SQL queries and ask you to run those and report results.

But first of all, we should understand if "Page 24718 is never used" is a mistake that can be fixed, since otherwise the db will just be corrupt forever. Two things to try (after a backup) is executing a VACUUM on the db, then checking again with PRAGMA integrity_check. Another thing to try is REINDEX.

Once PRAGMA integrity_check returns OK, we can proceeding into checking the data coherence.
Flags: needinfo?(mak77)
> But first of all, we should understand if "Page 24718 is never used" is a
> mistake that can be fixed, since otherwise the db will just be corrupt
> forever. Two things to try (after a backup) is executing a VACUUM on the db,
> then checking again with PRAGMA integrity_check. Another thing to try is
> REINDEX.

Executing vacuum and then PRAGMA integrity_check, getting the following error: "database disk image is malformed: PRAGMA integrity_check".
I see, I'm not surprised that 55 bails out and considers the database corrupt, because it looks really unrecoverable.
You may try to build a new one with the steps outlined here:
https://developer.mozilla.org/en-US/docs/Mozilla/Tech/Places/places.sqlite_Database_Troubleshooting#How_to_(try_to)_recover_from_a_corrupt_places.sqlite
(In reply to Marco Bonardo [::mak] from comment #13)
> I see, I'm not surprised that 55 bails out and considers the database
> corrupt, because it looks really unrecoverable.
> You may try to build a new one with the steps outlined here:
> https://developer.mozilla.org/en-US/docs/Mozilla/Tech/Places/places.
> sqlite_Database_Troubleshooting#How_to_(try_to)_recover_from_a_corrupt_places
> .sqlite

Trying these steps... It sill leaves me with 2 questions:
1. Why pre-v55 Firefox worked with this file and didn't have any problems?
2. Why it didn't show any warnings that the database is severely damaged?
PS playing with user_version haven't helped at all. Any other ideas how to fix the file, or how to export the history from v54 and import in v55?
(In reply to fandrei from comment #14)
> Trying these steps... It sill leaves me with 2 questions:
> 1. Why pre-v55 Firefox worked with this file and didn't have any problems?

55 moves favicons to a new database and then runs a vacuum on the original one because a lot of space can be regained from it. Looks like you have one kind of corruption that gets uncovered exactly by a vacuum operation.
Usually we also vacuum on idle, but maybe in your case maintenance never found a long enough idle to run.

> 2. Why it didn't show any warnings that the database is severely damaged?

Because until the vacuum was executed your corruption was asymptomatic, the times something failed could have been unnoticed or considered glitches.

We currently detect db corruption at 2 times:
1. on startup, but this depends on whether sqlite3_open succeeds. There are a lot of asymptomatic corruptions where open just works, and most of the queries work.
2. on weekly maintenance, that happens on a long idle. Unfortunately for some users idle doesn't seem to ever happen, so we never hit this. We surely can improve in proactive detection.

The only thing that can detect a db corruption properly is PRAGMA integrity_check, but it's an expensive operation we can't run lightly when opening the db. And with a bogus idle there's no big choice.
I filed time ago bug 1125157 to allow consumers to listen to "corrupt" notifications from Sqlite, and we should also integrate (or try to) the document I pointed out as a Firefox functionality.

Firefox did what it could once it was able to detect the situation, it couldn't recover the db, so it just created a new one, and imported bookmarks into it. Ideally it would be a transparent process, if we'd have an history backup (bug 431274).
(In reply to fandrei from comment #15)
> PS playing with user_version haven't helped at all. Any other ideas how to
> fix the file, or how to export the history from v54 and import in v55?

Please pay attention to follow the steps one by one starting from the db you had BEFORE the vacuum. The user_version is just one of the steps. Where does the tutorial fail exactly?
(In reply to Marco Bonardo [::mak] from comment #16)
> 2. on weekly maintenance, that happens on a long idle. Unfortunately for
> some users idle doesn't seem to ever happen, so we never hit this. We surely
> can improve in proactive detection.

A 2 months old backup version of this file exhibits the same error. Apparently, the integrity check was never executed during these 2 months (at least).

> The only thing that can detect a db corruption properly is PRAGMA
> integrity_check, but it's an expensive operation we can't run lightly when
> opening the db.

Is it? It only took 1570 msecs on my PC.
(In reply to Marco Bonardo [::mak] from comment #17)
> Please pay attention to follow the steps one by one starting from the db you
> had BEFORE the vacuum. The user_version is just one of the steps. Where does
> the tutorial fail exactly?

Those instructions didn't fail. However, Firefox lost all the history and bookmarks after I used the resulting file.
I'm sorry to insist, it's possible you did something wrong, the cloning process (.clone command) copies over everything, included history, and then you just need to properly set user_version, or Firefox may again consider it corrupt because it can't find a proper version. Don't start from the -corrupt file, start from the file that Firefox 54 can open and use properly, clone that into a new db.
Currently there's no other simple way to move history, any other try may be very complex and end up still creating data coherency problems.

(In reply to fandrei from comment #18)
> > The only thing that can detect a db corruption properly is PRAGMA
> > integrity_check, but it's an expensive operation we can't run lightly when
> > opening the db.
> 
> Is it? It only took 1570 msecs on my PC.

Right, 1.5s where we'd block most of the other db operations and possibly cause an I/O bottleneck. We have users on very old systems, we can't just fire a check at a random time. That's why we use idle, but idle detection has some bugs.
Flags: needinfo?(mak77)
Track 56+/57+ as data loss regression.
this is not a regression, corrupt databases happen at every version. I'm indeed not sure what we could do about it, apart from trying to improve the experience, but that's not something we can do lightly and uplift.
(In reply to Marco Bonardo [::mak] from comment #20)

> Don't start from the
> -corrupt file, start from the file that Firefox 54 can open and use
> properly, clone that into a new db.

That's exactly what I did.

> Right, 1.5s where we'd block most of the other db operations and possibly
> cause an I/O bottleneck. We have users on very old systems, we can't just
> fire a check at a random time. 

Firefox often takes way much more time to start than 1.5s.

> That's why we use idle, but idle detection
> has some bugs.

I'd say that it's just completely defunct.
(In reply to Marco Bonardo [::mak] from comment #22)
> I'm
> indeed not sure what we could do about it, apart from trying to improve the
> experience, but that's not something we can do lightly and uplift.

Make Firefox really check for database integrity. At very least, add a button in the settings to invoke it manually.
there is already a button in about:support but it's not complete, indeed I plan (as soon as I have time) to augment that functionality to do a full check and also a more complete try to fix the db.
(In reply to Marco Bonardo [::mak] from comment #25)
> there is already a button in about:support but it's not complete, indeed I
> plan (as soon as I have time) to augment that functionality to do a full
> check and also a more complete try to fix the db.

By the way, did you ever consider the idea to use a plain text file for storing places? This functionality doesn't really require SQL features, and text file would be way much easier to validate or fix it manually. And faster as well.
It wouldn't be faster and we definitely need the Sqlite ACID guarantees, plus the awesome bar wouldn't be possible without it. Indeed most browsers nowadays use Sqlite for that.
(In reply to Marco Bonardo [::mak] from comment #27)
> It wouldn't be faster 

Just test it, plain file is always faster when data fits into RAM.

> and we definitely need the Sqlite ACID guarantees,

I've lost all my history, and quite a few users lost all bookmarks as well. It doesn't look like these guarantees really guarantee anything.
it's a few cases out of millions, and ACID cannot guarantee safety if a filesystem lies (many do), or if a system has (even small) memory, disk or overclock issues causing data breakage. Well in those cases nothing can really do anything.
(In reply to Marco Bonardo [::mak] from comment #29)
> it's a few cases out of millions

You wrote that for v55 there were 140k corruption cases reported by telemetry :) And there are more users who turned the telemetry off.

> and ACID cannot guarantee safety if a
> filesystem lies (many do)

Or just a power outage or Firefox crash. Those guarantees are very weak, anyway. And manual data recovery is pretty much impossible for SQLite.
no, power outage or crashes can't corrupt Sqlite in normal conditions (if the conditions in comment 29 are not satisfied).

(In reply to fandrei from comment #30)
> You wrote that for v55 there were 140k corruption cases reported by
> telemetry :) And there are more users who turned the telemetry off.

140k pings, the same profile may send multiple ones. And you should compare that to 100 millions users.
(In reply to Marco Bonardo [::mak] from comment #31)

> no, power outage or crashes can't corrupt Sqlite in normal conditions 

How can you be completely sure about that?

> And you should compare
> that to 100 millions users.

That's a few in a thousand, not a few in a million. Difference in 3 orders of magnitude.
I'm sorry but I feel like the discussion continues in loop, please read the sqlite3 documentation, their team has a long time expertise with these issues, and we also have some knowledge after 10 years of releases :) We could go on for hours with technical details, but none of them is useful to move on this ticket.
Removing the regression keyword as DB corruption reports appear in a few cases, but all the time during upgrades. Marking as fix-optional for 56 and 57 as there is no actionable info yet in this bug. We will reconsider the priority if we can get steps to reproduce the problem so proper investigation can occur.
Status: UNCONFIRMED → NEW
Ever confirmed: true
> I'm sorry but I feel like the discussion continues in loop, please read the sqlite3 documentation, their team has a long time expertise with these issue, and we also have some knowledge after 10 years of releases :)

Database corruption problem isn't solved, it happens for hundreds thousands users, and database consistency isn't really checked. Those are the facts.
We are making work to improve reliability of the data, but there's nothing specific we can address for this report, thus resolving for now. We'll keep working on general data reliability.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.