Closed Bug 1085161 Opened 11 years ago Closed 11 years ago

investigate sync failure after upgrade/downgrade from 33->34->33

Categories: Toolkit :: Places
Type: defect
Priority: Not set
Severity: normal
Points: 3
Tracking
Status: RESOLVED FIXED
Iteration: 36.2
Tracking Status:
firefox34 - ---
firefox35 - ---
firefox36 - ---

People

(Reporter: Gavin, Assigned: mak)

References

Details

(Keywords: regression)

I accidentally launched my "main" profile in Beta 34. I quickly fixed that and switched back to Firefox 33.0, but now Sync is failing with:

> 1413749053327 Sync.Status DEBUG Status for engine history: error.engine.reason.unknown_fail
> 1413749053327 Sync.Status DEBUG Status.service: success.status_ok => error.sync.failed_partial
> 1413749053327 Sync.ErrorHandler DEBUG history failed: database disk image is malformed No traceback available

I confirmed with a bit of debugging that the failure is here:
http://hg.mozilla.org/mozilla-central/annotate/33c0181c4a25/services/sync/modules/engines/history.js#l173

And indeed, executing this statement produces an error:
http://hg.mozilla.org/mozilla-central/annotate/33c0181c4a25/services/sync/modules/engines/history.js#l146

The mozIStorageError returned has result == 20, "datatype mismatch".

Marco, do you have any idea what could have caused this? Is there anything else I can do to debug further?
Flags: needinfo?(mak77)
[Tracking Requested - why for this release]: Should be tracked because it breaks the upgrade/downgrade path between 34 and 33.

Off-hand this must be related to the migration we did to schema version 24 in bug 1017502, even if I cannot find an obvious cause there (I just found a small improvement we could apply, but that's not breaking anything). Though I find it quite strange that a simple SELECT can issue a datatype mismatch error; I'd expect that from an INSERT or an UPDATE.

I will try to do the upgrade/downgrade dance and see if I can reproduce this to investigate further.
Assignee: nobody → mak77
Blocks: 1017502
Flags: needinfo?(mak77)
Keywords: regression
Flags: firefox-backlog+
Marco, can you please estimate and say whether this needs QA?
Flags: needinfo?(mak77)
Points: --- → 3
Flags: needinfo?(mak77) → qe-verify+
Iteration: --- → 36.1
Iteration: 36.1 → ---
Status: NEW → ASSIGNED
Iteration: --- → 36.2
I'd like to know how you ran the query on the db in comment 0. I'd also be interested to know whether a clone of the db fixes this issue:

1. Back up the corrupt places.sqlite somewhere, so we can analyze it and figure out the kind of corruption and why integrity_check doesn't detect it.
2. Close Firefox and ensure places.sqlite-wal is not present; otherwise open the db once with the sqlite3.exe command line so the wal gets merged. Ensure the wal file disappears.
3. Rename places.sqlite to places.sqlite.corrupt.
4. Using the sqlite3 command line, open places.sqlite.corrupt and do ".clone places.sqlite".
5. Try to use the new database.

If everything works properly, it's very likely this is an obscure corruption in the database headers; we should ask the SQLite team to help us figure out why it happens, to mitigate its likelihood.
Flags: needinfo?(gavin.sharp)
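The row-by-row recovery in steps 3 and 4 can be sketched outside the sqlite3 shell too. Below is a minimal Python equivalent of .clone, built on the standard library's iterdump(), which rebuilds the database from SELECT-derived SQL rather than copying file pages. The function name and file paths are illustrative, and the real .clone command tolerates some corruption this sketch does not.

```python
import sqlite3

def clone_db(src_path, dst_path):
    """Copy a database by re-executing its SQL dump into a fresh file.

    This is roughly what the sqlite3 shell's .clone command does: the data
    is copied by rows (via SELECTs), not by file pages, so healthy data can
    survive even when the on-disk file has a corrupt header.
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # iterdump() yields CREATE/INSERT statements derived from SELECTs.
        dst.executescript("\n".join(src.iterdump()))
        dst.commit()
    finally:
        src.close()
        dst.close()

# Hypothetical usage mirroring steps 3-4 above:
# clone_db("places.sqlite.corrupt", "places.sqlite")
```

As with .clone itself, any rows that cannot be read out of the corrupt file are simply lost, so the result can be incomplete.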
(In reply to Marco Bonardo [::mak] (needinfo? me) from comment #3)
> I'd like to know how you ran the query on the db in comment 0.

In my Firefox build JS shell, I ran:

Weave.Service.engineManager.get("history")._store._allUrlStm.executeAsync({
  handleResult: function () { print("handleResult"); },
  handleError: function (err) { print("handleError: " + err.result + ": " + err.message); }
});

and got:

[xpconnect wrapped mozIStoragePendingStatement]
handleError: 20: datatype mismatch

> if everything works properly, it's very likely this is an obscure corruption
> in the database headers, we should ask Sqlite team to help us figuring out
> why it happens to mitigate its possibilities.

Things still seem to be broken the same way after that (same steps above produce the same result).
Flags: needinfo?(gavin.sharp)
Does executing the same statement in the sqlite3 command shell AND/OR SQLite Manager cause the same error?

SELECT url FROM moz_places WHERE last_visit_date > 1 ORDER BY frecency DESC LIMIT 2

(I used random numbers here, but they should do.)
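For reference, the diagnostic query can also be exercised against a minimal stand-in table with Python's sqlite3 module; on a healthy database it simply returns rows. The table layout here is a simplified assumption for illustration, not the real Places schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified stand-in for moz_places: only the columns the query touches.
con.execute(
    "CREATE TABLE moz_places (url TEXT, last_visit_date INTEGER, frecency INTEGER)"
)
con.executemany(
    "INSERT INTO moz_places VALUES (?, ?, ?)",
    [("http://example.com/", 1413749053327000, 200),
     ("http://example.org/", 1413749053327001, 100),
     ("http://never.visited/", None, 50)],   # NULL last_visit_date is filtered out
)
# The diagnostic query from the comment above; on an intact database it
# returns rows, while the corrupt file raised mozIStorageError result 20.
rows = con.execute(
    "SELECT url FROM moz_places WHERE last_visit_date > 1 "
    "ORDER BY frecency DESC LIMIT 2"
).fetchall()
print(rows)  # [('http://example.com/',), ('http://example.org/',)]
```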
Just to sum up: we figured out that this is some sort of database file corruption that is NOT detected by PRAGMA integrity_check, but is fixed when the database data is .cloned to a new database. You should be able to set a breakpoint on sqlite3CorruptError and get a stack at the point the corruption is returned, so we can ask the SQLite team about it.
So, based on the above, I don't think this is a specific bug that affects upgrade/downgrade to 34. Indeed, I couldn't reproduce these failures with that procedure on either Windows or Mac. On the other hand, we should at a minimum get a stack to see where the corruption is detected and, if possible, send the db, or a minified copy of it, to the SQLite team so they can improve integrity_check to detect this kind of corruption.
Flags: needinfo?(gavin.sharp)
We could also improve Storage so that, the first time a SQLITE_CORRUPT error is hit, it fires an nsIObserver notification that a consumer can intercept and handle however it thinks is needed. A consumer could show a dialog asking the user whether it may try to fix the database, then use an implementation of the .clone command to try fixing the db. I'm saying we should ask the user because the attempt would likely hang the UI: we need to get exclusive access to the database, so any synchronous query would block the UI. An alternative could be to try doing that on shutdown, with an unusually long shutdown; we plan to do the same to fix the page size, for example.

What .clone does is pretty simple: it tries to copy db contents to a new db using SELECT queries, differently from the online backup feature, which blindly copies file pages that may be corrupt. We could implement a clonedb API in Storage. Clearly that would not solve every issue: if the corruption prevents copying some data, that data is lost and the clone could be imperfect/incomplete.

The ticket implementing .clone in SQLite is http://www.sqlite.org/src/info/d1dfadea87ecf18eeb6d2f21769deaa97473ca0e
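The corruption-notification idea above can be modeled in a few lines. This is a hedged sketch in Python rather than the actual Storage/nsIObserver API; the on_corrupt callback is a hypothetical stand-in for the observer a consumer would register.

```python
import sqlite3

def run_query(con, sql, on_corrupt):
    """Execute a query; if SQLite reports file corruption, notify a
    listener (the analogue of firing a one-shot nsIObserver notification)
    instead of letting every consumer handle the failure on its own."""
    try:
        return con.execute(sql).fetchall()
    except sqlite3.DatabaseError as err:
        # "database disk image is malformed" is SQLite's SQLITE_CORRUPT text,
        # the same message Sync logged in comment 0.
        if "malformed" in str(err):
            on_corrupt(err)  # a consumer could offer to .clone the db here
            return None
        raise
```

A real implementation would also need to serialize recovery attempts and take exclusive access to the file, which is exactly why the comment suggests asking the user (or deferring to shutdown) first.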
Given that there's no reason to believe this is a widespread issue, we don't need to track this. Let's call this fixed given the investigative work Marco has done here. I will file a separate bug on doing the stuff I need to do to debug this case further, and maybe file some followups for Marco's ideas from comment 8.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: qe-verify-
Flags: qe-verify+
Flags: needinfo?(gavin.sharp)
Resolution: --- → FIXED
Summary: Sync failure after upgrade/downgrade from 33->34->33 → investigate sync failure after upgrade/downgrade from 33->34->33