Closed
Bug 1085161
Opened 11 years ago
Closed 11 years ago
investigate sync failure after upgrade/downgrade from 33->34->33
Categories
(Toolkit :: Places, defect)
Toolkit
Places
Tracking
()
RESOLVED
FIXED
Iteration:
36.2
People
(Reporter: Gavin, Assigned: mak)
References
Details
(Keywords: regression)
I accidentally launched my "main" profile in Beta 34. I quickly fixed that and switched back to Firefox 33.0, but now Sync is failing with:
> 1413749053327 Sync.Status DEBUG Status for engine history: error.engine.reason.unknown_fail
> 1413749053327 Sync.Status DEBUG Status.service: success.status_ok => error.sync.failed_partial
> 1413749053327 Sync.ErrorHandler DEBUG history failed: database disk image is malformed No traceback available
I confirmed with a bit of debugging that the failure is here:
http://hg.mozilla.org/mozilla-central/annotate/33c0181c4a25/services/sync/modules/engines/history.js#l173
And indeed executing this statement produces an error:
http://hg.mozilla.org/mozilla-central/annotate/33c0181c4a25/services/sync/modules/engines/history.js#l146
The mozIStorageError returned has result==20, "datatype mismatch".
Marco, have any ideas what would have caused this? Anything else I can do to debug further?
Flags: needinfo?(mak77)
Assignee | ||
Comment 1•11 years ago
|
||
[Tracking Requested - why for this release]:
Should be tracked cause it breaks upgrade/downgrande path between 34 and 33.
Off-hand this must be related to the migration we did to schema version 24 in bug 1017502, even if off-hand I cannot find an obvious cause there (I just found a small improvement we could apply, but that's not breaking anything).
Though I find quite strange that a simple SELECT can issue a datatype mismatch error, I'd expect that from an INSERT or an UPDATE.
I will try to do the upgrade/downgrande dance and see if I can reproduce this to investigate further.
Assignee | ||
Updated•11 years ago
|
Keywords: regression
Updated•11 years ago
|
status-firefox33:
--- → wontfix
status-firefox34:
--- → affected
status-firefox35:
--- → affected
status-firefox36:
--- → affected
tracking-firefox35:
--- → +
tracking-firefox36:
--- → +
Reporter | ||
Updated•11 years ago
|
Flags: firefox-backlog+
Comment 2•11 years ago
|
||
Marco, can you please estimate and say whether this needs QA?
Flags: needinfo?(mak77)
Assignee | ||
Updated•11 years ago
|
Points: --- → 3
Flags: needinfo?(mak77) → qe-verify+
Reporter | ||
Updated•11 years ago
|
Iteration: --- → 36.1
Updated•11 years ago
|
Iteration: 36.1 → ---
Updated•11 years ago
|
Status: NEW → ASSIGNED
Iteration: --- → 36.2
Assignee | ||
Comment 3•11 years ago
|
||
I'd like to know how you ran the query on the db in comment 0.
I'd be interested to know if a clone of the db will fix this issue:
1. backup the corrupt places.sqlite somewhere, so we can analyze it and figure the kind of corruption and why integrity_check doesn't detect it
2. close firefox, ensure places.sqlite-wal is not present, otherwise open once the db with sqlite3.exe command line, it should be merged. Ensure the wal file disappears
3. rename places.sqlite to places.sqlite.corrupt
4. using sqlite3 command line, open places.sqlite.corrupt and do
".clone places.sqlite"
5. try to use the new database
if everything works properly, it's very likely this is an obscure corruption in the database headers, we should ask Sqlite team to help us figuring out why it happens to mitigate its possibilities.
Flags: needinfo?(gavin.sharp)
Reporter | ||
Comment 4•11 years ago
|
||
(In reply to Marco Bonardo [::mak] (needinfo? me) from comment #3)
> I'd like to know how you ran the query on the db in comment 0.
In my Firefox build JS shell, I ran:
Weave.Service.engineManager.get("history")._store._allUrlStm.executeAsync({handleResult: function (){print("handleResult"); }, handleError: function (err) { print("handleError: " + err.result + ": " + err.message); }})
and got:
[xpconnect wrapped mozIStoragePendingStatement]
handleError: 20: datatype mismatch
> if everything works properly, it's very likely this is an obscure corruption
> in the database headers, we should ask Sqlite team to help us figuring out
> why it happens to mitigate its possibilities.
Things still seem to be broken the same way after that (same steps above produce the same result).
Flags: needinfo?(gavin.sharp)
Assignee | ||
Comment 5•11 years ago
|
||
does executing the same statement in the sqlite3 command shell AND/OR Sqlite Manager cause the same error?
SELECT url FROM moz_places WHERE last_visit_date > 1 ORDER BY frecency DESC LIMIT 2 (I used random numbers here but should do)
Assignee | ||
Comment 6•11 years ago
|
||
Just to sum up, we figured that this is some sort of database file corruption that is NOT detected by PRAGMA integrity_check, but it's fixed when the database data is .cloned to a new database.
you should be able to set a breakpoint on sqlite3CorruptError, and get a stack at the point the corruption is returned, so we can try to ask to the Sqlite team about it.
Assignee | ||
Comment 7•11 years ago
|
||
So, based on the above, I don't think this is a specific bug that affects upgrade/downgrade to 34. Indeed I couldn't reproduce these failures nor on Win nor on Mac with such procedure.
On the other side, we should at a minimum get a stack to see where the corruption is detected, and if possible send the db, or a minified copy of it, to the Sqlite team so they can improve integrity_check to detect this kind of corruption.
Flags: needinfo?(gavin.sharp)
Assignee | ||
Comment 8•11 years ago
|
||
We could also improve Storage so that the first time a SQLITE_CORRUPT error is hit, it fires an nsIObserver notification that a consumer can intercept and do whatever it thinks is needed.
A consumer could show a dialog and ask the user if it can try to fix the database, then it could use an implementation of the .clone command to try fixing the db.
I'm saying we should ask the user cause that would likely hang the UI, since we need to get exclusive access to the database, so any synchronous query would block the UI. An alternative could be to try doing that on shutdown, with an unusual very long shutdown. We plan to do the same to fix the page size, for example.
What .clone does is pretty simple, it tries to copy db contents to a new db using SELECT queries, differently from the online backup feature that blindly copies file pages, that may be corrupt. We could implement a clonedb API in Storage.
That would not solve any issue clearly, if the corruption disallows to copy some data, that data is lost and the clone could be imperfect/incomplete.
the ticket implementing .clone in Sqlite is http://www.sqlite.org/src/info/d1dfadea87ecf18eeb6d2f21769deaa97473ca0e
Reporter | ||
Comment 9•11 years ago
|
||
Given that there's no reason to believe this is a widespread issue, we don't need to track this.
Let's call this fixed given the investigative work Marco has done here.
I will file a separate bug on doing the stuff I need to do to debug this case further, and maybe file some followups for Marco's ideas from comment 8.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
status-firefox33:
wontfix → ---
status-firefox34:
affected → ---
status-firefox35:
affected → ---
status-firefox36:
affected → ---
Flags: qe-verify-
Flags: qe-verify+
Flags: needinfo?(gavin.sharp)
Resolution: --- → FIXED
Summary: Sync failure after upgrade/downgrade from 33->34->33 → investigate sync failure after upgrade/downgrade from 33->34->33
You need to log in
before you can comment on or make changes to this bug.
Description
•