Closed Bug 1085161 Opened 11 years ago Closed 11 years ago

investigate sync failure after upgrade/downgrade from 33->34->33

Categories: Toolkit :: Places
Type: defect
Priority: Not set
Severity: normal
Points: 3
Tracking
Status: RESOLVED FIXED
Iteration: 36.2
Tracking Status:
firefox34 - ---
firefox35 - ---
firefox36 - ---

People

(Reporter: Gavin, Assigned: mak)

References

Details

(Keywords: regression)

I accidentally launched my "main" profile in Beta 34. I quickly fixed that and switched back to Firefox 33.0, but now Sync is failing with:

> 1413749053327 Sync.Status DEBUG Status for engine history: error.engine.reason.unknown_fail
> 1413749053327 Sync.Status DEBUG Status.service: success.status_ok => error.sync.failed_partial
> 1413749053327 Sync.ErrorHandler DEBUG history failed: database disk image is malformed No traceback available

I confirmed with a bit of debugging that the failure is here:
http://hg.mozilla.org/mozilla-central/annotate/33c0181c4a25/services/sync/modules/engines/history.js#l173

And indeed, executing this statement produces an error:
http://hg.mozilla.org/mozilla-central/annotate/33c0181c4a25/services/sync/modules/engines/history.js#l146

The mozIStorageError returned has result == 20, "datatype mismatch".

Marco, do you have any idea what could have caused this? Is there anything else I can do to debug further?
Flags: needinfo?(mak77)
[Tracking Requested - why for this release]: Should be tracked because it breaks the upgrade/downgrade path between 34 and 33.

Off-hand this must be related to the migration we did to schema version 24 in bug 1017502, even if I cannot find an obvious cause there (I just found a small improvement we could apply, but that's not breaking anything). Though I find it quite strange that a simple SELECT can issue a datatype mismatch error; I'd expect that from an INSERT or an UPDATE.

I will try to do the upgrade/downgrade dance and see if I can reproduce this to investigate further.
Assignee: nobody → mak77
Blocks: 1017502
Flags: needinfo?(mak77)
Keywords: regression
Flags: firefox-backlog+
Marco, can you please estimate and say whether this needs QA?
Flags: needinfo?(mak77)
Points: --- → 3
Flags: needinfo?(mak77) → qe-verify+
Iteration: --- → 36.1
Iteration: 36.1 → ---
Status: NEW → ASSIGNED
Iteration: --- → 36.2
I'd like to know how you ran the query on the db in comment 0. I'd also be interested to know whether a clone of the db fixes this issue:

1. Back up the corrupt places.sqlite somewhere, so we can analyze it and figure out the kind of corruption and why integrity_check doesn't detect it.
2. Close Firefox and ensure places.sqlite-wal is not present; otherwise open the db once with the sqlite3.exe command line so the wal gets merged. Ensure the wal file disappears.
3. Rename places.sqlite to places.sqlite.corrupt.
4. Using the sqlite3 command line, open places.sqlite.corrupt and do ".clone places.sqlite".
5. Try to use the new database.

If everything works properly, it's very likely this is an obscure corruption in the database headers; we should ask the SQLite team to help us figure out why it happens, to mitigate its likelihood.
Flags: needinfo?(gavin.sharp)
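The row-by-row recovery in steps 3 and 4 can be sketched outside the sqlite3 shell too. Below is a minimal Python equivalent of .clone, built on the standard library's iterdump(), which rebuilds the database from SELECT-derived SQL rather than copying file pages. The function name and file paths are illustrative, and the real .clone command tolerates some corruption this sketch does not.

```python
import sqlite3

def clone_db(src_path, dst_path):
    """Copy a database by re-executing its SQL dump into a fresh file.

    This is roughly what the sqlite3 shell's .clone command does: the data
    is copied by rows (via SELECTs), not by file pages, so healthy data can
    survive even when the on-disk file has a corrupt header.
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # iterdump() yields CREATE/INSERT statements derived from SELECTs.
        dst.executescript("\n".join(src.iterdump()))
        dst.commit()
    finally:
        src.close()
        dst.close()

# Hypothetical usage mirroring steps 3-4 above:
# clone_db("places.sqlite.corrupt", "places.sqlite")
```

As with .clone itself, any rows that cannot be read out of the corrupt file are simply lost, so the result can be incomplete.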
(In reply to Marco Bonardo [::mak] (needinfo? me) from comment #3)
> I'd like to know how you ran the query on the db in comment 0.

In my Firefox build JS shell, I ran:

Weave.Service.engineManager.get("history")._store._allUrlStm.executeAsync({
  handleResult: function () { print("handleResult"); },
  handleError: function (err) { print("handleError: " + err.result + ": " + err.message); }
});

and got:

[xpconnect wrapped mozIStoragePendingStatement]
handleError: 20: datatype mismatch

> if everything works properly, it's very likely this is an obscure corruption
> in the database headers, we should ask Sqlite team to help us figuring out
> why it happens to mitigate its possibilities.

Things still seem to be broken the same way after that (same steps above produce the same result).
Flags: needinfo?(gavin.sharp)
Does executing the same statement in the sqlite3 command shell AND/OR SQLite Manager cause the same error?

SELECT url FROM moz_places WHERE last_visit_date > 1 ORDER BY frecency DESC LIMIT 2

(I used random numbers here, but they should do.)
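For reference, the diagnostic query can also be exercised against a minimal stand-in table with Python's sqlite3 module; on a healthy database it simply returns rows. The table layout here is a simplified assumption for illustration, not the real Places schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified stand-in for moz_places: only the columns the query touches.
con.execute(
    "CREATE TABLE moz_places (url TEXT, last_visit_date INTEGER, frecency INTEGER)"
)
con.executemany(
    "INSERT INTO moz_places VALUES (?, ?, ?)",
    [("http://example.com/", 1413749053327000, 200),
     ("http://example.org/", 1413749053327001, 100),
     ("http://never.visited/", None, 50)],   # NULL last_visit_date is filtered out
)
# The diagnostic query from the comment above; on an intact database it
# returns rows, while the corrupt file raised mozIStorageError result 20.
rows = con.execute(
    "SELECT url FROM moz_places WHERE last_visit_date > 1 "
    "ORDER BY frecency DESC LIMIT 2"
).fetchall()
print(rows)  # [('http://example.com/',), ('http://example.org/',)]
```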
Just to sum up: we figured out that this is some sort of database file corruption that is NOT detected by PRAGMA integrity_check, but is fixed when the database data is .cloned to a new database. You should be able to set a breakpoint on sqlite3CorruptError and get a stack at the point the corruption is returned, so we can ask the SQLite team about it.
So, based on the above, I don't think this is a specific bug that affects upgrade/downgrade to 34. Indeed, I couldn't reproduce these failures with that procedure on either Windows or Mac. On the other hand, we should at a minimum get a stack to see where the corruption is detected and, if possible, send the db, or a minified copy of it, to the SQLite team so they can improve integrity_check to detect this kind of corruption.
Flags: needinfo?(gavin.sharp)
We could also improve Storage so that, the first time a SQLITE_CORRUPT error is hit, it fires an nsIObserver notification that a consumer can intercept and handle however it thinks is needed. A consumer could show a dialog asking the user whether it may try to fix the database, then use an implementation of the .clone command to try fixing the db. I'm saying we should ask the user because the attempt would likely hang the UI: we need to get exclusive access to the database, so any synchronous query would block the UI. An alternative could be to try doing that on shutdown, with an unusually long shutdown; we plan to do the same to fix the page size, for example.

What .clone does is pretty simple: it tries to copy db contents to a new db using SELECT queries, differently from the online backup feature, which blindly copies file pages that may be corrupt. We could implement a clonedb API in Storage. Clearly that would not solve every issue: if the corruption prevents copying some data, that data is lost and the clone could be imperfect/incomplete.

The ticket implementing .clone in SQLite is http://www.sqlite.org/src/info/d1dfadea87ecf18eeb6d2f21769deaa97473ca0e
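The corruption-notification idea above can be modeled in a few lines. This is a hedged sketch in Python rather than the actual Storage/nsIObserver API; the on_corrupt callback is a hypothetical stand-in for the observer a consumer would register.

```python
import sqlite3

def run_query(con, sql, on_corrupt):
    """Execute a query; if SQLite reports file corruption, notify a
    listener (the analogue of firing a one-shot nsIObserver notification)
    instead of letting every consumer handle the failure on its own."""
    try:
        return con.execute(sql).fetchall()
    except sqlite3.DatabaseError as err:
        # "database disk image is malformed" is SQLite's SQLITE_CORRUPT text,
        # the same message Sync logged in comment 0.
        if "malformed" in str(err):
            on_corrupt(err)  # a consumer could offer to .clone the db here
            return None
        raise
```

A real implementation would also need to serialize recovery attempts and take exclusive access to the file, which is exactly why the comment suggests asking the user (or deferring to shutdown) first.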
Given that there's no reason to believe this is a widespread issue, we don't need to track this. Let's call this fixed given the investigative work Marco has done here. I will file a separate bug on doing the stuff I need to do to debug this case further, and maybe file some followups for Marco's ideas from comment 8.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: qe-verify-
Flags: qe-verify+
Flags: needinfo?(gavin.sharp)
Resolution: --- → FIXED
Summary: Sync failure after upgrade/downgrade from 33->34->33 → investigate sync failure after upgrade/downgrade from 33->34->33