Last Comment Bug 734507 - Provide gloda soft schema bump to recover DB badness from bug 723372 which manifested as a.contact is undefined in gloda.js
: Provide gloda soft schema bump to recover DB badness from bug 723372 which ma...
Status: RESOLVED FIXED
:
Product: MailNews Core
Classification: Components
Component: Database (show other bugs)
: Trunk
: All All
: -- normal (vote)
: Thunderbird 13.0
Assigned To: Andrew Sutherland [:asuth]
:
Mentors:
Depends on: 732372 754780
Blocks:
  Show dependency treegraph
 
Reported: 2012-03-09 14:16 PST by Andrew Sutherland [:asuth]
Modified: 2012-05-29 05:08 PDT (History)
11 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
-
+
fixed


Attachments
migration logic and a unit test (41.50 KB, patch)
2012-03-10 23:10 PST, Andrew Sutherland [:asuth]
mozilla: review+
jonathan.protzenko: feedback+
standard8: approval‑comm‑beta+
Details | Diff | Splinter Review

Description Andrew Sutherland [:asuth] 2012-03-09 14:16:53 PST
+++ This bug was initially created as a clone of Bug #732372 +++

The fallout from bug 723372 is that people who used Thunderbird Beta (11), Earlybird (12), or Daily (13) prior to its fix landing will have a messed up gloda database like so:

1) Newly created e-mail identities that were in the address book will end up missing their contact.

2) E-mails indexed involving messed-up identities as in #1 will end up being marked 'bad' (id=1).  This occurs because either:
  2a) The thrown exception during the indexing of the address book card from #1 directly caused a failure in indexing the message.
  2b) The call to 'optimize' for fundattr.js exploded when the identity created from #1 is loaded and an attempt is made to access the .contact attribute and one of its properties.

Unfortunately, because users can jump around amongst Thunderbird versions and we don't maintain any useful metadata about that, the best way to ensure that we correct the database problem is to track that the corrective pass has been performed using the schema revision.


The actions required to perform the soft schema change are as easy as:

A) Get a list of all identities without contacts using the SQL query from bug 723372.  Use this to create the missing contacts.

B) Change the 'bad' marker from 1 to something else less than 32 so that any messages we see marked with id 1 get reindexed.  (Although id 2 is obvious, I'll need to look at the iterator filter logic; it might be better to alternate between 1 and 31 or something like that for range scan purposes.)

C) If we did anything in step A, then mark all folders as dirty and trigger an indexing sweep.  This will do a scan of the messaging headers looking for any messages that should be indexed.

This can be done asynchronously as part of the indexing sweep at startup so that we don't make the upgrade process any more annoying than it has to be.  We will want to do this check before we do the normal indexing sweep for efficiency purposes.  It is essential that the contact id's are generated before any message indexing takes place, so all of that logic should happen in a single worker job.


The bad news is that if someone runs a build with this fix, then goes back to Thunderbird 10, Thunderbird 10's gloda will nuke the database because it will be scared.  Regrettably, this would not have been the case with Thunderbird 9, which intentionally left some spare schema revisions that it did not find terrifying to deal with situations like this.  I'll reinstate the idiom and document its need more strongly.  I let it disappear because I thought the rapid release model would moot that need, but I clearly did not think through the fallout of transient problems on daily/earlybird/beta...

I can provide a fix for Thunderbird 10 that will make it not explode if it sees a database where this fix has been applied that we can leave around to go out if we end up slipstreaming any Thunderbird 10 updates.  Not sure how that works with TB 11 being fairly imminent.
Comment 1 Andrew Sutherland [:asuth] 2012-03-10 23:10:28 PST
Created attachment 604746 [details] [diff] [review]
migration logic and a unit test

Here's migration logic as discussed on the bug.  I provide a unit test that attempts to duplicate the symptoms of the problem and then fix them.

I also duplicated by using both 1) an unhappy TB 11 b4 build and 2) an earlybird build that pre-dates the fix landing to generate bad gloda dbs from scratch.  I manually verified that the gloda db was messed up by inspecting the sqlite database, then ran with this patch and verified the db was not messed up afterwards.

All gloda xpcshell tests pass locally and I've pushed to try:

http://build.mozillamessaging.com/tinderboxpushlog/?tree=ThunderbirdTry&rev=472096b7c6d6

I've directed the review request at bienvenu, but protz I would appreciate it if you could look over the patch as well, particularly the comment that codifies a more explicit migration strategy than the ad-hoc stuff that was happening before.
Comment 2 Andrew Sutherland [:asuth] 2012-03-10 23:23:38 PST
Ludo, you will probably want to verify once the try build is fully baked.

STR are basically:
0) have some contacts in your address book who you have sent mail to in your sent mail folder.
1) delete global-messages-db.sqlite in your profile directory (Without TB running)
2) run an affected TB build; TB 11b4 or any build prior to March 8th should be fine.  Be running glodaquilla with forced compatibility so you can see the gloda-id's in columns.
3) look at the activity manager, wait for the gloda indexing sweep to have finished indexing the "sent mail" folders (it indexes them first) and moved onto Inboxes (or something else; anything but sent mail.)  Then wait 20 seconds to make sure a database commit has happened.
4) Go into a sent mail folder, and with glodaquilla's gloda id column showing, verify that some of the messages have a gloda id of '1' which is the *old* bad message identifier.  This should happen to all messages involving contacts from your address book.
5) Quit (the busted) Thunderbird.

6) Run with the patched Thunderbird.
7) wait for gloda to have moved on past the sent mail folders.
8) go into the sent folder
9) observe that the 1's are now gone and replaced by hex numbers >= 20.  If you are in the folder while it's still indexing, the 1's may change as you move your mouse and cause paint invalidations.

Hopefully that covers it.
Comment 3 Jonathan Protzenko [:protz] 2012-03-11 01:55:51 PST
Unterminated comment at line 763 in datastore.js. Did you upload the right version of the patch? :)
Comment 4 Jonathan Protzenko [:protz] 2012-03-11 03:16:32 PDT
Comment on attachment 604746 [details] [diff] [review]
migration logic and a unit test

Review of attachment 604746 [details] [diff] [review]:
-----------------------------------------------------------------

I think you need to document more what happens when one wants to migrate the schema version (there's a fair bit already but some things are still unclear, and in case you're not around forever ready to fix things on a Sunday before a release, that might be worth it :)). I think you need to mention that:
- when moving a from a schema version in the "acceptable" range to the "downgrade" range, one should use currentSchemaVersion+5 as the new version number,
- if one wants to perform from there on a backwards-compatible, upgrade-transition update, they're basically stuck and have to do a full, nuking schema bump.

Is that correct?

Overall, I think this leaves room for quite a few mitigation strategies but will not save us a few full upgrades of the database, with all the reindexing. Speaking of which, has anyone looked into the telemetry results for the indexing rate, and the alleged problems in indexing speed / reactivity? (« My Thunderbird has been indexing for two days straight »-like symptoms).

::: mailnews/db/gloda/modules/datastore.js
@@ +763,5 @@
> +   * - Backwards incompatible, minor schema changes.
> +   *    Thunderbird N+1 does something that does not require nuking the database
> +   *    but will break Thunderbird N's ability to use the database.
> +   *
> +   * - Regression fixes.  Sometimes we land something that 

Unterminated comment here. I feel like you whipped up this patch in a hurry :)

@@ +1076,5 @@
>          var tokenizer = Cc["@mozilla.org/messenger/fts3tokenizer;1"].
>                            getService(Ci.nsIFts3Tokenizer);
>          tokenizer.registerTokenizer(dbConnection);
>  
> +        // -- database schame changes

typo

::: mailnews/db/gloda/modules/index_msg.js
@@ +89,5 @@
>   *  release is made.
> + *
> + * This should ideally just flip between 1 and 2, with GLODA_OLD_BAD_MESSAGE_ID
> + *  flipping in the other direction.  If we start having more trailing badness,
> + *  _indxerGetEnumerator and GLODA_OLD_BAD_MESSAGE_ID will need to be altered.

typo

::: mailnews/db/gloda/test/unit/resources/glodaTestHelper.js
@@ +1340,5 @@
>                              .getService(Components.interfaces.nsIAbManager);
> +  // XXX bug 314448 demands that we trigger creation of the ABs...  If we don't
> +  //  do this, then the call to addCard will fail if someone else hasn't tickled
> +  //  this.
> +  abManager.directories;

I hope the JITs won't optimize this into a no-op later...
Comment 5 David :Bienvenu 2012-03-11 09:05:33 PDT
I'll run with this patch later this afternoon and make sure it fixes my corruption.

> +   * - Regression fixes.  Sometimes we land something that 

"wasn't finished"? Oh, I see what you did there :-)
Comment 6 David :Bienvenu 2012-03-11 16:29:31 PDT
Comment on attachment 604746 [details] [diff] [review]
migration logic and a unit test

modulo protz's comments, this looks good. It fixed my gloda search issues, and the unit tests look good.
Comment 7 Andrew Sutherland [:asuth] 2012-03-11 22:39:19 PDT
Pushed to trunk:
http://hg.mozilla.org/comm-central/rev/541d14e08463

Requesting TB 11 and 12 landings or what not since they both were vulnerable to this.  If there's already a TB 13 branch, I guess it will want this too.

(In reply to Jonathan Protzenko [:protz] from comment #4)
> I think you need to document more what happens when one wants to migrate the
> schema version (there's a fair bit already but some things are still
> unclear, and in case you're not around forever ready to fix things on a
> Sunday before a release, that might be worth it :)). I think you need to
> mention that:
> - when moving a from a schema version in the "acceptable" range to the
> "downgrade" range, one should use currentSchemaVersion+5 as the new version
> number,
> - if one wants to perform from there on a backwards-compatible,
> upgrade-transition update, they're basically stuck and have to do a full,
> nuking schema bump.

I added some more discussion of this.  In practice, you can keep redefining the revisions as you go, especially as there will be a finite number of Thunderbird versions in play.

If we end up doing a lot more migrations, it's probably simpler to just add a table/tables so that different versions of thunderbird can communicate; there's no need to actually try and distill everything down to a single schema version.

For example, if we had a table that tracked the earliest and most recent build used in a Thunderbird release series, we could determine whether we even need to run regression fixes like this.  So our check would have been to see if any TB 11, 12, or 13 build prior to March 9th had been run.  If so, we run our fix and make note of our fix build date which suppresses future runs of the fix logic.

But for now that's not really required and the engineering effort would be better spent on further fleshing out gloda unit tests so as to further reduce the potential for silent regressions.

> Overall, I think this leaves room for quite a few mitigation strategies but
> will not save us a few full upgrades of the database, with all the
> reindexing. Speaking of which, has anyone looked into the telemetry results
> for the indexing rate, and the alleged problems in indexing speed /
> reactivity? (« My Thunderbird has been indexing for two days straight »-like
> symptoms).

No?  I just looked at the telemetry data:
https://metrics.mozilla.com/pentaho/content/pentaho-cdf-dd/Render?solution=metrics2&path=telemetry&file=telemetryHistogram.wcdf

And THUNDERBIRD_GLODA_SIZE_MB's "1" bar is at over 57%; the JSON says this amounts to 229981 samples.  I'm assuming that one sample point is not one user, implying that we don't have a huge set of data-points coming in, and that many users are just not using gloda.  Having that data point is useful though, as it should provide some guidance for tuning the call to setGrowthIncrement.

> > +   * - Regression fixes.  Sometimes we land something that 
> 
> Unterminated comment here. I feel like you whipped up this patch in a hurry
> :)

Sorta.  As I'm writing comments, they remind me of things I need to do, so then I go and do those things, and sometimes forget to go back to the comments.  If I'm less rushed on/tired of a patch, I tend to do a more thorough job reviewing my patch to notice these things.

> I hope the JITs won't optimize this into a no-op later...

Since getters are a first-class thing in JS, I expect optimizers would only do that if they had full visibility into the thing and include a PIC shape-guard to skip the call.  Since it's XPConnect, it would likely avoid that.  Also, it's a unit test that fails when that breaks, so we'll be the first to know about the exciting semantics change :)  Better yet, I stole that bit from some address book unit test code, so Standard8 will be the one who fixes it for us!
Comment 8 Andrew Sutherland [:asuth] 2012-03-13 10:33:02 PDT
Thunderbird 11 had the regression and so anyone who ran Thunderbird 11 may have a messed up gloda database.  Are you assuming that they will stay on the beta channel and we will land this in 12 and so they'll get their database fixed?

If so, it's still advisable to land something on Thunderbird 11, because as it stands, Thunderbird 11 will nuke the gloda database whenever it sees a database updated by this fix.  As such, it would be advisable to at least land the database tolerance part of this patch.
Comment 9 Andrew Sutherland [:asuth] 2012-03-14 16:12:21 PDT
I'm restoring the tracking-thunderbird11 flag because some action, even if it's a new patch, is strongly advised for Thunderbird 11 even though we've probably missed the initial boat now.  Comment 8 is what should be addressed.
Comment 10 Mark Banner (:standard8) 2012-03-20 07:49:04 PDT
(In reply to Andrew Sutherland (:asuth) from comment #8)
> Thunderbird 11 had the regression and so anyone who ran Thunderbird 11 may
> have a messed up gloda database.

Just to make this clear, that's "ran Thunderbird 11 beta".

> Are you assuming that they will stay on
> the beta channel and we will land this in 12 and so they'll get their
> database fixed?

Yep, I suspect not many people notice it anyway.

> If so, it's still advisable to land something on Thunderbird 11, because as
> it stands, Thunderbird 11 will nuke the gloda database whenever it sees a
> database updated by this fix.  As such, it would be advisable to at least
> land the database tolerance part of this patch.

Unfortunately this patch came in a bit too late for 11. I think we can safely take it in 12, and that will resolve most of our issues - and fix the beta users.

Its unclear if there would be any side effects for TB 11 release users going back to TB 10, but as we didn't change the gloda schema version for 11, I'm guessing not.
Comment 11 Mark Banner (:standard8) 2012-03-20 07:51:20 PDT
Comment on attachment 604746 [details] [diff] [review]
migration logic and a unit test

[Triage Comment]
a=me, I'll land this in a bit.
Comment 12 Mark Banner (:standard8) 2012-03-20 07:56:15 PDT
Checked in:

http://hg.mozilla.org/releases/comm-beta/rev/39fafab28941
Comment 13 Andrew Sutherland [:asuth] 2012-03-21 10:31:37 PDT
> Its unclear if there would be any side effects for TB 11 release users going
> back to TB 10, but as we didn't change the gloda schema version for 11, I'm
> guessing not.

With you landing this on Thunderbird 12, the transitions will look like this with a TB 12 build with this patch:

TB 11->10, 13->12: no change, same schema revisions.
TB 10->11, 11->12+, 10->12+: database checked for bad contacts, schema upgraded
TB 12->11, 13->11: database nuked
Comment 14 Andrew Sutherland [:asuth] 2012-03-21 10:40:19 PDT
(In reply to Andrew Sutherland (:asuth) from comment #13)
> > Its unclear if there would be any side effects for TB 11 release users going
> > back to TB 10, but as we didn't change the gloda schema version for 11, I'm
> > guessing not.
> 
> With you landing this on Thunderbird 12, the transitions will look like this
> with a TB 12 build with this patch:

whoops. the 10->11 change was wrong.  revised table:

TB 10->11, TB 11->10, 13->12: no change, same schema revisions.
TB 10,11->12+: database checked for bad contacts, schema upgraded
TB 12+->11: database nuked
Comment 15 Mark Banner (:standard8) 2012-03-22 12:21:29 PDT
(In reply to Andrew Sutherland (:asuth) from comment #14)
> (In reply to Andrew Sutherland (:asuth) from comment #13)
> > > Its unclear if there would be any side effects for TB 11 release users going
> > > back to TB 10, but as we didn't change the gloda schema version for 11, I'm
> > > guessing not.
> > 
> > With you landing this on Thunderbird 12, the transitions will look like this
> > with a TB 12 build with this patch:
> 
> whoops. the 10->11 change was wrong.  revised table:
> 
> TB 10->11, TB 11->10, 13->12: no change, same schema revisions.
> TB 10,11->12+: database checked for bad contacts, schema upgraded
> TB 12+->11: database nuked

Ok, that's fine, I can live with that.
Comment 16 Jonathan Protzenko [:protz] 2012-05-24 12:11:33 PDT
I've had a surprisingly high number of people reporting an unusable Thunderbird Conversations due to a bad global-message-db.sqlite file (Thunderbird 12), see https://getsatisfaction.com/mozilla_messaging/topics/thunderbird_conversations_not_working_correctly_on_osx among others.

As to me, on every one of the three computers running Thunderbird, the global search suddenly became unusable, throwing "Error: aTab is undefined" when launching it. This was fixed by deleting the sqlite file.

Could this be related to the schema bump?
Comment 17 Andrew Sutherland [:asuth] 2012-05-25 10:40:49 PDT
(In reply to Jonathan Protzenko [:protz] from comment #16)
> I've had a surprisingly high number of people reporting an unusable
> Thunderbird Conversations due to a bad global-message-db.sqlite file
> (Thunderbird 12), see
> https://getsatisfaction.com/mozilla_messaging/topics/
> thunderbird_conversations_not_working_correctly_on_osx among others.

This is probably https://bugzilla.mozilla.org/show_bug.cgi?id=754780

Note You need to log in before you can comment on or make changes to this bug.