Closed Bug 519543 Opened 15 years ago Closed 14 years ago

Gloda search while indexing hangs and is very slow; search activity should suppress indexing

Categories

(Thunderbird :: Search, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

Status: RESOLVED WORKSFORME
Target Milestone: Thunderbird 3

People

(Reporter: sdwilsh, Assigned: asuth)

References

Details

(Keywords: perf, Whiteboard: [no l10n impact][fixed RC1 build 2, needs solution for security/stability branch, and core bug 528802 landing on trunk])

Attachments

(6 files, 3 obsolete files)

5.84 KB, patch (Bienvenu: review+)
3.42 KB, text/plain
10.64 KB, patch (asuth: review+)
411.01 KB, application/x-bzip (gozer: review+)
545.46 KB, application/x-bzip (gozer: review+)
1.44 KB, patch
When I type into the search box and then select something like "messages with ALL of:...", I get a long hang.  I do have a lot of messages (all of my bugmail) being indexed, which could be a factor here.
Flags: blocking-thunderbird3?
I'm pretty sure this is a dupe, and iirc it is due to the fact we haven't got some async sqlite stuff on the 1.9.1 branch (but I could be totally wrong).
Whiteboard: dupeme
(In reply to comment #1)
> I'm pretty sure this is a dupe, and iirc it is due to the fact we haven't got
> some async sqlite stuff on the 1.9.1 branch (but I could be totally wrong).
It might be possible for us to backport it if you really need it, so if that's the case let me know.  I found the user experience here to be pretty horrible, sadly.
from talking to sdwilsh, i think the summary is that the main thread locks up _real bad_ if there's indexing going on and we do a search.  I think we should at least investigate ways to mitigate that, on asuth's return, especially given the opportunity provided by a reporter whose code is involved. =)
Assignee: nobody → bugmail
Flags: blocking-thunderbird3? → blocking-thunderbird3+
I think I may have spoken confusingly online.  Neither the activity manager nor the status bar indicated that indexing was going on when I was trying to do my search.
...and I should have added this before I pressed commit: this is reproducible for me, but sometimes it's much worse than other times.  There is always a hang, and it almost always beachballs.
Whiteboard: dupeme → [no l10n impact] dupeme
How large are your address books? I have a 1400 entry address book. When gloda search locks up for me, it's because we're iterating over the address books looking for a card for an e-mail address. For e-mail addresses not in my address books, it's especially painful because when nsAbMDBDirectory::CardForEmailAddress fails on the hash table lookup of the primary address, it falls back on a linear search of all the cards in the db for the secondary address, so it can do case-insensitive lookups.
So the bad scenario is to have a lot of address book entries, and to get a search with a lot of hits, with some hits containing addresses not in the address book(s).

We could do the same trick we do for the primary addresses, which is to store a lower case copy of the secondary address column, and do the hash table lookup on the lower-cased address we're looking for. We'd have to migrate older address books, but that's fairly easy.
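To make that concrete, here's a rough sketch of what the lookup could look like with a lower-cased secondary-address column; the column and helper names are hypothetical, not actual nsAbMDBDirectory code:

  // Sketch only: "hashLookup" and the lower-cased column names are hypothetical.
  function cardForEmailAddress(aDb, aEmailAddress) {
    let lcAddress = aEmailAddress.toLowerCase();
    // Primary address: already a hash table lookup on a lower-cased column.
    let card = aDb.hashLookup("LowercasedPrimaryEmail", lcAddress);
    if (card)
      return card;
    // Secondary address: with a lower-cased copy stored at write time this
    // becomes another O(1) hash lookup instead of a case-insensitive linear
    // scan over every card in the database.
    return aDb.hashLookup("LowercasedSecondEmail", lcAddress);
  }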
(In reply to comment #7)
> So the bad scenario is to have a lot of address book entries, and to get a
> search with a lot of hits, with some hits containing addresses not in the
> address book(s).

How intelligent is the search? At the moment, when you do a search, does it repeatedly look up the same address even though it already has that result?
Are we talking about locking up in the autocomplete popup creation, or in the rendering of results based on a full-text search?

I'm pretty sure that the autocomplete search process is going through GlodaContact objects, not nsIAbCards.  http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/components/glautocomp.js#190

I'm not even convinced that the search results code does _anything_ with address book entries.
I think bienvenu & sdwilsh's problems have very different root causes.
I'm talking about the displaying of the search results from a full text search (or, rather, the code that processes the search results prior to displaying them).

See http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/modules/utils.js#117

If Venkman actually worked, I could give you a stack trace. We may not be doing anything with address book entries, but we're definitely getting them.

I don't know if gloda is caching the results or just asking for them for each address it sees.
It'd be good to figure out why gloda needs the data.  I still suspect this is different from sdwilsh's behavior, which I think is "crappy query performance while indexing is going on".

(aside: I will sometimes insert a:

  try { this.wont.work(); } catch(e) { logException(e) };

in cases when I want a stack trace.  We could use a better way to do that.)
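One candidate for that better way is a small helper that walks Components.stack directly; this is just a sketch, and where it would live (next to logException, say) is an assumption:

  // Sketch of a stack-logging helper that avoids the dummy throw.
  function logCallStack(aLogger) {
    let frames = [];
    // Components.stack points at this frame; walk upward via .caller.
    for (let frame = Components.stack.caller; frame; frame = frame.caller)
      frames.push(frame.filename + ":" + frame.lineNumber + " " + frame.name);
    aLogger.debug("call stack:\n" + frames.join("\n"));
  }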
I won't be at the computer that has this problem until Monday, but I don't think the address book is very large (probably fewer than 100 entries).
Here's the js stack for taking a search result and looking in the ab for it:

in get card for email
gloda_utils_getCardForEmail("mlecontact@na2.us.ml.com")@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/utils.js:123
()@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/datamodel.js:685
()@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/datamodel.js:701
gloda_fundattr_score([object Object])@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/fundattr.js:696
gloda_ns_grokNounItem([object Array],[object Object],[object Array])@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/gloda.js:2188
GlodaMsgSearcher_onItemsAdded([object Array],[object Object])@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/msg_search.js:291
gloda_coll_onItemsAdded([object Array])@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/collection.js:540
([object Array],[object Object])@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/datastore.js:143
gloda_coll_onItemsAdded([object Array])@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/collection.js:540
([object Array],[object Object])@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/datastore.js:143
gloda_coll_onItemsAdded([object Array])@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/collection.js:540
([object Array],[object Object])@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/datastore.js:143
gloda_coll_onItemsAdded([object Array])@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/collection.js:540
(null,[object Object],true)@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/datastore.js:143
gloda_ds_qfq_handleCompletion(0)@file:///C:/mozilla-build/msys/acland/objdir-tb/mozilla/dist/bin/modules/gloda/datastore.js:385
While I generally suspect that the "async is not actually async" problem is likely the problem (and could be resolved by porting storage changes to 1.9.1), we certainly could do a better job at caching knowledge about whether an address book card exists.  So let's try and take the address book card out of the picture...

Our current caching only caches successful location of an address book card.  As it sounds like the failure case is the expensive one, this can potentially get ugly.  We were doing this because we didn't want stale false values and didn't have address book indexer logic at the time.  We now have such logic, so this patch changes things so we cache our determination and then just keep it up-to-date by using our address book indexer.
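Roughly, the shape of the change is sketched below; the names approximate the real patch rather than quote it, and the import of GlodaUtils is elided:

  // Illustrative sketch; the real change lives on GlodaIdentity plus the gloda
  // address book indexer, and GlodaUtils.cardForEmailAddress is gloda's
  // existing lookup helper (import elided).
  let identityLike = {
    value: "someone@example.com",
    _hasAddressBookCard: undefined,
    get inAddressBook() {
      // Cache the result of the lookup whether it succeeds *or* fails,
      // instead of only caching successful lookups.
      if (this._hasAddressBookCard === undefined)
        this._hasAddressBookCard =
          (GlodaUtils.cardForEmailAddress(this.value) != null);
      return this._hasAddressBookCard;
    },
  };
  // The address book indexer's onItemAdded/onItemRemoved notifications then
  // flip _hasAddressBookCard for the matching identity so the cached value
  // never goes stale.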

I have a unit test that is (regrettably) dependent on my unit test refactoring for my gloda-delete patch that I will attach separately.  It passes.
Attachment #404711 - Flags: review?(bienvenu)
Oh, and I forgot to mention that the preceding patch only deals with the primary e-mail address when looking up e-mail addresses in the address book.  We probably should be doing something with the secondary address, but given the poor address book story there and the complete lack of any UI to merge contacts, I don't think it's remotely a high priority, and definitely not this bug.
Attachment #404715 - Attachment mime type: application/empty → text/plain
And to reiterate, I think our problem is the async issue but this is by inference; I have not done proper profiling using proper tooling.

Our auto-completion code is pretty much guaranteed to lock the UI up for a little bit as it dispatches two queries in rapid succession, both of which are wildcarded LIKEs that result in table scans that are O(number of e-mail addresses you have ever seen).  We can mitigate that one somewhat easily.

If indexing is happening in the background, lock-ups are quite likely and potentially very long-running.  The specific long-running possibilities are:
- Our actual fulltext query is not cheap.
- A fulltext insertion by the indexer can result in FTS segment merges which are not cheap.
- A database commit triggered by the indexer involves fsyncs and is not cheap.
- A SQLite cache spill will result in fsyncs even without compelling a transaction.

We can attempt to mitigate the indexing problem by suppressing indexing while something is happening with the search box.

sdwilsh, if you could ballpark the work to port the async changes to 1.9.1, that would be great.  I originally had assumed it would not be possible because places changes might need to come along, but now that I think about it, I think the existing storage API has remained relatively stable.  It would be hard to bring some of the places changes to 1.9.1 without storage, but probably not as much vice versa.  I had been operating on the assumption that gloda search was just going to randomly suck until we moved to 1.9.2.
What bugs did you have in mind?  I can't recall all of the ones that got fixed that might impact this, sadly.

Does the autocomplete code perform a synchronous lookup on the database?
(In reply to comment #18)
> What bugs did you have in mind?  I can't recall all of the ones that got fixed
> that might impact this, sadly.

I guess the main bug is bug 506022 and its dependencies.  1.9.1's executeAsync will clone the SQL which requires access to the sqlite mutex and dooms us.  Unfortunately, that bug built on tons of work that came before it.

I was interpreting backport as a "cp 1.9.2/storage/* 1.9.1/storage/*" sort of thing.

> Does the autocomplete code perform a synchronous lookup on the database?

Gloda performs a few synchronous operations during initialization, but is forever async after that.  By the time code is able to get a reference to Gloda, it is already done with its synchronous work.

The auto-complete code is just bad because we issue two long-running async tasks in quick succession.  This makes it very likely that the first one will cause the second one to block.
Comment on attachment 404711 [details] [diff] [review]
v1 change ab card existence caching to always cache and use notifications to keep it up-to-date v1

Does onItemAdded want to set _hasAddressbookCard to true? If not, a comment would be in order...

   onItemAdded: function ab_indexer_onItemAdded(aParentDir, aItem) {
+    if (!(aItem instanceof Ci.nsIAbCard))
+      return;
+
+    this._log.debug("Received Card Add Notification");
+    let identity = GlodaCollectionManager.cacheLookupOneByUniqueValue(
+      Gloda.NOUN_IDENTITY, "email@" + aItem.primaryEmail.toLowerCase());
+    if (identity)
+      identity._hasAddressBookCard = false;
(In reply to comment #20)
> (From update of attachment 404711 [details] [diff] [review])
> Does onItemAdded want to set _hasAddressbookCard to true? If not, a comment
> would be in order...

Whoops.  I fixed that, but it ended up in the patch where the unit test is... I'll push that back down the stack.
Comment on attachment 404711 [details] [diff] [review]
v1 change ab card existence caching to always cache and use notifications to keep it up-to-date v1

cool, r=me, once the onItemAdded part is fixed.
Attachment #404711 - Flags: review?(bienvenu) → review+
I think the state of the bug is that there are two known problematic things:

1) address book card lookup is allegedly very slow, at least for bienvenu, and even with the landed patch.

2) executeAsync is not actually async and will hitch pretty badly.

The fundamental problem of #2 is not going to get fixed for 3.0 (because it needs to be fixed on 1.9.1 and the time/risk factors don't work out).  There is probably some low-hanging mitigation we can do, but I can't get to that yet.

Which leaves #1.  So I'm assigning this over to bienvenu to provide feedback about the address book performance and/or try and profile that and/or do something about that and/or try and get Standard8 to do something about that.  It should probably be set back to me once there is nothing actionable left to do about the address book.
Assignee: bugmail → bienvenu
I'd definitely like to get some stats/more info on #1.

For each search (or a typical one), how many times are we calling GlodaUtils.cardForEmailAddress?

From tests I've done previously, unless we're hitting it loads of times I believe the impact should be small (it is compose autocomplete that is really slow for other reasons).
I think the address book patch actually helped, and the slowness I saw after that was due to #2, in particular, doing a search while indexing is going on.

With the patch, we will call cardForEmailAddress once per unique address in the to/from/cc list. I was doing a search that had 350 or so unique addresses. Prior to the patch, we would not cache negative results, so if I had 1500 non-unique addresses in the message hits, and none of them happened to be in my address book, we were doing a ton of linear searches over a large address book (1400 entries).
Okay, so with #1 dealt with, it sounds like the problem is that searching while indexing is active sucks.  (This is not surprising.)

Gloda has support for suppressing indexing.  It is currently a boolean and slaved to online/offline.  I would augment it to support a counter for 'transient suppression'.  I call it transient because it is very easy to conceive of the search window throwing an exception and never actually remembering to tell the indexer it can start back up.  So we make the indexer assume that if it hasn't heard anything for some number of seconds it should just zero that suppression, which lets us avoid ugly failure modes.  We could try and do something clever where the timeouts extend when we see query activity, but it's easy to imagine pathological situations with that.

We probably want to initiate suppression as soon as we get any typing in the global search mode box.
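A sketch of the counter-plus-watchdog idea follows; suppressIndexing matches the boolean gloda already exposes, but everything else (names, the 30-second timeout, the use of setTimeout in a window context) is an illustrative assumption:

  // Sketch only; GlodaIndexer import elided, helper names made up.
  var GlodaSearchSuppressor = {
    _count: 0,
    _watchdog: null,
    kResetAfterMs: 30000, // assume 30s of silence means the search UI forgot us
    noteSearchActivity: function() {
      this._count++;
      GlodaIndexer.suppressIndexing = true;
      this._armWatchdog();
    },
    noteSearchCompleted: function() {
      if (--this._count <= 0)
        this._reset();
    },
    _armWatchdog: function() {
      // Failure-mode safety: if the search window throws and never tells us it
      // is done, zero the suppression rather than leaving indexing disabled
      // for the rest of the session.
      if (this._watchdog)
        clearTimeout(this._watchdog);
      var self = this;
      this._watchdog = setTimeout(function() { self._reset(); },
                                  this.kResetAfterMs);
    },
    _reset: function() {
      this._count = 0;
      this._watchdog = null;
      GlodaIndexer.suppressIndexing = false;
    }
  };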

The other mitigation we likely want to do is make the gloda contact/identity completion queries run in serial rather than parallel.
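Serializing the two completion queries could look roughly like this; the query-building helpers are hypothetical and the listener shape is only assumed to follow gloda's onQueryCompleted convention:

  // Sketch: run the identity query only after the contact query finishes, so
  // the two table scans never pile up on the async thread at the same time.
  function completeContactsThenIdentities(aMatchString, aDone) {
    let contactQuery = buildContactQuery(aMatchString); // hypothetical helper
    contactQuery.getCollection({
      onItemsAdded: function() {}, onItemsModified: function() {},
      onItemsRemoved: function() {},
      onQueryCompleted: function(aContactColl) {
        let identityQuery = buildIdentityQuery(aMatchString); // hypothetical
        identityQuery.getCollection({
          onItemsAdded: function() {}, onItemsModified: function() {},
          onItemsRemoved: function() {},
          onQueryCompleted: function(aIdentityColl) {
            aDone(aContactColl.items, aIdentityColl.items);
          }
        });
      }
    });
  }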
Assignee: bienvenu → bugmail
OS: Mac OS X → All
Hardware: x86 → All
Summary: Selecting on option for Search hangs thunderbird → Gloda search while indexing hangs and is very slow; search activity should suppress indexing
Have the address book changes landed?  I can tell you if that's made an improvement.
(In reply to comment #28)
> Have the address book changes landed?  I can tell you if that's made an
> improvement.

Yes, the changes to have gloda cache the result of address book lookups have landed.  Current and recent nightlies have this.
It still feels slow, and nothing is said in the activity manager about indexing going on.  Shark profile shows me this:
	55.0%	55.0%	libSystem.B.dylib	semaphore_timedwait_signal_trap
	0.0%	55.0%	libSystem.B.dylib	 _pthread_cond_wait
	0.0%	55.0%	libSystem.B.dylib	  pthread_cond_timedwait
	0.0%	55.0%	libnspr4.dylib	   pt_TimedWait
	0.0%	55.0%	libnspr4.dylib	    PR_WaitCondVar
	0.0%	40.0%	libnspr4.dylib	     PR_Wait
	0.0%	5.0%	thunderbird-bin	     XPCJSRuntime::WatchdogMain(void*)
	0.0%	5.0%	libxpcom_core.dylib	     TimerThread::Run()
	0.0%	5.0%	thunderbird-bin	     nsHostResolver::GetHostToLookup(nsHostRecord**)
	24.8%	24.8%	libSystem.B.dylib	semaphore_wait_signal_trap
	0.0%	22.3%	libSystem.B.dylib	 _pthread_cond_wait
	0.0%	22.3%	libSystem.B.dylib	  pthread_cond_wait
	0.0%	22.3%	libnspr4.dylib	   PR_WaitCondVar
	0.0%	12.3%	libnspr4.dylib	    PR_Wait
	0.0%	5.0%	thunderbird-bin	    nsSSLThread::Run()
	0.0%	5.0%	thunderbird-bin	    nsCertVerificationThread::Run()
	0.0%	2.5%	libSystem.B.dylib	 pthread_mutex_lock
	0.0%	2.5%	libsqlite3.dylib	  pthreadMutexEnter
	0.0%	2.5%	libsqlite3.dylib	   sqlite3_finalize
	0.0%	2.5%	thunderbird-bin	    mozStorageStatement::Finalize()
	0.0%	0.0%	libsqlite3.dylib	   sqlite3LockAndPrepare
	0.0%	0.0%	thunderbird-bin	    mozStorageConnection::ExecuteAsync(mozIStorageStatement**, unsigned int, mozIStorageStatementCallback*, mozIStoragePendingStatement**)
	0.0%	0.0%	thunderbird-bin	    mozStorageStatement::Initialize(mozStorageConnection*, nsACString_internal const&)
The profile isn't perfect since I had to start shark, switch to thunderbird and then type.  I could get rid of the locking from mozStorageStatement::Initialize on 1.9.1 probably, but that doesn't look like it is what's killing us either.
Er...that was not just the main thread which means it's not so useful.  I just ran it again, and it was much much faster, and the only thing showing up in the profile is sqlite3_finalize.  Caching may be good enough to get on for 1.9.1 if we can convince drivers that it is safe.
(In reply to comment #31)
> profile is sqlite3_finalize.  Caching may be good enough to get on for 1.9.1 if
> we can convince drivers that it is safe.

You lost me.  What caching?
Just taking a part of bug 506022 - namely http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageStatement.cpp#282 - and storing the async statement for later use.  We'll save the finalize and the cost of preparing the statement each time it is used.
The async code in 1.9.1 still traffics in sqlite3_stmt instances which is where the bindings get stored.  From a multiplicity perspective this means we need to create a lot of these unless we limit the number in flight.  (It also uses sqlite3_transfer_bindings/sqlite3TransferBindings to transfer that state from the synchronous statement which requires access to the mutex.)  Would we maintain a pool of cloned statements that have no state on them?  That definitely saves the cloning/finalization leaving only the binding transfer, but the pool part sounds ugly.

Unless we take bug 490085 which adds support for storing the bindings independent of the sqlite3_stmt (at which point follow-on stuff like bug 506022 is comparatively trivial), I'm not sure there are any really easy wins, but it's been a while since I've really dug into the 1.9.1 storage code.
Oh hrm - I completely forgot about bindings.  I'm not sure we have anything easy to take for 1.9.1 to help this.  The good news, I guess, is that Thunderbird 3.1 is supposed to come out shortly afterward built on 1.9.2, right?  This has those fixes.
(In reply to comment #35)
> Oh hrm - I completely forgot about bindings.  I'm not sure we have anything
> easy to take for 1.9.1 to help this.  The good news, I guess, is that
> Thunderbird 3.1 is supposed to come out shortly afterward built on 1.9.2,
> right?  This has those fixes.

That is the plan.  And quite a good incentive too :)
Target Milestone: --- → Thunderbird 3.0rc1
It would be useful to know how many people are still experiencing unusable levels of performance for gloda searches while indexing is active.  (People using debug builds without the patch to disable the nsAutolock debug issues in 1.9.1 don't count.)

I'm a bit concerned that the quick-search widget (under edge-cases involving tab switching) and faceting UI are simply not bulletproof enough to trivially trigger gloda indexing suppression without a risk of disabling gloda indexing entirely for the session.  Which is to say, given how soon we want to cut the RC, I think it's a bad idea to do this unless performance is absurdly poor for people who fall within 1 std-dev of average.  (Which means ruling out people searching for a term that 10,000 of their messages match in an account with 100,000 messages.)
Whiteboard: [no l10n impact] dupeme → [no l10n impact][re-triage]
Blocking-, given comment 37, unless we have some reason to believe that a large number of people are seeing this.
Flags: blocking-thunderbird3+ → blocking-thunderbird3-
For what it's worth, I've been using autocomplete for addresses recently, and that seems to hang just as badly as this, so maybe the address lookup is what I'm seeing.
I tested early this morning with 20091107 on a slowish vista laptop and I don't see a problem with operating while indexing is going strong.  But then a) the index wasn't yet fully populated and b) I'm not 100% sure I understand the problem.
I should note we do have one reliable way to fix the async database problem.  If we have our own release branch of 1.9.1, we can just land the 1.9.2 storage implementation (sans the most recent change to Connection::Close) on our branch and everything becomes awesome immediately.

The storage implementation is a reasonably atomic set of code, so it would be a pretty simple operation and unlikely to present any meaningful trouble in merging other 1.9.1 branch changes other than the need to merge at all.
I believe we'd need to re-land it for each release off of the 1.9.1 branch; Standard8 can confirm if that's correct.
Whiteboard: [no l10n impact][re-triage] → [no l10n impact]
(In reply to comment #42)
> I believe we'd need to re-land it for each release off of the 1.9.1 branch;
> Standard8 can confirm if that's correct.

We'd either have to re-land it each time, or port/land all the security fixes on our branch. Neither of which we really want to do IMO.
(In reply to comment #41)
> I should note we do have one reliable way to fix the async database problem. 
> If we have our own release branch of 1.9.1, we can just land the 1.9.2 storage
> implementation (sans the most recent change to Connection::Close) on our branch
> and everything becomes awesome immediately.
> 
> The storage implementation is a reasonably atomic set of code, so it would be a
> pretty simple operation and unlikely to present any meaningful trouble in
> merging other 1.9.1 branch changes other than the need to merge at all.

I've ported 1.9.2 storage to 1.9.1 and made it possible to use it in a fully async manner for gloda's needs.

pbranches can be found here:
http://hg.mozilla.org/users/bugmail_asutherland.org/m191-with-m192-storage-pbranches/


pbranches are as follows, in order of application:

- storage-from-1.9.2: This is literally the storage and sqlite3 dirs copied over from 1.9.2.  No other changes.

- storage-from-1.9.2-fixups: These are patches that layer on top of the 1.9.2 version of storage.  These mods mainly port the Mutex abstraction and put back in Makefile stuff that 1.9.1 requires.

- storage-full-async: Storage on 1.9.2 is still not fully async-safe for gloda; this fixes that problem.  For example, gloda has to generate SQL queries on the fly.  With stock 1.9.2 storage, this hangs.  The main thing gloda had to give up is binding arguments by name.


I have verified that things are really as async as I expect by setting a conditional breakpoint on sqlite3_mutex_enter where the condition is that we are the main thread.  I have done this for: a newly added storage test; for the gloda (with minor patch) tests test_index_messages_local.js and test_query_messages_local.js; and for actually running thunderbird.  In all cases I do not set the conditional breakpoint until after gloda has gotten beyond its synchronous bootstrap stage.

While the gloda results do not really come any faster with my patch, they do avoid locking up the main thread.  This is pretty nice.  For example, when I search on the string "all" from my special build, Thunderbird does not lock up and I can watch that spinny 'searching' animation happily.  When I do the same search on a normal build, stuff locks up and is generally sad.

I want to spin a try build so I can share the awesomeness, but I'm unclear on how I can get the try-server to use the right base for the 'mozilla' subdir.


I would argue that we want this change for 3.0 and it is worth another release candidate.  I would also argue that the pbranches (as structured) should allow us to very easily keep the patches up-to-date so that we can easily patch each new 1.9.1.x release to the state we require.

I think the canonical example for why we need this is doing a gloda search for your own first name.  If that doesn't lock up your thunderbird, I don't know what will.  Now imagine if your Thunderbird hadn't locked up.  Nice, eh?
Flags: blocking-thunderbird3- → blocking-thunderbird3?
1.9.2 storage provides a number of ways we could make the gloda async usage more efficient, but this patch does not attempt to do any of that.  This patch just stops us from making our patched-up storage attempt to acquire the sqlite mutex on the main thread.
Depends on: 528802
(In reply to comment #44)
> I would argue that we want this change for 3.0 and it is worth another release
> candidate.  I would also argue that the pbranches (as structured) should allow
> us to very easily keep the patches up-to-date so that we can easily patch each
> new 1.9.1.x release to the state we require.

So that's not quite as easy as you suggest. It doesn't deal with nightly or developer builds, both of which would need these patches in order to fully replicate the release, so that when we write patches or have Gecko landings affecting those areas we can ensure we don't regress the actual release build (which I hadn't thought of before).

To me, this says we need to maintain our own gecko 1.9.1 release branch. Even if pbranches make it simpler there is still going to be what I expect is a non-small cost to us for updating this or keeping it in sync with 1.9.1 in some form.


I would also point out that the complete patch branch includes changes to idls including the removal of at least one interface. It also includes additions and maybe some changes to xpcom (not just sqlite).

The changes to idls are IMO likely to cause issues for any extensions that use sqlite (or end up using it) - our interfaces won't be the same as for Firefox based on the same gecko version, which I can see causing confusion. Even if it doesn't cause confusion, changing interfaces after we've started building the first rc seems wrong especially when we've already done an add-on outreach.

The changes to xpcom mean we're also going to need to watch the relevant parts of 1.9.2 for security fixes in those areas and port them across if necessary (which we'll also have to do for the branched sqlite code).

I'm not against picking up a big perf improvement, but I am concerned that we're going to give ourselves more issues than we expect.
(In reply to comment #46)
> So that's not quite as easy as you suggest. It doesn't deal with nightly builds
> or developer builds both of which would need these patches in to fully
> replicate the release so that when we write patches or have gecko landings
> affecting those areas that we can then ensure we don't regress the actual
> release build (which I hadn't thought of before).

I was assuming we might lock comm-1.9.1's suggested mozilla branch to our release branch du jour for automated thunderbird builds of any importance.  So we'd go directly from the tag for Thunderbird 3.0 to the tag for Thunderbird 3.0.1 to the tag for Thunderbird 3.0.2.  We would not just sit on the tip of 1.9.1.

My gloda changes thus far do not actually require anything about 1.9.2 storage for correct operation.  Developer and/or tinderbox builds could easily build against 1.9.1 tip; presumably the developers would not have major problems with the lockups, and if they did, they could pull the pbranches directly.


> To me, this says we need to maintain our own gecko 1.9.1 release branch. Even
> if pbranches make it simpler there is still going to be what I expect is a
> non-small cost to us for updating this or keeping it in sync with 1.9.1 in some
> form.

Our own 1.9.1 release branch also works, but I agree it is potentially much more work.


> I would also point out that the complete patch branch includes changes to idls
> including the removal of at least one interface. It also includes additions and
> maybe some changes to xpcom (not just sqlite).
>
> The changes to idls are IMO likely to cause issues for any extensions that use
> sqlite (or end up using it) - our interfaces won't be the same as for Firefox
> based on the same gecko version, which I can see causing confusion. Even if it
> doesn't cause confusion, changing interfaces after we've started building the
> first rc seems wrong especially when we've already done an add-on outreach.

The XPCOM changes are for deadlock detector cleanup and I do not believe impact any actual signatures.

I agree that the storage IDLs have a number of changes and that these would impact binary extensions.  However, I am not aware of any binary extensions that would be affected by this (I checked enigmail), and I think it is very easy to make the case that the user-pleasing operation of a default feature is much more important than a temporary setback for an extension that has already chosen the hard route and will need to update itself to compensate for 1.9.2 anyways.

 
> The changes to xpcom mean we're also going to need to watch the relevant parts
> of 1.9.2 for security fixes in those areas and port them across if necessary
> (which we'll also have to do for the branched sqlite code).

Storage is a privileged API only accessible to chrome.  The XPCOM changes are for a mutex abstraction which is only used by storage (on my patched 1.9.1).  Any security exploit that involves any of the patched code has already broken out of the content sandbox and it is likely that putting it back in the sandbox is the actual fix.  In the event that content is able to somehow create malformed content that would affect storage/SQLite, I would expect the un-patched 1.9.1 branch to suffer from the problem as well.  (Also, I'd rather expect libmime to be a more likely point of exploitation.)

I am currently the primary reviewer for the storage component; I am likely to notice any patches going by.


> I'm not against picking up a big perf improvement, but I am concerned that
> we're going to give ourselves more issues than we expect.

I agree.  This is definitively the worst possible time to be attempting to address this problem.  However, it's a pretty bad problem and any other attempt at mitigation short of forking storage into comm-central is going to be incomplete.


per irc it sounds like I should be able to spin try builds.  I'll try that now.
Note that I'm starting to think the hanging I'm seeing is a result of LDAP lookups for addresses, since I see the same hang every time I compose an e-mail and add an address that I'm sending the mail to.
sdwilsh: can you confirm that you updated your LDAP config to match the new settings (https://intranet.mozilla.org/EmailAccess#Thunderbird_.26_IMAP_clients) ?
(In reply to comment #49)
> sdwilsh: can you confirm that you updated your LDAP config to match the new
> settings
> (https://intranet.mozilla.org/EmailAccess#Thunderbird_.26_IMAP_clients) ?

This triggered my memory - I had 3 LDAP accounts pointing to a non-existent localhost server (which hasn't existed for ages). I removed them and I no longer see beachball hangs.
(In reply to comment #50)
> This triggered my memory - I had 3 LDAP accounts pointing to a non-existent
> localhost server (which hasn't existed for ages). I removed them and I no
> longer see beachball hangs.

This is possibly still not the same issue; my test results aren't consistent, and it may just be that restarting TB reset broken things and made it "better".
(In reply to comment #49)
> sdwilsh: can you confirm that you updated your LDAP config to match the new
> settings
> (https://intranet.mozilla.org/EmailAccess#Thunderbird_.26_IMAP_clients) ?
This machine I'm currently on was not fully updated (although now I'm getting invalid SSL certificate errors), but I'm fairly sure my work machine is up to date.  The hangs were much worse here though.

Does the GLODA search look through directory server stuff synchronously?  Seems like we'd not want to do that in the case of a gloda search (hitting the network is slow).
Comment on attachment 412469 [details] [diff] [review]
make gloda stop binding by name in the limited cases where it does so v1

Asking for review since we're planning to land this.
Attachment #412469 - Flags: review?(bienvenu)
Attached file error log for test_query_core.js (obsolete) —
I'm getting a test failure w/ this test.
ok, I get that failure w/o the patch as well...
Attachment #412469 - Flags: review?(bienvenu) → review+
This fixes the glitch embarrassingly manifested by the patch.  There was a binding coercion pseudo-regression.  The binding-by-name native support was JS Date-object aware.  Our custom variant binding code is not JS Date aware and so got confused by this.  It's a pseudo-regression because it actually is a bug that we were relying on the automated JS Date coercion!  SQLite does not actually have a date type (for our purposes), so we cannot round-trip it and most definitely should not be putting it in.  Carrying forward review since the intent and body of the patch are maintained and the change is unit tested.
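For illustration, the kind of explicit coercion this implies (gloda stores dates as PRTime microseconds; the helper name and the binding call below are illustrative rather than a quote of the patch):

  // Convert a JS Date into a plain number we can round-trip through SQLite,
  // instead of handing the variant binding code a Date object and hoping.
  function dateToBindableValue(aDate) {
    return aDate.valueOf() * 1000; // milliseconds -> PRTime microseconds
  }
  // Assumes "statement" and "message" are in scope at the call site.
  statement.bindInt64Parameter(0, dateToBindableValue(message.date));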
Attachment #412469 - Attachment is obsolete: true
Attachment #412755 - Attachment is obsolete: true
Attachment #412811 - Flags: review+
Attachment #412811 - Flags: approval-thunderbird3?
Comment on attachment 412811 [details] [diff] [review]
[landed] make gloda stop binding by name in the limited cases where it does so v2

a=Standard8
Attachment #412811 - Flags: approval-thunderbird3? → approval-thunderbird3+
Attachment #412811 - Attachment description: make gloda stop binding by name in the limited cases where it does so v2 → [landed] make gloda stop binding by name in the limited cases where it does so v2
Comment on attachment 412811 [details] [diff] [review]
[landed] make gloda stop binding by name in the limited cases where it does so v2

pushed to trunk:
http://hg.mozilla.org/comm-central/rev/74b53ca28a09

pushed to comm-1.9.1:
http://hg.mozilla.org/releases/comm-1.9.1/rev/885ec955b89e
Comment on attachment 412811 [details] [diff] [review]
[landed] make gloda stop binding by name in the limited cases where it does so v2

Checked into relbranch for build 2 of RC1:
http://hg.mozilla.org/releases/comm-1.9.1/rev/d9d36f81a5da
Flags: blocking-thunderbird3? → blocking-thunderbird3+
This is just a diff from the "storage-full-async" (p)branch from my repo against "default" with the pbranch meta-data cut out of the patch.  My repo is here:
http://hg.mozilla.org/users/bugmail_asutherland.org/m191-with-m192-storage-pbranches/

This is the repository that was pulled for the following successful try builds:

http://s3.mozillamessaging.com/build/try-server/2009-11-17_06:04-asutherland@asutherland.org-m191-full-async-v2c/asutherland@asutherland.org-m191-full-async-v2c-mail-try-linux.tar.bz2

http://s3.mozillamessaging.com/build/try-server/2009-11-17_06:04-asutherland@asutherland.org-m191-full-async-v2c/asutherland@asutherland.org-m191-full-async-v2c-mail-try-mac.dmg

http://s3.mozillamessaging.com/build/try-server/2009-11-17_06:04-asutherland@asutherland.org-m191-full-async-v2c/asutherland@asutherland.org-m191-full-async-v2c-mail-try-win32.zip

It is possible that I made one additional push and merge after triggering those try builds; I had missed something in the storage 'test' subdir which would not have triggered a try build failure because it wouldn't build the tests.

I am marking this for review by gozer mainly so it's up on his radar, but also to provide me with feedback as to whether this works nicely.
Attachment #412876 - Flags: review?(gozer)
Attachment #412876 - Attachment is patch: false
Attachment #412876 - Attachment mime type: text/plain → application/octet
Comment on attachment 412876 [details]
mozilla-1.9.1 patch, storage backport and full async changes v1

Doesn't apply cleanly to mozilla-1.9.1 @ THUNDERBIRD_3_0rc1_RELEASE
Attachment #412876 - Flags: review?(gozer) → review-
I had apparently fallen relatively behind repo-wise due to an unhappy .hgrc.  Luckily, as per my previous assertions, it's pretty easy to update this patch no matter the drift.  I just pexport'ed the patch, re-clobbered sqlite/storage from 1.9.2 for the base patch*, then imported my two additional patch layers on top.  (I had a brief diversion where I was hoping the pbranch would magic things up for me, but it got confused by the complex situation and it was easier/safer to avoid magic given the straightforward nature of the patches.)

The pbranch in my repo for this is "tbrc1-storage-full-async" and it is based off of THUNDERBIRD_3_0rc1_RELEASE at the bottom.

* Actually, I applied the exported sqlite3/storage 1.9.2 patch.  There were some conflicts because 1.9.1 got a sqlite upgrade, so I used meld to verify the differences and clobber sqlite and storage.
Attachment #412876 - Attachment is obsolete: true
Attachment #412884 - Flags: review?(gozer)
(I have just kicked off another set of try builds using the above branch which is the basis for the patch to paranoia-check.)
I get the feeling I'm misunderstanding something here.  Did 1.9.1 really get a sqlite upgrade that 1.9.2 didn't get?
(In reply to comment #64)
> I get the feeling I'm misunderstanding something here.  Did 1.9.1 really get a
> sqlite upgrade that 1.9.2 didn't get?

1.9.1 is on 3.6.16.1, 1.9.2 is on 3.6.20.
Comment on attachment 412884 [details]
mozilla-1.9.1 patch, storage backport and full async changes v2

comparing with http://hg.mozilla.org/releases/mozilla-1.9.1/
searching for changes
changeset:   26588:ebb9e52cff37
branch:      COMM1915_20091112_RELBRANCH
tag:         tip
parent:      26520:57f71400f4cf
user:        Philippe M. Chiasson <gozer@mozillamessaging.com>
date:        Tue Nov 17 14:39:57 2009 -0500
summary:     Bug 519543 -  Gloda search while indexing hangs and is very slow; search activity should suppress indexing. p=asuth r=Standard8,gozer a=Standard8
Attachment #412884 - Flags: review?(gozer) → review+
This is fixed for RC 1 build 2. I'm currently proposing that we keep this bug open until we've got a solution for having the core patch on the security branch.
Whiteboard: [no l10n impact] → [no l10n impact][fixed RC1 build 2, needs solution for security/stability branch]
Target Milestone: Thunderbird 3.0rc1 → Thunderbird 3
Whiteboard: [no l10n impact][fixed RC1 build 2, needs solution for security/stability branch] → [no l10n impact][fixed RC1 build 2, needs solution for security/stability branch, and core bug 528802 landing on trunk]
Depends on: 531309
fwiw, i'm not finding any obvious open bugs to dup to this.
Severity: normal → major
Flags: blocking-thunderbird3.1?
Keywords: perf
I was unable to sanity check this on Windows because my -j8 builds started hanging all over the place and my non-j build is still going.

This patch ends up being markedly simpler than the previous patch because we already have 1.9.2's storage.  This patch amounts to:

- SQLite 3.6.22 plus the mozilla configure.in change for this.  1.9.2 is only on 3.6.16.1.  Our 1.9.2 build was on 3.6.20.  This is what 1.9.3 is using.  The decision to upgrade us is made on the basis that we have some crashes that cast some aspersions on the SQLite FTS3 code.  Since there were non-trivial changes to the FTS3 code in 3.6.21 (and 3.6.22 has extra goodness) as the SQLite team increased their support level of FTS3, I think it makes the most sense to go with the latest and greatest.  They can support crashes in it better, plus some crashes may already have been erased.  [This clobber was manually performed.]

- Minor changes that bring 1.9.2 storage in line with what we shipped in Thunderbird 3.0/3.0.1.  My earlier patch had actually backported the then-current trunk state of storage rather than 1.9.2's, so we have a very minor addition of an interface and some deprecation marking.  I did not revert us to 1.9.2 because I did not want to cause apparent inconsistencies in the deprecation marking. [This was manually performed using 'meld' to compare the differences between our patched 1.9.1 tree and 1.9.2's storage.  I additionally manually verified that our other changes to the xpcom tree were already sufficiently reflected in the xpcom tree and did not need to be made.  The bustage fix we made for 3.0.1 where tracemalloc was required for the bloat boxes was one of the things that got mooted out of existence.]

- The full async changes as we shipped in 3.0/3.0.1 which defer synchronous statement creation until something demonstrates a need for it. [This patch was re-applied with no manual intervention.  All fuzz was deemed reasonable.]

- The patch that fixes up everything that my 'full async' changes in the preceding bullet broke.  [This patch was part of the previous patch that was re-applied with no manual intervention and where all fuzz was deemed reasonable.]

I did not have time to re-evaluate the idiomatic situation that introduced the failures that the last patch segment fixes, but believe that this is low-ish probability and would be made obvious by a change in the number of test failures with the application of this patch.
Attachment #421387 - Flags: review?(gozer)
Attachment #421387 - Attachment description: mozilla-1.9.2 patch, minor storage 1.9.3 delta that was in our 1.9.1 build and full async changes v1 → [checked in] mozilla-1.9.2 patch, minor storage 1.9.3 delta that was in our 1.9.1 build and full async changes v1
Attachment #421387 - Flags: review?(gozer) → review+
Comment on attachment 421387 [details]
[checked in] mozilla-1.9.2 patch, minor storage 1.9.3 delta that was in our 1.9.1 build and full async changes v1

comparing with http://hg.mozilla.org/releases/mozilla-1.9.2
searching for changes
changeset:   33518:e493db62357b
branch:      COMM192_20100119_RELBRANCH
tag:         tip
user:        Philippe M. Chiasson <gozer@mozillamessaging.com>
date:        Fri Jan 15 21:16:00 2010 -0500
summary:     Bug 519543 - mozilla-1.9.2 patch, minor storage 1.9.3 delta that was in our 1.9.1 build and full async changes p=asuth,r=gozer Thunderbird release branch ONLY. CLOSED TREE
Flags: blocking-thunderbird3.1? → blocking-thunderbird3.1+
Unfortunately, the patch that landed unintentionally included partial changes to rename the sqlite DLL which caused the tree to go red.  Here's a patch that may fix the problem.  I'm mid-way through a build containing this patch.
Committed, along with a whitespace fix to avoid buildbot getting all freaked out.  asuth has graciously agreed to watch the tree.

changeset:   33520:a4a279ef64e4
branch:      COMM192_20100119_RELBRANCH
parent:      33518:e493db62357b
user:        Dan Mosedale <dmose@mozilla.org>
date:        Sat Jan 16 16:05:22 2010 -0800
summary:     Undo unintentional renaming of sqlite DLL to mozsqlite to un-break the tree so we can generate nightly builds.  Landing on the Thunderbird-release branch ONLY.  r=asuth (though not strictly necessary, as this is a bustage fix), a=me. CLOSED TREE

changeset:   33521:56c9674da5ab
tag:         tip
parent:      33519:b23eb9e516f8
user:        Dan Mosedale <dmose@mozilla.org>
date:        Sat Jan 16 16:34:40 2010 -0800
summary:     Avoid unnecessary buildbot confusion by touching whitespace in a README file so that a bustage-fix commit on another branch can't make 1.9.2 builds appear to be busted when they're actually not.  a="bustage"-fix
As part of my attempt to minimize our delta against 1.9.2 trunk for storage (versus slamming us all the way up to what 1.9.3 had going on) we ended up keeping around a dubious storage test that got angry with the more recent version of SQLite.  That test was removed in http://hg.mozilla.org/mozilla-central/rev/1192461c259d.

My pushed fix (removing the test) is here:
http://hg.mozilla.org/releases/mozilla-1.9.2/rev/818851040d95

My avoid-breaking mozilla-1.9.2 fix whitespace push is here:
http://hg.mozilla.org/releases/mozilla-1.9.2/rev/8400d5566f44

I have integrated dmose's pushed fix and this change into my local pbranch stack should we need to do this again, but we should be able to do the correct thing vis-a-vis storage in the near future.
r=me to kill that test on 1.9.2 proper if it makes your life easier.  I'm going to try to advocate upgrading SQLite on branch anyway, so it'll have to be removed eventually.
Standard8, should we upgrade our SQLite on our 3.0.x release branch to 3.6.22?  We are currently on 3.6.20, and there may be useful fixes.  For example, bug 536308 and bug 536960 are likely candidates to be fixed if the problem is really SQLite and not just Thunderbird corrupting memory.
I'd go so far as to argue that we should upgrade 1.9.0 with 3.6.22, but I need to get it on 1.9.2 first.
(In reply to comment #76)
> Standard8, should we upgrade our SQLite on our 3.0.x release branch to 3.6.22? 
> We are currently on 3.6.20, and there may be useful fixes.  For example, bug
> 536308 and bug 536960 are likely candidates to be fixed if the problem is
> really SQLite and not just Thunderbird corrupting memory.

That's pretty hard for me to assess, considering that I've got no idea how SQLite development works, how it is tested, or how it may affect integration with our code...

Given 3.1a1 is on 3.6.22, I'd certainly be happy considering this upgrade for a 3.0.3 release. Not sure about 3.0.2; I think I'd prefer a slightly longer baking time, as we know we don't have many users on 3.1/3.2. So we could potentially push an extra patch just after we build 3.0.2 and then stick the 3.0.3pre nightlies on them.
SQLite testing is many many many orders of magnitude better than our own:
http://www.sqlite.org/testing.html

I'm fine with not doing it for this release, just thought I'd raise the issue.
blocking-thunderbird3.1: --- → beta1+
Flags: blocking-thunderbird3.1+
I landed the previous patch that we had for 3.1a1 onto latest mozilla-1.9.2 today:

http://hg.mozilla.org/releases/mozilla-1.9.2/rev/f20ea707314c

Hence that's what we're using for 3.1b1, so moving this out to 3.1b2.
blocking-thunderbird3.1: beta1+ → beta2+
I'm marking this WFM since we have done the release branch thing up through 3.1b1, bug 507414 has the comm-central trunk story in hand, and bug 545625 is tracking the comm-1.9.2 3.1b2+ 3.1 series resolution.

(The end result is that everything actionable lives on its own bug and this is not a tracker bug so it gets closed.)
Status: NEW → RESOLVED
blocking-thunderbird3.1: beta2+ → ---
Closed: 14 years ago
Depends on: 507414, 545625
Resolution: --- → WORKSFORME