Closed Bug 470329 Opened 16 years ago Closed 15 years ago

gloda indexer throttling logic needs to be adaptive

Categories: MailNews Core :: Database, defect
Priority: Not set
Severity: normal
Status: RESOLVED FIXED
Target Milestone: Thunderbird 3.0b3
People: (Reporter: asuth, Assigned: asuth)
Whiteboard: [no l10n impact]
Attachments: (1 file)

The gloda indexer currently uses two sets of constants (one for idle, one for an active user) to attempt to keep things reasonable.  It should instead be adaptive, to make sure things stay responsive while the user is active; when the user is idle, it's probably best to go full-bore.
libutil's Stopwatch class provides the CPU (user + system) time measurement I was hoping to use.  (http://mxr.mozilla.org/comm-central/source/mozilla/modules/libutil/public/stopwatch.h)  Unfortunately it is not exposed via XPCOM.  I did not see any other code in comm-central (outside of test code) using the getrusage or times functions that I would expect to be used for this.

It would be ideal to have access to the CPU time used; with only the wall clock at our disposal it would be hard to do anything sane, especially given that we might run into granularity problems.
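For reference, the POSIX measurement Stopwatch wraps boils down to this (Unix-only sketch; a real implementation needs a Windows path too):

#include <sys/times.h>
#include <unistd.h>

// times() returns elapsed wall-clock ticks and fills in user/system CPU
// ticks; sysconf(_SC_CLK_TCK) gives the tick rate to convert to seconds.
void SampleTimes(double *realSeconds, double *cpuSeconds)
{
  static const long ticksPerSec = sysconf(_SC_CLK_TCK);
  struct tms cpu;
  clock_t wall = times(&cpu);
  *realSeconds = (double)wall / ticksPerSec;
  *cpuSeconds  = (double)(cpu.tms_utime + cpu.tms_stime) / ticksPerSec;
}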
It's fairly easy to create an XPCOM interface to it in the mailnews code - why don't you just do that?
(In reply to comment #2)
> It's fairly easy to create an XPCOM interface to it in the mailnews code - why
> don't you just do that?

Probably will.  Was just adding a note to myself and cc'ing you in the hopes you had an even better solution :)
Flags: blocking-thunderbird3+
Whiteboard: [no l10n impact]
Attached patch v1 adaptive gloda indexing
Per (limited) discussion on bug 457949, Stopwatch makes its way into mailnews.  Per bienvenu/dmose, it makes its way into mailnews/base/util.  Per the fact that the code does much more than we need, I whittled it down to just what we use.  Per there being no non-XPCOM consumers, I crammed the logic of Stopwatch into nsStopwatch and did away with Stopwatch.

This patch adds a nice PERFORMANCE gloda debug output to help one observe the results of the algorithm.  The debug output can probably go away at some point, but it is relatively harmless.

I should note that our adaptive nature isn't the only thing that should improve the ridiculously bad performance observed by rkent.  An arbitrary (and known-sketchy at the time) decision to have kWorkDone not consume a token, combined with other changes, means I'm not sure we were actually charging tokens for the message indexing itself, just for the header traversal.  Combine that with a lack of adaptive heuristics on less-powerful hardware and you have a highway to the 100% CPU zone.  A token should now correspond to much less work, so altering our token rate will produce gentler changes in processor utilization.
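The general shape of the adaptive idea, as a hypothetical sketch (the target share, growth factor, and clamps here are illustrative, not the patch's actual values):

#include <algorithm>

// Hypothetical sketch: after each work slice, compare CPU time burned
// against wall-clock time and scale the token budget for the next slice.
double AdjustTokensPerSlice(double tokens, double cpuSecs, double wallSecs)
{
  const double kTargetCpuShare = 0.25;  // illustrative target while active
  const double kMinTokens = 1.0, kMaxTokens = 500.0;
  if (wallSecs <= 0.0)
    return tokens;  // guard against clock granularity
  double share = cpuSecs / wallSecs;
  if (share > kTargetCpuShare)
    tokens *= kTargetCpuShare / share;  // over budget: back off proportionally
  else
    tokens *= 1.1;                      // under budget: creep back up
  return std::min(kMaxTokens, std::max(kMinTokens, tokens));
}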
Attachment #360250 - Flags: superreview?(dmose)
Attachment #360250 - Flags: review?(dmose)
Whiteboard: [no l10n impact] → [no l10n impact][needs review dmose][target slushy]
I was quite surprised that this is not implemented yet: I updated to TB trunk a couple of days ago, enabled the gloda indexer, and found that my TB stayed usable and CPU usage always stayed below 50%. Perhaps there was some other problem in the code that sent my CPU to 100% previously when I enabled gloda.
this seems like a pretty decent improvement for my use case, in the sense that the slow-down with gloda is about half what it was w/o this patch.
(In reply to comment #7)
> this seems like a pretty decent improvement for my use case, in the sense that
> the slow-down with gloda is about half what it was w/o this patch.

But does it cross the frustration threshold?  I am thinking gloda is going to need indexing strategies (which it can cooperate with autosync on).  One important thing the strategy could do would be to distinguish indexing that is keeping up with recent inflow (or getting up to speed on recent messages for a first run) from indexing of old messages.  We could avoid bulk indexing of old data when the user is actively using Thunderbird.

I am also going to do some more profiling/tracing to figure out where the time is being spent a little more.
(In reply to comment #8)

> But does it cross the frustration threshold?  I am thinking gloda is going to
> need indexing strategies (which it can cooperate with autosync on).

That's a good question. I would say that it does. Keeping in mind that this is a debug build, the feeling I get is that gloda does enough work right away on notifications like delete that it gets in the way of loading the next message promptly.

Even though this is a debug build on a relatively slow machine (1.6 GHz), I do have more than enough memory, so it feels as fast as a machine that's slightly memory bound, which is pretty common.
(In reply to comment #9)
> That's a good question. I would say that it does. Keeping in mind that this is
> a debug build, the feeling I get is that gloda does enough work right away on
> notifications like delete that it gets in the way of loading the next message
> promptly.

Yeah, this sounds exactly like what we are probably doing.  And it sounds like something we should exactly not be doing.

Reasonable options include just kicking off indexing with a fixed delay, or (probably better) keeping the indexing kick-off delayed until no new events (that we care about) have happened for a fixed period of time.  The latter reduces to the former in the case of a one-off delete.  Assuming the delay is long enough, a user could go to town deleting things without gloda interfering.
 
> Reasonable options include just kicking off indexing with a fixed delay, or
> (probably better) keeping the indexing kick-off delayed until no new events
> (that we care about) have happened for a fixed period of time. 

That's what I did with the Spotlight indexer - set a timer when we get an event; clear the timer if we get another event before the timer fires, and set it again. Kick off indexing when the timer fires...
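In sketch form (illustrative names; not the Spotlight indexer's actual code):

#include <chrono>
#include <functional>

// Each event (re)arms a one-shot deadline; indexing kicks off only after
// a full quiet period passes with no further events.
class IndexingDebouncer
{
public:
  IndexingDebouncer(std::chrono::milliseconds quietPeriod,
                    std::function<void()> kickOffIndexing)
    : mQuietPeriod(quietPeriod), mKickOff(std::move(kickOffIndexing)) {}

  // Call on every interesting event (delete, move, ...); this clears and
  // re-sets the "timer".
  void OnEvent(std::chrono::steady_clock::time_point now)
  {
    mDeadline = now + mQuietPeriod;
    mArmed = true;
  }

  // Driven by a real timer or polled; fires once the quiet period elapses.
  void Tick(std::chrono::steady_clock::time_point now)
  {
    if (mArmed && now >= mDeadline) {
      mArmed = false;
      mKickOff();
    }
  }

private:
  std::chrono::milliseconds mQuietPeriod;
  std::function<void()> mKickOff;
  std::chrono::steady_clock::time_point mDeadline{};
  bool mArmed = false;
};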
Whiteboard: [no l10n impact][needs review dmose][target slushy] → [no l10n impact][needs review dmose][likely to slip]
Target Milestone: Thunderbird 3.0b2 → Thunderbird 3.0b3
(In reply to comment #11)
> That's what I did with the Spotlight indexer - set a timer when we get an
> event; clear the timer if we get another event before the timer fires, and set
> it again. Kick off indexing when the timer fires...

Bug 479008 spun off to track this change separately, since this patch will ideally land soon.
Comment on attachment 360250 [details] [diff] [review]
v1 adaptive gloda indexing

In general, this looks great, especially the level of commenting in the indexer.  Here are some minor things to fix with the stopwatch code (yeah, I know, most of this stuff isn't your fault, but since we're productionalizing this code for the first time...).

I'm partway through the indexer code; I'll finish that on Monday. 

* Please add license boilerplate to nsIStopwatch.idl, since it's a new file.

>+  /**
>+   * The total CPU time (user + system) accumulated between calls to
>+   * start/resume and stop.  You have to stop the stopwatch to cause this value
>+   * to update.
>+   */
>+  readonly attribute double cpuTime;
>+  /**
>+   * The total wall clock time accumulated between calls to start/resume and
>+   * stop.  You have to stop the stopwatch to cause this value to update.
>+   */
>+  readonly attribute double realTime;

Mentioning what units these numbers are in would be great.  :-)

>+++ comm-central.b34f8dbedf9a/mailnews/base/util/nsStopwatch.cpp	2009-02-02 21:55:38.000000000 -0800
>
>+NS_IMPL_ISUPPORTS1(nsStopwatch, nsIStopwatch)

It's good general policy to use classinfo (e.g. NS_IMPL_ISUPPORTS1_CI in this case) for scriptable classes whenever you can.
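That is, roughly (a sketch assuming the classinfo glue header of that era; a matching NS_DECL_CLASSINFO(nsStopwatch) declaration is also needed elsewhere, and placement details are omitted here):

#include "nsIClassInfoImpl.h"

// Sketch of the suggested change, not the landed patch.
NS_IMPL_ISUPPORTS1_CI(nsStopwatch, nsIStopwatch)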

>+nsStopwatch::nsStopwatch()
>+ : fTotalRealTime(0.0)
>+ , fTotalCpuTime(0.0)
>+ , fRunning(false)
>+{
>+#ifdef XP_UNIX
>+  // idempotent in the event of a race under all coherency models
>+  if (!gTicks)
>+    gTicks = (clock_t)sysconf(_SC_CLK_TCK);

It'd be good to check errno here, since if for some crazy reason sysconf fails and returns zero,
we'll end up dividing by zero down the line.
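For instance, something like this, assuming a fallback value is acceptable when sysconf() fails (100 ticks/sec is a common value, but not a guaranteed one):

#include <cerrno>
#include <unistd.h>

static clock_t InitTicks()
{
  errno = 0;
  long ticks = sysconf(_SC_CLK_TCK);
  // sysconf returns -1 (and may set errno) on failure; also guard against
  // a zero result so we never divide by zero later.
  if (ticks <= 0 || errno != 0)
    return (clock_t)100;  // assumed fallback; could also warn and bail
  return (clock_t)ticks;
}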

>+NS_IMETHODIMP nsStopwatch::Stop()
>+{
>+#ifndef XP_UNIX
>+  fStopRealTime = GetRealTime();
>+  fStopCpuTime  = GetCPUTime();
>+#else
>+  struct tms cpt;
>+  fStopRealTime = (double)times(&cpt) / gTicks;
>+  fStopCpuTime  = (double)(cpt.tms_utime+cpt.tms_stime) / gTicks;
>+#endif

As far as I can tell, this (and presumably the other) callers of GetRealTime() and GetCPUTime() shouldn't need to do this ifdefing at all, because both of those functions already have semantically identical ifdefs inside them.  Since it makes the code hard to read, please get rid of these.
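i.e., Stop() could shrink to roughly this (a sketch: the accumulation lines and the fStart* members are assumed from context, not copied from the patch):

NS_IMETHODIMP nsStopwatch::Stop()
{
  // GetRealTime()/GetCPUTime() already contain the platform ifdefs.
  fStopRealTime = GetRealTime();
  fStopCpuTime  = GetCPUTime();
  if (fRunning) {
    fTotalRealTime += fStopRealTime - fStartRealTime;  // assumed members
    fTotalCpuTime  += fStopCpuTime  - fStartCpuTime;
  }
  fRunning = false;
  return NS_OK;
}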

>+double nsStopwatch::GetRealTime()
>+{
>+#if defined(XP_MAC)
>+// return(double)clock() / gTicks;
>+   return(double)clock() / 1000000L;

Leaving this in production code feels fragile over the long term.  Unless there's some reason to believe that sysconf returns the wrong thing on the Mac, I'd suggest we assume that this is a holdover from the OS 9 days and go back to using gTicks.
(In reply to comment #13)
> >+#if defined(XP_MAC)

XP_MAC is System 9; only XP_MACOSX is defined for something that currently actually builds.
Comment on attachment 360250 [details] [diff] [review]
v1 adaptive gloda indexing

since we're not NSPR, can we use mFoo as member var names instead of fFoo? :-)
Attachment #360250 - Flags: superreview?(dmose)
Attachment #360250 - Flags: superreview+
Attachment #360250 - Flags: review?(dmose)
Attachment #360250 - Flags: review+
Comment on attachment 360250 [details] [diff] [review]
v1 adaptive gloda indexing

If it's not too hard to write a test for this functionality, that would be a fine thing.  Given where we are in the cycle, if it is hard, please at least set in-testsuite? on this bug so that it's not completely forgotten.

It would help code readability to add comments that explain why this.perfSampling is touched in the indexing and suppressIndexing setters rather than in relevant timer callbacks.  (Presumably the reason is that it's easier that way, and the amount of time superfluously counted gets lost in the noise?)

>+  /**
>+   * The maximum amount of time we will sleep between firings.  The reason we

Mentioning milliseconds here would be nice.

>@@ -833,17 +992,21 @@ var GlodaIndexer = {
>     // leave the folder if we haven't explicitly left it.
>     if (this._indexingFolder !== null) {
>       this._indexerLeaveFolder();
>     }
>     
>     this._indexingGlodaFolder = GlodaDatastore._mapFolderID(aFolderID);
>     this._indexingFolder = this._indexingGlodaFolder.getXPCOMFolder(
>                              this._indexingGlodaFolder.kActivityIndexing);
>-    
>+
>+    // the processor utilization required to enter a folder is not our
>+    //  fault; don't sample this.
>+    this.perfSampling = false;

I'm confused; why is the work entrained by opening a folder treated differently than the rest of the work caused by the indexer?  I suspect the reasoning is worth mentioning in the comments.

>@@ -870,16 +1033,18 @@ var GlodaIndexer = {
>       }
>       // we get an nsIMsgDatabase out of this (unsurprisingly) which
>       //  explicitly inherits from nsIDBChangeAnnouncer, which has the
>       //  AddListener call we want.
>       if (this._indexingDatabase == null)
>         this._indexingDatabase = this._indexingFolder.msgDatabase;
>       if (aNeedIterator)
>         this._indexerGetIterator();
>+      // re-enable performance sampling; we're responsible for our actions again
>+      this.perfSampling = true;

Does this want to be in a finally block in case there's an error?

r+sr=dmose with the comments addressed; thanks for the patch!
Whiteboard: [no l10n impact][needs review dmose][likely to slip] → [no l10n impact][has reviews; needs updated patch]
(In reply to comment #16)
> >+    // the processor utilization required to enter a folder is not our
> >+    //  fault; don't sample this.
> >+    this.perfSampling = false;
> 
> I'm confused; why is the work entrained by opening a folder treated differently
> than the rest of the work caused by the indexer?  I suspect the reasoning is
> worth mentioning in the comments.

Gloda is unable to throttle the processor utilization caused by opening an .msf file.  In fact, we can expect Thunderbird to eat as much processor as I/O will allow.  The adaptive logic gains no information from this; it would just cause the algorithm to back off, possibly to an extreme.

For example, my gmail All Mail folder is huge.  Thunderbird takes several seconds to open it.  Although one could claim that Thunderbird then owes a karmic debt to the rest of the system, if we want to repay that debt, it might be better to address it separately.

Yes, the comment should be expanded to discuss this.  (Assuming you don't propose changing our behaviour here.)
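One way to make the pause/resume robust (which would also answer the finally-block question from comment 16) is a guard that always re-enables sampling on the way out.  The indexer itself is JavaScript, so this C++ RAII sketch is purely illustrative:

// Wrap unthrottleable work (like opening the .msf) in a scope guard so
// the adaptive logic never sees it, even if the work errors out.
class SamplingPause
{
public:
  explicit SamplingPause(bool &samplingFlag) : mFlag(samplingFlag)
  {
    mFlag = false;  // stop billing ourselves for this work
  }
  ~SamplingPause() { mFlag = true; }  // resumes even on early return/throw
private:
  bool &mFlag;
};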
I think there is some sort of karmic debt, but it's not clear to me how important it is to address it.  It would certainly be fine to spin off addressing it to a separate bug that can be prioritized/ignored independently of this one.
Does it take several seconds to open the .msf file, or does it take several seconds from when you update the imap folder until you get the folder loaded notification? The latter may involve a ton of network traffic while we sync all the flags.
(In reply to comment #19)
> Does it take several seconds to open the .msf file, or does it take several
> seconds from when you update the imap folder until you get the folder loaded
> notification? The latter may involve a ton of network traffic while we sync all
> the flags.

Per discussion with you, gloda doesn't call UpdateFolder anymore.  In the case of IMAP, we just call GetMsgDatabase.  I feel like that was the slow guy, but I don't actually know why he was slow, just that he was slow.  If he wasn't the slow guy, then it was getTotalMessages on the folder, but I think it was the former.  I wrote this code and dealt with this a while ago now, so I can't say for sure from memory.

Does GetMsgDatabase do all the network traffic before returning?  That is something I do need for the indexing-a-potentially-duplicate-message logic; when gloda sees a message with a duplicate message-id, it wants to check whether the previously seen copies still exist at their stated locations.  (We need to know if the message moved or is just a duplicate.  Right now that logic just checks whether there is a message with that key in the other folder, and whether the returned message has the same message-id.)
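For concreteness, that check reduces to something like this (all names illustrative, not gloda's actual API; a folder is modeled as a key-to-message-id map):

#include <cstdint>
#include <string>
#include <unordered_map>

using FolderIndex = std::unordered_map<uint32_t, std::string>;  // key -> message-id

// If the previously recorded (folder, key) still holds a message with the
// same message-id, the new sighting is a duplicate; otherwise the message
// probably moved.
bool StillAtRecordedLocation(const FolderIndex &folder, uint32_t key,
                             const std::string &messageId)
{
  auto it = folder.find(key);
  return it != folder.end() && it->second == messageId;
}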
When in offline mode, I get multi-second delays dealing w/ All Mail, so I don't think it's IMAP.
(In reply to comment #20)
> 
> Per discussion with you, gloda doesn't call UpdateFolder anymore.

ah, I remember that now. I saw "folderLoaded" in the method name and wondered if you were waiting for that notification.


> In the case of IMAP, we just call GetMsgDatabase.  I feel like that was the slow guy,

If you have a lot of messages (in the high five figures, or six figures), it can take a couple seconds on a slow machine to load the .msf file.

> Does GetMsgDatabase do all the network traffic before returning? 

No, none at all - it's synchronous, and we don't do any networking synchronously on the ui thread.
(In reply to comment #21)
> When in offline mode, I get multi-second delays dealing w/ All Mail, so I don't
> think it's IMAP.

It's useful to distinguish between just opening the db, and opening a folder in the UI; the latter does a lot of stuff on top of opening the database.
(In reply to comment #23)
> (In reply to comment #21)
> > When in offline mode, I get multi-second delays dealing w/ All Mail, so I don't
> > think it's IMAP.
> 
> It's useful to distinguish between just opening the db, and opening a folder in
> the UI; the latter does a lot of stuff on top of opening the database.

Right.  I'm actually noticing the delay with the beachball coincident with gloda mentioning All Mail, when I'm actually in the Inbox.  So it's not opening the folder, although it could be something other than just opening the db.
Another possible factor is that the .msf file is even bigger than it needs to be, i.e., we haven't done a compress commit on it, which gets rid of the change logs at the end of the file.  How many messages are in all mail, and how big is the .msf file? I've heard reports of large .msf files with small folders, so huge msf files with large folders are also a possibility.
(In reply to comment #25)
> Another possible factor is that the .msf file is even bigger than it needs to
> be, i.e., we haven't done a compress commit on it, which gets rid of the change
> logs at the end of the file.  How many messages are in all mail, and how big is
> the .msf file? I've heard reports of large .msf files with small folders, so
> huge msf files with large folders are also a possibility.

3.3G	imap.googlemail-37.com/[Gmail].sbd/All Mail
 60M	imap.googlemail-37.com/[Gmail].sbd/All Mail.msf

FYI
For how many messages? There is roughly a fixed overhead per message in the .msf file, after we've done a compress commit. For example, I have 1200 messages - after doing a compress commit, it was 670K, so roughly 550 bytes per message. If you have a 60M .msf file, that would imply you have roughly 100K messages in your All Mail folder.
yup.  93197.
Note that I don't mean for this bug to become about my all mail folder.  =)  I've resigned myself to having that pause every now and then.
one problem (OT) worth noting: Gmail thinks I have 42870 messages.  IOW, I think that many of my mass deletions on gmail haven't been synchronized (yet?).
Heh, I just want to make sure nothing unexpected is happening - we know we don't handle folders that large very well, but I wouldn't want a 10 or 20K folder to have the same problem.

I'm not sure how Gmail exposes All Mail to IMAP, but I have 6000 messages in TB, and 1637 in Gmail, so there's something unexpected going on there. I'm not a heavy Gmail user, so I don't think it's a synchronization delay.  I don't mean to sound biased, but I doubt it's a problem in our IMAP support either - we do a full sync every time you open the folder.
requested actions taken.

pushed: http://hg.mozilla.org/comm-central/rev/a44909e6dcad
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Whiteboard: [no l10n impact][has reviews; needs updated patch] → [no l10n impact]
Depends on: 96669
Depends on: 549238