Closed Bug 235086 Opened 16 years ago Closed 11 years ago

add a bit of jitter to the biff-interval

Categories

(MailNews Core :: Backend, defect, minor)

defect
Not set
minor

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b1

People

(Reporter: jo.hermans, Assigned: mkmelin)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 5 obsolete files)

At my work (I work for Alcatel), I was explaining the importance of adding a bit
of 'jitter' of every timer-process, to make sure processes don't start to
synchronize. As an example, I mentioned 1000 mail-clients that are are trying to
 poll the same mailserver, by using a fixed timer (let's say 10 minutes). Even
when they get started at random intervals, all clients will seem to synchronize
on eachother, until all of them are polling the mail server at the same time,
causing a huge spike in the load. This phenomenon can be easily defeated by
adding a bit of jitter to each timer, for example by adding +/- 5% to the
normally configured interval.

To my surprise, one of our mail admins claimed that this was exactly the case
with many mail-clients, including Mozilla and Thunderbird. He showed us his logs
(which I'm not allowed to reproduce here, sorry), and we could clearly see that
the load-level of the mailserver is always flat in the morning, but shows huge
spikes in the evening (most clients would be running for 8 hours with a fixed 10
minute timer), or the next morning (when the pc's are left running thru the
night). And it was even more clearly to see on a friday, because many people
leave their pc's running all week, but are forced to reboot during the weekend.

I looked in Mozilla's code, and I found nsMsgBiffManager::SetNextBiffTime().
That clearly uses a fixed timer (created from mail.serverX.check_time), and
never tries to add a bit of jitter.

The solution might be to replace :

biffEntry->nextBiffTime = startTime;
biffEntry->nextBiffTime += chosenTimeInterval;

by :

// calculate a jitter of +/-5% on biffInterval
// minimum 1 second (to avoid a modulo with 0)
// maximum 30 seconds (to avoid problems when biffInterval is very large)
nsInt64 jitter = (int)(chosenTimeInterval * 0.05);
jitter = MAX(1000000, MIN(30000000, jitter));
jitter = ((rand() % 2) ? 1 : -1) * (rand() % jitter);
biffEntry->nextBiffTime = startTime;
biffEntry->nextBiffTime += chosenTimeInterval + jitter;

Note that this calculation uses microseconds, so it needs nsInt64. Since the
default check_time is 10 minutes, this should result in a jitter of +/- 30
seconds. Even if we have 1000 mail-clients synchronized together, the peak would
still be spread over a period of 1 minute, which is on average 16.67 polls per
second. Much less.

This sounds like a very trivial issue, but it becomes very important in very
large scale environments. I routinely work with tens of thousands to many
millions of devices, which need to perform some work triggered by a timer. We
always take care of this jitter, otherwise the whole networks goes down. Please,
don't take this matter lightly. Every admin of a large scale installation would
be grateful.

Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7a) Gecko/20040203
I haven't found any duplicates of this.
Attached patch Patch v1 proposal (obsolete) — Splinter Review
Sounds like a good idea actually.

My proposal though is to make this a boolean preference, set false by default. 
So that people who have a 1 minute check or some small interval as most do,
don't get effected.  But in the workplace, it would be necessary to deploy as
set to True, so as you said, the burdon on infrustructure is kept to a minimum.


I think it's a minority that will actually make use of such a feature, but that
minority is 10's of thousands of computers, hence it's a worthwhile proposal to
investigate.

Anyhow... regarding the patch specifically.

This is my first C++ patch, and I'm still not to comfortable with Mozilla's
source ;)  It scares me.  Much less playing with preferences.  My system is
still not setup to compile at this time (though hopefully soon enough I will
get it going).	So this is untested.  

Anyway... 

Take a look, I'd bet it needs revising.
Assignee: sspitzer → robert
Status: NEW → ASSIGNED
Comment on attachment 141864 [details] [diff] [review]
Patch v1 proposal

David --> can you take a look at this?
Attachment #141864 - Flags: review?(bienvenu)
(In reply to comment #1)
> My proposal though is to make this a boolean preference, set false by default. 
> So that people who have a 1 minute check or some small interval as most do,
> don't get effected.  But in the workplace, it would be necessary to deploy as
> set to True, so as you said, the burdon on infrustructure is kept to a minimum.

That's why I used a 5% interval - it would translate to a 1 sec jitter. That
wouldn't affect them hardly (although they should use jitter too, but they've
already done enough damage with their 1 minute polling). The MAX macro was just
to limit it to a reasonable value, if you a biff interval of several hours or so.

> I think it's a minority that will actually make use of such a feature, but that
> minority is 10's of thousands of computers, hence it's a worthwhile proposal to
> investigate. 

The algorithm I used actually comes directly from several of the applications I
wrote and maintain. The point is that is should work for everybody, without
configuration. We already have enough preference-bloat. The nigthmare of an
admin is exactly that people aren't using the correct settings. I would prefer
that it's switched on by default (or can't be switched off), but so that's it's
hardly noticable : 1 minute pollers would see +/- 1 sec jitter (hardly
noticable), 10 minute pollers (default) would use +/ 30 sec (equally, hardly
noticable). Ofcourse, the exact values can be changed, these are just a proposal.
Most commercial ISP mail servers have a limit in which one can check for mail. 
The majority of mail users will go right up against it.  I've seen quite a few
at 10, some at 5, some at 1.

Go just one second early, and it gives you an error.  In a few more strict
environments, it will make you wait an additional penalty because you violated
the first time.

So if a user has a 10 minute limit, and your patch goes -1 second.  They get an
error.  This will be a regression.

In your environment, this is necessary to have that jitter, and you control the
mail server, and ensure this won't happen.

But for the vast majority of users, this would be a regression.

I don't think it would be wise to enable this by default, as it will bother to
many users.  And all of them won't need this ability.  It's strictly a large
company issue (and rightfully so, as you mentioned, it makes a HUGE difference).

But I don't see how we can justify causing problems for users, just to help
corporate users.  Especially when those who need this (large corporations), have
the resources to modify a 1 line preference.

Just seems to risky IMHO to do.  Ultimately it's up to a real MailNews person
like mscott or dbienvenu to decide.

CC--> Mscott
(In reply to comment #4)
> Most commercial ISP mail servers have a limit in which one can check for mail. 
> The majority of mail users will go right up against it.  I've seen quite a few
> at 10, some at 5, some at 1.
> 
> Go just one second early, and it gives you an error.  In a few more strict
> environments, it will make you wait an additional penalty because you violated
> the first time.
> 

I haven't heard about an ISP who is that strict. If this is really the case,
then a user wouldn't even be allowed to click on the 'Get Mail' button manually
a few seconds before the normal biff-interval, because the ISP would break the
connection !?!

To be fair, I programmed quite a few security systems myself (for some very
large ISP's who are hunting after spambots), but none of them would trigger on a
single message (POP, IMAP or any other protocol). The problem is that it's just
very common for a user to get his/hers mail by clicking a few times on the
button (most users don't even know the difference between single-click and
double-click).
 
It it really bothers somebody, let's change it to :

biffEntry->nextBiffTime += chosenTimeInterval + (rand() % 5)

or something similar.
ISP's are becoming more and more strict these days.  I know I used to have an
email account that was 5 minutes exactly.  I found one of those webmail services
that you could leave open and it would check on interval.  I set it to 5
obviously.  It apparantly had a jitter to it.  I didn't know that of course,
this was a while back.  Didn't pass the error on, just said connection was denied.

Lots of servers are strict on this.  It's a cheap easy way of preventing the
same problem your out to prevent.  Servers being overloaded.  By limiting users
to 1 check ever X minutes, they can control the load a bit.

Your looking to control it by distribution on the client side.

On a second thought, might be worth making it a user definable jitter.  Perhaps
I should updated that patch?

A Jitter on a borderline case will result in a error with a strict mail server.
 Thanks to viruses, and the economy, ISP's are often running looking for ways to
limit the burdon on their servers, and lower costs.  Limiting is a great way to
do so (from a strict Business perspective).

The UI states:  Check for new mail every ___ seconds. 
not
Check for new mail, aproximately every ___ seconds.


I personally see to much of a risk to compromise the reliablility of Thunderbird
by changing this behavior.

I don't think it would be good to be telling users, if checking at the interval
your ISP allows, add 1 minute because Thunderbird isn't exact.  That's just bad
marketing.

Granted, my bias I have a bit of a growing Business MIS background.  But I'm
still not seeing a good reason to make this a default.  Most likely 95%+ of
installations don't need this feature.  And a good number of them may be
negatively effected by this patch.  A large company can which would need such an
option can do so by either repackaging the Thunderbird client, modifying the
preference, or any other deployment tactic that the company uses.

Ultimately it comes down to giving options to those who need it, without harming
others.  Ideally, Mozilla would do many things to be more workplace savvy.  But
we can't forget our end users.  

Someone is going to have to deal with the fallout of a patch like this.  I'd say
the business who needs this feature, since they choose it, and are capable of
dealing with it.  Not someone who just wants to have their home computer check
their email every few minutes with no fuss.

This is what IT departments are for.
To add:

An ISP normally doesn't start the lockout timer until the connection is closed
or a similar trigger, so that if the user connects during an already open
connection (double clicking for example), it won't have a negative effect.  Or
else their tech support lines would be off the hook.
Attached patch Patch v2 (obsolete) — Splinter Review
Addresses copy/paste error pointed out by David Bienvenu.  

Decided to leave as a pref.

It's possible in the future to have it only add jitter, rather than +/-. 
Though that would require a larger jitter threshhold, to keep same
effectiveness, and make biff interval even longer potentially.

But it's a minority that nead such a feature.  One can update the user.js file.
Attachment #141864 - Attachment is obsolete: true
Comment on attachment 141911 [details] [diff] [review]
Patch v2

looks good, but can you fix the braces style to be 

if (a)
{
 ...
}
else
{
...
}
which is the prevalent style? thx!
May I suggest a compromise whereby
  - there is no pref (comment 3)
  - the jitter is always nonnegative (comment 4)
  - the jitter is limited to less than half a minute, so as to keep the biff
    interval within the precision indicated by the corresponding UI (minutes)
    (comment 6)
Attached patch Patch v3 (obsolete) — Splinter Review
Patch v3 addresses David's comments.

David:	I still can't compile, getting make errors, and haven't had a chance to
site down with it (spring break at home, so perhaps I will next week), so
please take a good look before the tree burns.

Regarding Comment 10, it's not necessary to do this all the time.  Any
situation where this feature is needed, can enable globally as per:
http://www.mozilla.org/catalog/end-user/customizing/briefprefs.html#modifying

Any organization with enough installs to need this feature, is going to have
the capability to perform this task, and will be doing it anyway for other
tweaks.  So having it as a default isn't needed.
Attachment #141911 - Attachment is obsolete: true
Attachment #143552 - Flags: review?(bienvenu)
Attachment #143552 - Flags: review?(bienvenu) → review+
Comment on attachment 143552 [details] [diff] [review]
Patch v3

Asking mscott for a sr, although this most likely won't get in for a while,
since the tree is closed.
Attachment #143552 - Flags: superreview?(mscott)
(In reply to comment #11)
Well, I was thinking along the lines of this perhaps being something of a
correctness issue that would conceivably affect also large-scale mail services
without centralised pref control (e.g. ISP <--> users). But since that's just
conjecture, and the patch reviewed, n/m
I'm always one to welcome debate, since that's what ultimately brings about the
best results.  

My theory is that large scale facilities like ISP's can tolerate this.  They
also don't have as many persisitant users, as most email users aren't online all
day (minus myself).  They tend to be online for short periods, and leave.  

In the office, Email is open from at least 9-5.  If not 24x7 when the employee
leaves for the day.  

ISP's also tend to have more servers in their pool than a business has, and more
resources.  

ISP's also take the practice of limiting how often one can check.  For example,
only once every 5 minutes.  This reduces the amounts of hits at the server/day
the server takes.  Businesses tend not to do this.

ISP's don't have direct control, but habits of the end user are very different.
Comment on attachment 143552 [details] [diff] [review]
Patch v3

I don't see a default value for mail.biff.jitter going into mailnews.js. I
think we should add that with a value of false.

I also wonder if we shouldn't call it:

mail.biff.add_interval_jitter
(In reply to comment #15)
> (From update of attachment 143552 [details] [diff] [review])
> I also wonder if we shouldn't call it:
> 
> mail.biff.add_interval_jitter
> 
Should?  or shouldn't?  right nwo I'm calling it:
mail.biff.jitter
Attached patch Patch v4 (obsolete) — Splinter Review
Addressing Scott's comments.  Easy enough.
Attachment #143552 - Attachment is obsolete: true
Attachment #144009 - Flags: superreview?(mscott)
Comment on attachment 144009 [details] [diff] [review]
Patch v4

your change to mailnews.js shows some of my unchecked in junk mail pref changes
:)

since you'll have someone else actually check this in, how about a new patch
without that. 

I'll carry over r=bienvenu, sr=mscott on the new patch
Attachment #144009 - Flags: superreview?(mscott) → superreview-
Attached patch Patch v4.1 (obsolete) — Splinter Review
Address Scott's comments
Attachment #144009 - Attachment is obsolete: true
Comment on attachment 144012 [details] [diff] [review]
Patch v4.1

great. carrying forward the r and sr.
Attachment #144012 - Flags: superreview+
Attachment #144012 - Flags: review+
Comment on attachment 141864 [details] [diff] [review]
Patch v1 proposal

patch is obsolete
Attachment #141864 - Flags: review?(bienvenu)
Comment on attachment 143552 [details] [diff] [review]
Patch v3

patch is obsolete
Attachment #143552 - Flags: superreview?(mscott)
I'm starting to have second thoughts about this patch. I fail to see how adding
jitter would alter server load on biff notifications. Is there not enough
entropy with regards to when users start up the client? I find it hard to
believe that everyone launches the app at the exact second as a bunch of other
users in a company. Isn't there enough entropy on when the user starts the
client as us adding a random 30 second variable to the biff notification?
well, the original report says "Even when they get started at random intervals,
all clients will seem to synchronize on eachother, until all of them are polling
the mail server at the same time".

Assuming there's nothing magical about this, it must indicate that the intervals
aren't regular currently.  Is the timer for the next interval counting from the
last time the timer fired, or from when the last check completed?  If it's
counting from the completion of the previous check, one can see a potential for
lots of clients falling into sync, as the interval will to some extent depend on
the server response, and won't be based on the "random" start time of the client.
Well, I've heard of this before.  Here's the explanation I heard a while back:

Say you have 1000 clients, and a server capable of 100 concurent connections. 
That's more than capable, since you only connect briefly at an interval.  you
don't need the capacity to allow all at once.  

Now when 1 makes a connection, it's very fast, since 100% CPU, and disk access
is devoted to the 1 client.  But that's not always the case, since clients
aren't aware if another is connected.

So you can get clusters, just by luck of the draw with large scale operations
(especially after an outage and everyone is waiting to reconnect).  Think about
it, when do most people open their email?  Most likely between 9 and 9:10 AM. 
Out of 1000 employees, what are the odds a few will poll at the same time?

So when this happens in groups, the server slows, making the connection take
longer.  Then more connections come on, as their biff just fired.  Slowing it
down more.

Eventually what you get is a bunch of clients lumping together, because they all
got stopped at the same time.


Ever go on a highway when they do pothole repair in the north east?  Cop goes a
few miles up the road, and drives 10MPH on a 65MPH road, slowing traffic, so the
guys can run out, quickly patch it.. that way they don't close an entire lane.

Each ar went on the road at a different time, and different speed, but because
each car had to slow down because of the ones before it, they lump up.  Next 20
minutes or so it feels like heavy traffic, even at 2:00 PM on a thursday.  


It's a snowballing effect.
My daytime job is for Alcatel, where I'm the RADIUS expert for our
authentication servers, which are used in a very large share of all access
networks over the world (often entire countries). The explanantion of Robert is
correct, it's the queueing inside the server that causes this phenomenom. It's
very well known in these environments, especially when you have to deal with
hundreds of packets per second.

I'm used to read remarks in RFC's about adding a bit of random delay to all
timers, to prevent this synchronous behaviour. The problem is that not all
clients (hardware and software) obey this law. For example, reboot an access
server, and you'll see that all 50.000 users will auto-login at exactly the same
time. Give all of them an ipaddress that is valid for 60 minutes, and they all
come back exactly 1 hour later. So, we often ADD a bit of random delay before
accepting an authentication (or add it to the expiration time of the ipaddress),
to spread the load a bit.

The larger the number of clients and server, the worse this becomes. Just ask
any authentication-admin or mail-admin of a large corporation (eh ... Netscape ?).

I noticed that I made the code a bit too complicated for this case (we use 2
different random number generators, that's why you see 2 rand's inside the
formula), so this might be a built easier to understand (from the patch) :
+    jitter = -jitter + (rand() % (2 * jitter));

You can also make it a positive only delay (comment 4) :
+    jitter = rand() % jitter;

The 5% rule was just an attempt to calculate a delay that was small for someone
with a small timer, and large for someone with a large one. The same thing with
the MAX operation (30 seconds here). It can easily be changed. A few seconds
might be enough for most users (10 minute default polling), a user with a 1
minute timer probably wouldn't want to see any difference. The point of this
algorithm was that it would be fair for everyone, so that it could be used by
default. Nobody is going to complain that their 10-minute timer is going to fire
a few seconds earlier or later. If you're really afraid about that, change the
maximum value of 4 seconds or so. What's the big deal about that ?




(In reply to comment #26)
> For example, reboot an access server, and you'll see that all 50.000 users 
> will auto-login at exactly the same time. 

I've seen this when AOL Instant Messenger has server problems.  Takes hours for
everyone to be able to get back online.
Scott,

Since 1.7 is the long life branch.  Will we be making this part?  As is? 
Modifications?
Product: MailNews → Core
Any change this will be used in Thunderbird 3 ? I'm not asking for a blocking? flag yet.
this just fell off the checkin radar - I'll check if the patch still applies, unless Robert wants to do that...
David, Robert apparently not interested.
Severity: trivial → minor
Keywords: checkin-needed
QA Contact: esther → backend
(In reply to comment #30)
> this just fell off the checkin radar - I'll check if the patch still applies,
> unless Robert wants to do that...
> 
The patch doesn't apply.
Flags: wanted-thunderbird3?
Keywords: checkin-needed
Whiteboard: [patch bitrotted]
Product: Core → MailNews Core
Attached patch proposed fixSplinter Review
Somehow I doubt the previous (reviewed) patch ever compiled.

Here's an unbitrotted version though. I set the pref to true by default.
Assignee: robert → mkmelin+mozilla
Attachment #144012 - Attachment is obsolete: true
Attachment #344354 - Flags: superreview?(bienvenu)
Attachment #344354 - Flags: review?(bienvenu)
Flags: wanted-thunderbird3? → wanted-thunderbird3+
Whiteboard: [patch bitrotted]
Comment on attachment 344354 [details] [diff] [review]
proposed fix

+  // Check if we should jitter.
+  nsCOMPtr<nsIPrefBranch> prefs = do_GetService(NS_PREFSERVICE_CONTRACTID, &rv);
+  if (NS_SUCCEEDED(rv) && prefs)

You don't need to check both - I'd just suggest checking prefs.
Attachment #344354 - Flags: superreview?(bienvenu)
Attachment #344354 - Flags: superreview+
Attachment #344354 - Flags: review?(bienvenu)
Attachment #344354 - Flags: review+
changeset:   811:17945ea26634
http://hg.mozilla.org/comm-central/rev/17945ea26634

->FIXED
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 3.0b1
(With nit fixed, and the jitter declaration moved down a few lines.)
Depends on: 488979
You need to log in before you can comment on or make changes to this bug.