Closed Bug 235086 Opened 21 years ago Closed 16 years ago

add a bit of jitter to the biff-interval

Categories

(MailNews Core :: Backend, defect)

defect
Not set
minor

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b1

People

(Reporter: jo.hermans, Assigned: mkmelin)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 5 obsolete files)

At my work (I work for Alcatel), I was explaining the importance of adding a bit of 'jitter' of every timer-process, to make sure processes don't start to synchronize. As an example, I mentioned 1000 mail-clients that are are trying to poll the same mailserver, by using a fixed timer (let's say 10 minutes). Even when they get started at random intervals, all clients will seem to synchronize on eachother, until all of them are polling the mail server at the same time, causing a huge spike in the load. This phenomenon can be easily defeated by adding a bit of jitter to each timer, for example by adding +/- 5% to the normally configured interval. To my surprise, one of our mail admins claimed that this was exactly the case with many mail-clients, including Mozilla and Thunderbird. He showed us his logs (which I'm not allowed to reproduce here, sorry), and we could clearly see that the load-level of the mailserver is always flat in the morning, but shows huge spikes in the evening (most clients would be running for 8 hours with a fixed 10 minute timer), or the next morning (when the pc's are left running thru the night). And it was even more clearly to see on a friday, because many people leave their pc's running all week, but are forced to reboot during the weekend. I looked in Mozilla's code, and I found nsMsgBiffManager::SetNextBiffTime(). That clearly uses a fixed timer (created from mail.serverX.check_time), and never tries to add a bit of jitter. The solution might be to replace : biffEntry->nextBiffTime = startTime; biffEntry->nextBiffTime += chosenTimeInterval; by : // calculate a jitter of +/-5% on biffInterval // minimum 1 second (to avoid a modulo with 0) // maximum 30 seconds (to avoid problems when biffInterval is very large) nsInt64 jitter = (int)(chosenTimeInterval * 0.05); jitter = MAX(1000000, MIN(30000000, jitter)); jitter = ((rand() % 2) ? 1 : -1) * (rand() % jitter); biffEntry->nextBiffTime = startTime; biffEntry->nextBiffTime += chosenTimeInterval + jitter; Note that this calculation uses microseconds, so it needs nsInt64. Since the default check_time is 10 minutes, this should result in a jitter of +/- 30 seconds. Even if we have 1000 mail-clients synchronized together, the peak would still be spread over a period of 1 minute, which is on average 16.67 polls per second. Much less. This sounds like a very trivial issue, but it becomes very important in very large scale environments. I routinely work with tens of thousands to many millions of devices, which need to perform some work triggered by a timer. We always take care of this jitter, otherwise the whole networks goes down. Please, don't take this matter lightly. Every admin of a large scale installation would be grateful. Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7a) Gecko/20040203 I haven't found any duplicates of this.
Attached patch Patch v1 proposal (obsolete) — Splinter Review
Sounds like a good idea actually. My proposal though is to make this a boolean preference, set false by default. So that people who have a 1 minute check or some small interval as most do, don't get effected. But in the workplace, it would be necessary to deploy as set to True, so as you said, the burdon on infrustructure is kept to a minimum. I think it's a minority that will actually make use of such a feature, but that minority is 10's of thousands of computers, hence it's a worthwhile proposal to investigate. Anyhow... regarding the patch specifically. This is my first C++ patch, and I'm still not to comfortable with Mozilla's source ;) It scares me. Much less playing with preferences. My system is still not setup to compile at this time (though hopefully soon enough I will get it going). So this is untested. Anyway... Take a look, I'd bet it needs revising.
Assignee: sspitzer → robert
Status: NEW → ASSIGNED
Comment on attachment 141864 [details] [diff] [review] Patch v1 proposal David --> can you take a look at this?
Attachment #141864 - Flags: review?(bienvenu)
(In reply to comment #1) > My proposal though is to make this a boolean preference, set false by default. > So that people who have a 1 minute check or some small interval as most do, > don't get effected. But in the workplace, it would be necessary to deploy as > set to True, so as you said, the burdon on infrustructure is kept to a minimum. That's why I used a 5% interval - it would translate to a 1 sec jitter. That wouldn't affect them hardly (although they should use jitter too, but they've already done enough damage with their 1 minute polling). The MAX macro was just to limit it to a reasonable value, if you a biff interval of several hours or so. > I think it's a minority that will actually make use of such a feature, but that > minority is 10's of thousands of computers, hence it's a worthwhile proposal to > investigate. The algorithm I used actually comes directly from several of the applications I wrote and maintain. The point is that is should work for everybody, without configuration. We already have enough preference-bloat. The nigthmare of an admin is exactly that people aren't using the correct settings. I would prefer that it's switched on by default (or can't be switched off), but so that's it's hardly noticable : 1 minute pollers would see +/- 1 sec jitter (hardly noticable), 10 minute pollers (default) would use +/ 30 sec (equally, hardly noticable). Ofcourse, the exact values can be changed, these are just a proposal.
Most commercial ISP mail servers have a limit in which one can check for mail. The majority of mail users will go right up against it. I've seen quite a few at 10, some at 5, some at 1. Go just one second early, and it gives you an error. In a few more strict environments, it will make you wait an additional penalty because you violated the first time. So if a user has a 10 minute limit, and your patch goes -1 second. They get an error. This will be a regression. In your environment, this is necessary to have that jitter, and you control the mail server, and ensure this won't happen. But for the vast majority of users, this would be a regression. I don't think it would be wise to enable this by default, as it will bother to many users. And all of them won't need this ability. It's strictly a large company issue (and rightfully so, as you mentioned, it makes a HUGE difference). But I don't see how we can justify causing problems for users, just to help corporate users. Especially when those who need this (large corporations), have the resources to modify a 1 line preference. Just seems to risky IMHO to do. Ultimately it's up to a real MailNews person like mscott or dbienvenu to decide. CC--> Mscott
(In reply to comment #4) > Most commercial ISP mail servers have a limit in which one can check for mail. > The majority of mail users will go right up against it. I've seen quite a few > at 10, some at 5, some at 1. > > Go just one second early, and it gives you an error. In a few more strict > environments, it will make you wait an additional penalty because you violated > the first time. > I haven't heard about an ISP who is that strict. If this is really the case, then a user wouldn't even be allowed to click on the 'Get Mail' button manually a few seconds before the normal biff-interval, because the ISP would break the connection !?! To be fair, I programmed quite a few security systems myself (for some very large ISP's who are hunting after spambots), but none of them would trigger on a single message (POP, IMAP or any other protocol). The problem is that it's just very common for a user to get his/hers mail by clicking a few times on the button (most users don't even know the difference between single-click and double-click). It it really bothers somebody, let's change it to : biffEntry->nextBiffTime += chosenTimeInterval + (rand() % 5) or something similar.
ISP's are becoming more and more strict these days. I know I used to have an email account that was 5 minutes exactly. I found one of those webmail services that you could leave open and it would check on interval. I set it to 5 obviously. It apparantly had a jitter to it. I didn't know that of course, this was a while back. Didn't pass the error on, just said connection was denied. Lots of servers are strict on this. It's a cheap easy way of preventing the same problem your out to prevent. Servers being overloaded. By limiting users to 1 check ever X minutes, they can control the load a bit. Your looking to control it by distribution on the client side. On a second thought, might be worth making it a user definable jitter. Perhaps I should updated that patch? A Jitter on a borderline case will result in a error with a strict mail server. Thanks to viruses, and the economy, ISP's are often running looking for ways to limit the burdon on their servers, and lower costs. Limiting is a great way to do so (from a strict Business perspective). The UI states: Check for new mail every ___ seconds. not Check for new mail, aproximately every ___ seconds. I personally see to much of a risk to compromise the reliablility of Thunderbird by changing this behavior. I don't think it would be good to be telling users, if checking at the interval your ISP allows, add 1 minute because Thunderbird isn't exact. That's just bad marketing. Granted, my bias I have a bit of a growing Business MIS background. But I'm still not seeing a good reason to make this a default. Most likely 95%+ of installations don't need this feature. And a good number of them may be negatively effected by this patch. A large company can which would need such an option can do so by either repackaging the Thunderbird client, modifying the preference, or any other deployment tactic that the company uses. Ultimately it comes down to giving options to those who need it, without harming others. Ideally, Mozilla would do many things to be more workplace savvy. But we can't forget our end users. Someone is going to have to deal with the fallout of a patch like this. I'd say the business who needs this feature, since they choose it, and are capable of dealing with it. Not someone who just wants to have their home computer check their email every few minutes with no fuss. This is what IT departments are for.
To add: An ISP normally doesn't start the lockout timer until the connection is closed or a similar trigger, so that if the user connects during an already open connection (double clicking for example), it won't have a negative effect. Or else their tech support lines would be off the hook.
Attached patch Patch v2 (obsolete) — Splinter Review
Addresses copy/paste error pointed out by David Bienvenu. Decided to leave as a pref. It's possible in the future to have it only add jitter, rather than +/-. Though that would require a larger jitter threshhold, to keep same effectiveness, and make biff interval even longer potentially. But it's a minority that nead such a feature. One can update the user.js file.
Attachment #141864 - Attachment is obsolete: true
Comment on attachment 141911 [details] [diff] [review] Patch v2 looks good, but can you fix the braces style to be if (a) { ... } else { ... } which is the prevalent style? thx!
May I suggest a compromise whereby - there is no pref (comment 3) - the jitter is always nonnegative (comment 4) - the jitter is limited to less than half a minute, so as to keep the biff interval within the precision indicated by the corresponding UI (minutes) (comment 6)
Attached patch Patch v3 (obsolete) — Splinter Review
Patch v3 addresses David's comments. David: I still can't compile, getting make errors, and haven't had a chance to site down with it (spring break at home, so perhaps I will next week), so please take a good look before the tree burns. Regarding Comment 10, it's not necessary to do this all the time. Any situation where this feature is needed, can enable globally as per: http://www.mozilla.org/catalog/end-user/customizing/briefprefs.html#modifying Any organization with enough installs to need this feature, is going to have the capability to perform this task, and will be doing it anyway for other tweaks. So having it as a default isn't needed.
Attachment #141911 - Attachment is obsolete: true
Attachment #143552 - Flags: review?(bienvenu)
Attachment #143552 - Flags: review?(bienvenu) → review+
Comment on attachment 143552 [details] [diff] [review] Patch v3 Asking mscott for a sr, although this most likely won't get in for a while, since the tree is closed.
Attachment #143552 - Flags: superreview?(mscott)
(In reply to comment #11) Well, I was thinking along the lines of this perhaps being something of a correctness issue that would conceivably affect also large-scale mail services without centralised pref control (e.g. ISP <--> users). But since that's just conjecture, and the patch reviewed, n/m
I'm always one to welcome debate, since that's what ultimately brings about the best results. My theory is that large scale facilities like ISP's can tolerate this. They also don't have as many persisitant users, as most email users aren't online all day (minus myself). They tend to be online for short periods, and leave. In the office, Email is open from at least 9-5. If not 24x7 when the employee leaves for the day. ISP's also tend to have more servers in their pool than a business has, and more resources. ISP's also take the practice of limiting how often one can check. For example, only once every 5 minutes. This reduces the amounts of hits at the server/day the server takes. Businesses tend not to do this. ISP's don't have direct control, but habits of the end user are very different.
Comment on attachment 143552 [details] [diff] [review] Patch v3 I don't see a default value for mail.biff.jitter going into mailnews.js. I think we should add that with a value of false. I also wonder if we shouldn't call it: mail.biff.add_interval_jitter
(In reply to comment #15) > (From update of attachment 143552 [details] [diff] [review]) > I also wonder if we shouldn't call it: > > mail.biff.add_interval_jitter > Should? or shouldn't? right nwo I'm calling it: mail.biff.jitter
Attached patch Patch v4 (obsolete) — Splinter Review
Addressing Scott's comments. Easy enough.
Attachment #143552 - Attachment is obsolete: true
Attachment #144009 - Flags: superreview?(mscott)
Comment on attachment 144009 [details] [diff] [review] Patch v4 your change to mailnews.js shows some of my unchecked in junk mail pref changes :) since you'll have someone else actually check this in, how about a new patch without that. I'll carry over r=bienvenu, sr=mscott on the new patch
Attachment #144009 - Flags: superreview?(mscott) → superreview-
Attached patch Patch v4.1 (obsolete) — Splinter Review
Address Scott's comments
Attachment #144009 - Attachment is obsolete: true
Comment on attachment 144012 [details] [diff] [review] Patch v4.1 great. carrying forward the r and sr.
Attachment #144012 - Flags: superreview+
Attachment #144012 - Flags: review+
Comment on attachment 141864 [details] [diff] [review] Patch v1 proposal patch is obsolete
Attachment #141864 - Flags: review?(bienvenu)
Comment on attachment 143552 [details] [diff] [review] Patch v3 patch is obsolete
Attachment #143552 - Flags: superreview?(mscott)
I'm starting to have second thoughts about this patch. I fail to see how adding jitter would alter server load on biff notifications. Is there not enough entropy with regards to when users start up the client? I find it hard to believe that everyone launches the app at the exact second as a bunch of other users in a company. Isn't there enough entropy on when the user starts the client as us adding a random 30 second variable to the biff notification?
well, the original report says "Even when they get started at random intervals, all clients will seem to synchronize on eachother, until all of them are polling the mail server at the same time". Assuming there's nothing magical about this, it must indicate that the intervals aren't regular currently. Is the timer for the next interval counting from the last time the timer fired, or from when the last check completed? If it's counting from the completion of the previous check, one can see a potential for lots of clients falling into sync, as the interval will to some extent depend on the server response, and won't be based on the "random" start time of the client.
Well, I've heard of this before. Here's the explanation I heard a while back: Say you have 1000 clients, and a server capable of 100 concurent connections. That's more than capable, since you only connect briefly at an interval. you don't need the capacity to allow all at once. Now when 1 makes a connection, it's very fast, since 100% CPU, and disk access is devoted to the 1 client. But that's not always the case, since clients aren't aware if another is connected. So you can get clusters, just by luck of the draw with large scale operations (especially after an outage and everyone is waiting to reconnect). Think about it, when do most people open their email? Most likely between 9 and 9:10 AM. Out of 1000 employees, what are the odds a few will poll at the same time? So when this happens in groups, the server slows, making the connection take longer. Then more connections come on, as their biff just fired. Slowing it down more. Eventually what you get is a bunch of clients lumping together, because they all got stopped at the same time. Ever go on a highway when they do pothole repair in the north east? Cop goes a few miles up the road, and drives 10MPH on a 65MPH road, slowing traffic, so the guys can run out, quickly patch it.. that way they don't close an entire lane. Each ar went on the road at a different time, and different speed, but because each car had to slow down because of the ones before it, they lump up. Next 20 minutes or so it feels like heavy traffic, even at 2:00 PM on a thursday. It's a snowballing effect.
My daytime job is for Alcatel, where I'm the RADIUS expert for our authentication servers, which are used in a very large share of all access networks over the world (often entire countries). The explanantion of Robert is correct, it's the queueing inside the server that causes this phenomenom. It's very well known in these environments, especially when you have to deal with hundreds of packets per second. I'm used to read remarks in RFC's about adding a bit of random delay to all timers, to prevent this synchronous behaviour. The problem is that not all clients (hardware and software) obey this law. For example, reboot an access server, and you'll see that all 50.000 users will auto-login at exactly the same time. Give all of them an ipaddress that is valid for 60 minutes, and they all come back exactly 1 hour later. So, we often ADD a bit of random delay before accepting an authentication (or add it to the expiration time of the ipaddress), to spread the load a bit. The larger the number of clients and server, the worse this becomes. Just ask any authentication-admin or mail-admin of a large corporation (eh ... Netscape ?). I noticed that I made the code a bit too complicated for this case (we use 2 different random number generators, that's why you see 2 rand's inside the formula), so this might be a built easier to understand (from the patch) : + jitter = -jitter + (rand() % (2 * jitter)); You can also make it a positive only delay (comment 4) : + jitter = rand() % jitter; The 5% rule was just an attempt to calculate a delay that was small for someone with a small timer, and large for someone with a large one. The same thing with the MAX operation (30 seconds here). It can easily be changed. A few seconds might be enough for most users (10 minute default polling), a user with a 1 minute timer probably wouldn't want to see any difference. The point of this algorithm was that it would be fair for everyone, so that it could be used by default. Nobody is going to complain that their 10-minute timer is going to fire a few seconds earlier or later. If you're really afraid about that, change the maximum value of 4 seconds or so. What's the big deal about that ?
(In reply to comment #26) > For example, reboot an access server, and you'll see that all 50.000 users > will auto-login at exactly the same time. I've seen this when AOL Instant Messenger has server problems. Takes hours for everyone to be able to get back online.
Scott, Since 1.7 is the long life branch. Will we be making this part? As is? Modifications?
Product: MailNews → Core
Any change this will be used in Thunderbird 3 ? I'm not asking for a blocking? flag yet.
this just fell off the checkin radar - I'll check if the patch still applies, unless Robert wants to do that...
David, Robert apparently not interested.
Severity: trivial → minor
Keywords: checkin-needed
QA Contact: esther → backend
(In reply to comment #30) > this just fell off the checkin radar - I'll check if the patch still applies, > unless Robert wants to do that... > The patch doesn't apply.
Flags: wanted-thunderbird3?
Keywords: checkin-needed
Whiteboard: [patch bitrotted]
Product: Core → MailNews Core
Attached patch proposed fixSplinter Review
Somehow I doubt the previous (reviewed) patch ever compiled. Here's an unbitrotted version though. I set the pref to true by default.
Assignee: robert → mkmelin+mozilla
Attachment #144012 - Attachment is obsolete: true
Attachment #344354 - Flags: superreview?(bienvenu)
Attachment #344354 - Flags: review?(bienvenu)
Flags: wanted-thunderbird3? → wanted-thunderbird3+
Whiteboard: [patch bitrotted]
Comment on attachment 344354 [details] [diff] [review] proposed fix + // Check if we should jitter. + nsCOMPtr<nsIPrefBranch> prefs = do_GetService(NS_PREFSERVICE_CONTRACTID, &rv); + if (NS_SUCCEEDED(rv) && prefs) You don't need to check both - I'd just suggest checking prefs.
Attachment #344354 - Flags: superreview?(bienvenu)
Attachment #344354 - Flags: superreview+
Attachment #344354 - Flags: review?(bienvenu)
Attachment #344354 - Flags: review+
changeset: 811:17945ea26634 http://hg.mozilla.org/comm-central/rev/17945ea26634 ->FIXED
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 3.0b1
(With nit fixed, and the jitter declaration moved down a few lines.)
Depends on: 488979
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: