Closed Bug 564256 Opened 14 years ago Closed 9 years ago

possible prompted major update gotcha for enterprises, of high resource impact to mail servers and/or file servers, if autosync and gloda (global indexing) enabled

Categories

(Thunderbird :: General, defect)

Priority: Not set
Severity: normal

Tracking

(blocking-thunderbird3.1 rc1+)

Status: RESOLVED WORKSFORME

People

(Reporter: dmosedale, Assigned: wsmwk)

References

(Blocks 1 open bug)

Details

(Whiteboard: [non-code])

I was just talking to someone who works at a large enterprise site that has Thunderbird deployed, and he pointed out an interesting possible gotcha with our major update plan.

Once a major update from Tb2 is offered, either by us or, for sites that change their update URLs to point at a local server, by site admins, a large number of users are likely to update at once.  Most of those installs will then likely download all of the user's mail for indexing, which is likely to exert a significantly heavier load on the mail servers, all at once, than they normally have to bear.  If the servers aren't sufficiently powerful, this could bring them to their knees, possibly until most of the clients have downloaded most of the mail, which could conceivably take many hours.

It's not obvious to me how we'd figure out how likely that scenario is, or how many installations it would affect.  Do folks have ideas there?

One way I could imagine addressing it would be to try to detect slow performance or repeated connection loss and do exponential backoff, if we don't already.  I suspect there are other mitigation / prevention strategies as well, and would love to hear them.
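For illustration, a minimal shell sketch of the backoff idea (conceptual only, not Thunderbird code; fetch_next_message is a hypothetical stand-in for a single autosync download step):

    # Conceptual sketch of exponential backoff (bash); fetch_next_message
    # is a hypothetical stand-in for one sync step.
    delay=5
    until fetch_next_message; do
        sleep "$delay"                      # wait before retrying
        delay=$(( delay * 2 ))              # double the wait on each failure
        [ "$delay" -gt 600 ] && delay=600   # cap the wait at 10 minutes
    done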
I am under the impression we can offer the upgrade to only some percentage of the pingers at a given time.  Perhaps we could also provide tips to large enterprises that run their own update server on how to do the same.

We haven't had the resources to verify various suspicions about bad behaviour by autosync as is.  Trying to change it to back off is likely to just lead to bizarre and horrible problems, assuming we even had the time for that.
Adding gozer to find out more about how the distribution algorithm works and what our options are there.  Is this just round-robin DNSing, or is there more going on?
(In reply to comment #2)
> Adding gozer to find out more about how the distribution algorithm works and
> what our options are there.  Is this just round-robin DNSing, or is there more
> going on?

There is a throttle on the update server side of things. Basically, you can choose to randomly return the update offer to only 10% of users.

That's the only mechanism that's there on our end.
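Conceptually (a sketch only, not the actual update server implementation; the two serve_* helpers are hypothetical placeholders), the throttle amounts to:

    # Conceptual sketch (bash) of a random update throttle.
    THROTTLE_PERCENT=10
    if [ $(( RANDOM % 100 )) -lt "$THROTTLE_PERCENT" ]; then
        serve_update_xml   # return the update offer
    else
        serve_no_update    # return an empty update response
    fi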
asuth is right that autosync has enough problems as it is; I'm struggling to write tests for the fixes for those problems, and some of those fixes aren't going to make 3.1.

Autosync is a lot less aggressive about consuming server resources than it could be. It only does stuff on user idle, and it only downloads from one folder at a time. The main complaints we've heard have been about buggy behavior, like trying to download the same message over and over again, which should be fixed in 3.1.

Ironically, non-modal alerts mean that we will keep trying to download, even after connection failures.
xref bug 541209
Seems we try to autosync all folders, and not just a subset.
Another mechanism we could mention, for cases where the mail server daemons themselves are not capable of avoiding pathological behavior, would be to use traffic shaping to slow down the message retrieval process and thereby slow down disk accesses.  For example, on Linux, the "tc" command can be used to accomplish this without needing to alter the daemon configuration.  A token-bucket based implementation could effectively throttle only the autosyncing Thunderbirds with no penalty to other clients.
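As a rough sketch of that approach (the interface name, ports, and rates are assumptions; adjust for the server in question):

    # Sketch of token-bucket shaping with tc/HTB on the mail server's
    # egress interface; eth0 and the rates are assumptions.
    tc qdisc add dev eth0 root handle 1: htb default 10
    tc class add dev eth0 parent 1: classid 1:10 htb rate 1000mbit
    tc class add dev eth0 parent 1: classid 1:20 htb rate 10mbit ceil 10mbit
    # Steer only IMAP traffic (source ports 143 and 993) into the
    # throttled class, leaving other services unaffected.
    tc filter add dev eth0 parent 1: protocol ip u32 match ip sport 143 0xffff flowid 1:20
    tc filter add dev eth0 parent 1: protocol ip u32 match ip sport 993 0xffff flowid 1:20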

Given that autosync only operates on a single folder at a time, and that autosync having an IMAP connection sitting in a folder is no different from the UI having an IMAP connection sitting in a folder, the continuous message streaming is the only thing that differentiates autosync from normal interactive activity.  That streaming is characterized by disk reads and network transfer, so by throttling the streaming, and thereby the reads, we effectively reduce the load.

That covers the IMAP server side of the equation.  If the user's profile is on network storage, the I/O is also going to be related to gloda activity.  If gloda is waiting on autosync to feed it, the IMAP throttling above will also cover gloda.  If gloda has plenty on its plate, it will likely run at the limits of the network file system.  Similar throttling could be performed there, although there would likely be more collateral damage to other non-gloda filesystem activities.  Ideally organizations for whom that is a major problem will just have disabled gloda.
I've added Roland to the CC, as I'd be very interested to know if he's heard of any cases of folks upgrading from Thunderbird 3 and hitting this problem.

Given the comments above, I think it's reasonable to assume that this is unlikely to be a problem in most cases, and to block on documenting the alternatives for admins, so they are aware ahead of time and know what their options are.  Giving to jenzed.
Assignee: nobody → jenzed
blocking-thunderbird3.1: ? → rc1+
Whiteboard: [non-code]
It would also be nice if, in addition to the documentation, we pointed to it from a blog post early enough that sysadmins can look at their options and be ready when 3.1.0 or 3.1.1 is released.
Assignee: jenzed → vseerror
Searched for denialofservice and "denial of service" tags and text but didn't find anything in GS; will search for more and leave another comment.
The ideas and logic above are reasonable, and perhaps taken together they result in a reasonable outcome for most institutions. I'm not sure either way, because we don't know a great deal about enterprise deployment in practice - particularly, how tightly enterprises control major updates. But so far the best options do seem to be education/documentation and, if necessary, the update server throttling mentioned by gozer.

   A scenario to consider (not worst case, but reasonable) - a single 
   site (same timezone) institution of 4,000 that doesn't control 
   major updates via pref or update server. What happens in the time 
   frame of 8am-8:30am when some percentage, say 25%, of the workforce 
   starts Thunderbird and the update is being offered? 

To summarize some of the information above...

Communication: 
- blog post, for 3.1.0, for 3.1.1
- release note?
- want more references and tips for our enterprise documentation, like https://developer.mozilla.org/en/Setting_up_an_update_server. Central doc locations are:
 * https://developer.mozilla.org/en/Thunderbird/Deploying_Thunderbird_in_the_Enterprise
 * https://wiki.mozilla.org/Thunderbird/Enterprise
While the documentation uplift is in progress I will dig a little to see if we can get a better picture of what enterprises think and do.

Assumptions:
- autosync and gloda are ENABLED
- changing autosync code to affect the streaming rate seems unwise
- autosync is not aggressive, active only on user idle (but surely for most people Thunderbird is mostly idle?)
- autosync does one folder at a time
- disk/network file load affected by both sync activity and gloda activity, throttling sync effectively throttles gloda

Variables controlled by the institution:
- app.update.mode
   1 (default): no prompt, unless there is add-on incompatibility
   2: prompt about major releases
   3: prompt about minor and major releases
- app.update.enabled (hopefully true)
- profile location
 
Mitigation ideas:
- traffic shaping at the client
- enterprise serves Thunderbird updates via their own update server (see the sketch below)
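As a sketch of what the update-server mitigation looks like on the client side (the install path and URL are placeholders, and the snippet targets a typical Linux install; see https://developer.mozilla.org/en/Setting_up_an_update_server for the server side):

    # Hypothetical AutoConfig setup locking update prefs; paths and the
    # update URL are assumptions -- adjust for your deployment.
    cat > /usr/lib/thunderbird/defaults/pref/autoconfig.js <<'EOF'
    pref("general.config.filename", "mozilla.cfg");
    pref("general.config.obscure_value", 0);
    EOF
    cat > /usr/lib/thunderbird/mozilla.cfg <<'EOF'
    // the first line of an AutoConfig file must be a comment
    lockPref("app.update.mode", 2);       // prompt before major updates
    lockPref("app.update.enabled", true);
    // hypothetical local update server URL:
    lockPref("app.update.url", "https://updates.example.com/update.xml");
    EOF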
OS: Mac OS X → All
Summary: possible prompted major update gotcha → possible prompted major update gotcha for enterprises, of high resource impact to mail servers and/or file servers, if autosync and gloda (global indexing) enabled
Rafael, what would the best approach be?
We've decided not to force offline upgrade by default - see bug 569161, which I believe makes this bug fairly moot.
(In reply to comment #12)
> We've decided not to force offline upgrade by default - see bug 569161, which I
> believe makes this bug fairly moot.

Does this mean we don't need any documentation, relnotes, otherwise?
I would say we need documentation/relnotes etc. for the fact that new profiles will by default have their IMAP folders configured for offline use, and that new IMAP accounts added to existing profiles will also be configured for offline use, meaning we will try to download all your mail. This is the same as 3.0, but a change from 2.x.
This is noted in "New in Thunderbird 3" (http://support.mozillamessaging.com/en-US/kb/New+in+Thunderbird+3, "With new IMAP accounts, offline folders are enabled by default. When necessary, existing IMAP folders are configured to be offline. Disable either in the individual folder's properties or in Account Settings | Synchronization & Storage. (Click the Advanced button on the Synchronization & Storage page to access a folder list that allows you to specify a group of folders for offline storage.)")

I am working on a "New in Thunderbird 3.1" page, but I don't want to include changes that we already documented for 3.0 - it's too much information for people upgrading from 3.0. I link to the related 3.0 pages from the 3.1 page - is that good enough?
Sounds good enough to me.
So is there anything left to do?
Hardware: x86 → All
Version: unspecified → 3.0
(In reply to :aceman from comment #17)
> So is here anything left to do?

Today, I don't think so. Several modifications were made to the migration assistant and friends (now removed of course). And https://developer.mozilla.org/en-US/docs/Mozilla/Thunderbird/Thunderbird_in_the_Enterprise has lots of good information, some of it relevant to this bug. And most of this ship has sailed long ago.

As it turned out, I don't think we got much feedback about overloaded mail servers.  There was, however, feedback about disk usage from enterprises that put their profiles on network disk shares.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME