Closed Bug 409259 Opened 15 years ago Closed 13 years ago

Thunderbird timeouts of IMAP operations

Categories

(MailNews Core :: Networking: IMAP, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b3

People

(Reporter: thom, Assigned: Bienvenu)

References

(Blocks 1 open bug)

Details

Attachments

(5 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11
Build Identifier: 2.0.0.9 (20071031)

There is apparently an internal default timeout in Thunderbird for IMAP
operations - after 60 seconds, if any particular IMAP operation has not
completed, Thunderbird simply drops the IMAP connection, re-opens
another one, and immediately issues the same command again. This is very
bad behaviour on numerous counts. I'm not sure if the particular timeout
encountered here is listed on this site, but it might be:
   http://kb.mozillazine.org/Modify_Thunderbird_settings

For example, one customer of ours is testing deleting large numbers of messages.
Thunderbird by default does a message "Delete" by actually *copying* the
message to the Trash folder, then setting the \Delete flag on the
messages in the original folder (for example, the INBOX). In doing this,
it is issuing IMAP copy instructions on a large range of messages in a
single operation. The IMAP protocol does not allow for any
"processing..." response (mid-transaction) in this, only OK, NO, or BAD upon completion/termination of the operation.

If these operations take more than 60 seconds to complete, Thunderbird simply
gives up and issues the same command again, over and over. As the server
may not even now at this point that the connection is dropped, it may
actually have multiple simultaneous COPY or APPEND operations initiated.

Following is an example from a CommuniGate Pro log for a copy of messages to the mailbox "INBOX/TestMailbox", notice the 1-minute intervals:

10:53:53.76 5 IMAP-45600([192.168.5.2]) inp: 25 uid copy 5708:14166
"INBOX/TestMailbox"
10:54:56.24 5 IMAP-45609([192.168.5.2]) inp: 7 uid copy 5708:14166
"INBOX/TestMailbox"
10:56:00.09 5 IMAP-45610([192.168.5.2]) inp: 7 uid copy 5708:14166
"INBOX/TestMailbox"

There are probably a few recommended behavioural changes here:

1. IMAP has no timeouts defined at all, in the protocol:
     http://www.faqs.org/rfcs/rfc3501.html

It would probably be better/more correct to use timeouts only at the TCP
level, and not at the IMAP application layer, since there is no support in IMAP to notify either party (server/client) of timing.

2. If you really think that timeouts are needed, the default ones should be
increased significantly (5 minutes or longer at a minimum). However, logically, there will always be some operation which could be issued which would exceed whatever timeout was set. IMAP COPY or APPEND operations can be very resource expensive and time consuming, for obvious reasons. Modern mailboxes may be many GBs in size.

3. If a timeout is signaled, the client should not simply issue the same
command again, as it can only make server performance degrade in this scenario.

4. It might be better if Thunderbird performed a copy operation instead
on each message, rather than on ranges of messages. This would increase
the number of transactions, but would ensure that each operation was
acknowledged individually (and would allow Thunderbird to display to progression of messages being moved to the destination mailbox).


Reproducible: Always

Steps to Reproduce:
1. Create a very large mailbox - one in which performing an IMAP COPY or APPEND will take longer than 60 seconds on the server to complete.
2. Copy the messages in this mailbox to another mailbox (or alternately, just use the Thunderbird Delete operation, which by default copies the messages to the Trash mailbox).
3. In most cases, the operation will never complete, or Thunderbird will lock-up/freeze.
Actual Results:  
Operation never completes successfully, and the command may be automatically repeated indefinitely by Thunderbird (thereby potentially greatly increasing the load on the IMAP server).

Expected Results:  
The IMAP COPY or APPEND should complete eventually.

This bug is probably related to many others I found in the database, though none exactly laying out the underlying problem, which is that the IMAP protocol does not provide for timeouts or mid-transaction dialog. Therefore, these timeouts should probably be removed from Thunderbird. Here are some other bugs likely related:

https://bugzilla.mozilla.org/show_bug.cgi?id=406929
https://bugzilla.mozilla.org/show_bug.cgi?id=327312
https://bugzilla.mozilla.org/show_bug.cgi?id=325063
https://bugzilla.mozilla.org/show_bug.cgi?id=390257


Another optimization that could be made in this area, though with more discussion really needed because of the change in behaviour, is that the "Trash" mailbox should really be a "virtual" mailbox listing all of those messages marked \Delete, without actually having to perform an IMAP COPY operation to a real mailbox called "Trash". This is highly inefficient as a default behaviour. (Note: some users may prefer to have an actual "Trash" mailbox as a storage space, so this behaviour could be retained as an option.)

Thank you very much. Please feel free to let me know if you have any questions, or need more information.
Assignee: nobody → bienvenu
Component: General → Networking: IMAP
Product: Thunderbird → Core
QA Contact: general → networking.imap
I can produce this same behavior when Thunderbird waits for user confirmation of an expired security certificate. Check of any other IMAP mailboxes being checked while waiting for the confirmation will timeout if the user doesn't act within a minute.
I can reproduce this at work also. 

We have a few work accounts that have a rather high flow of messages both in and out. Any sort of bulk operation will normally cause Thunderbird to crash after awhile. The requested operation never seems to complete.

Also as a side note, if you try to adjust the timeout setting it doesn't keep your changes:

user_pref("mail.server.server2.timeout", 600);

once you auth with the IMAP server, it reverts to this:

user_pref("mail.server.server2.timeout", 29);

and it saves it to the prefs.js. 
Please note that Mark Crispin, the author of the IMAP RFC, weighed in on this issue:

http://mailman1.u.washington.edu/pipermail/imap-use/2007-December/000340.html
http://mailman1.u.washington.edu/pipermail/imap-use/2007-December/000336.html

Basically, the general opinion seems to be that an IMAP-level timeout should be eliminated altogether, and simply allow the user to hit "Stop" if they get tired of waiting. Sending a stop to the client will not likely actually send an abort to the server, so the already-issued server operation itself is likely to continue (perhaps if the client logged off and logged back in, this might inform the server to discontinue the operation?). And no matter what, the requested operation should not be executed again automatically, because this is likely to result in duplicated COPY or APPEND operations (which will show as duplicate messages and which may seriously hamper server performance, if the operation is repeated over and over).

Thank you, sincerely.
David, is it worth filing a new bug to discuss comment 3?

Mike, you list several issues. Please file separate bugs so they can be efficiently addressed. And we'll need a talkback id for your crash.
 http://kb.mozillazine.org/Talkback#Getting_an_incident_ID

Colin, see Bug 382815.
Severity: normal → major
I don't think so - requiring the user to hit stop/cancel would be a step backwards in usability, imo. People who want that can set the timeout as large as they want.
(for the impact this seems to have on our IMAP servers, this bug is awfully quiet? could it at least get moved to some other status?).


This is just to confirm that thunderbird 2.0.14 and 2.0.16 (all platforms, i.e. at least Windows & Linux) still have this, and occasionally cause our mail server (Exchange 2003) to go bananas:


The sequence of events is as follows:

* a user uses Thunderbird (TB) version 2.0.0.14+ to COPY a large number
of messages from a SOURCE folder to a DESTINATION folder within his
mailbox

* when a large number of messages to be copied have contiguous ID's, TB
is able to issue a single mass-copy IMAP command for a range of messages
from ID xxxx to yyyyy, instead of copying each message one-by-one.
Because this can include hundreds or even thousands of messages in a
single command, this operation can obviously take a long time.

* after 60 seconds waiting for the copy to complete (60s = the default
timeout in TB), TB drops the connection to the server. However, the copy
operation is still taking place on the server: dropping the connection
does not interrupt the processing, and certainly does not roll back the
operation.

* after the timeout, TB re-tries to copy messages. It checks the SOURCE
folder to see if messages are still there; they do. And TB immediately
sends the mass-copy command again for the same range of 100's/1000's of
messages 

* new timeout with the 2nd mass-copy. The server is now processing 2
mass-copy operations in the background, so the copy operation takes even
longer. Duplicate messages being to appear in the destination folder.

* repeat ad nauseam -- TB does not notify the user of any problem and
just keeps repeating the copy command every 60 seconds

First problem: users get dozens of copies of the same messages they
tried to copy once! In some cases we've seen up to 2 million messages in
the destination folder, and TB is unable to handle that many messages --
users cannot delete them.

Second problem: this eventually affects all users on the server. After a
couple of hours of TB copy-loop, the server is running dozens of
mass-copy operations. It is still handling the load, but other users on
the same server begin to see performance degradation. After a couple
more hours, the maximum number of concurrent IMAP operations is reached.
New IMAP operations are rejected, resulting in other IMAP clients not
being able to access the mailboxes.


Older versions of TB are not affected because when the timeout occurs, they show an error message to the user and do not retry the copy.


Here is a server-side IMAP trace that shows the problem (in our repro
environment) -- NB: I set the connection timeout in TB to 30 seconds
(instead of default 60) to reproduce the problem more easily. I'm
copying 19000 carefully fragmented messages from "looptest1" folder to
"looptest3" folder.

The IMAP ID's of the messages in the source folder must be fragmented in
the following way:

* there must be enough messages with contiguous ID's so that the first
"copy" command takes more than 30 seconds.

* there must be enough sets of ID ranges in all messages to be copied so
that not all ranges can be specified in a single copy and TB must issue
several copy commands. If all messages to be copied can be specified in
a single command, then problem does not seem to show up.


[Will try to attach the server-side IMAP logs as an extra file]
definitely wanted, maybe blocking - I have not been able to reproduce this, however.
Flags: wanted-thunderbird3+
Blocks: 332309
I can confirm this behavior when Thunderbird issues a COPY command with a large
number of messages (e.g. 100 000) to a Courier IMAP server. This results in a
growing number of duplicate messages on the server being created (with
excessive server resources being consumed) until the server quota has been
exceeded or the client is terminated.

Perhaps Thunderbird could (as Windows Mail) display a dialog box to the user
asking whether it should wait for the server to respond, or stop waiting,
instead of unilaterally closing the connection.
(In reply to comment #9)
> I can confirm this behavior when Thunderbird issues a COPY command with a large
> number of messages (e.g. 100 000) to a Courier IMAP server....
What version of TB you are using?
Thunderbird 2.0.0.16 (20080708). We have other users with the same problem, likely using the same version.
Could somebody put this into more reasonable status, please? It looks like a big deal to me.
Flags: blocking-thunderbird3.0b1?
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows XP → All
Hardware: PC → All
As requested per #12: we have reproduced the bug in 
Shredder, 3.0b1pre. Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1b1pre) Gecko/20081003 Shredder/3.0b1pre
Version: unspecified → Trunk
if problem exists on 2.x, version is more useful as unspec or 2.0 rather than trunk
Version: Trunk → unspecified
Pre and post timeout patch server logs, illustrating the effect of the timeout patch.
Patch that attempts to solve the client timeout on copy messages
I have written a small patch that when applied and tested, appears to solve the timeout problem, at least on our test setup at Cern.

The value of the 'tcptimeout' property in mailnews.js and that of two variables in the source were changed from 60 seconds to 40 minutes, a value that is larger than the server timeout. The modifications are detailed in the attached .patch file. 

I have attached output from the server logs. The first section (when timeout values are set to 60 seconds) shows an un-patched Thunderbird client timing out after approximately 60 seconds and repeating the request. The second section (when timeout values are increased) shows a patched client waiting over 7 minutes for the server to notify of a completed COPY request. Both copy operations were attempted with more than 20,000 e-mails

I appreciate that these changes might have other implications or create undesired affects (this wasn't the case with my tests), but would it be possible to make a similar or equivalent modification to prevent these timeouts?

John Hefferman
While the patch in comment 17 is just a ugly workaround (replaces one arbitrary timeout with a larger one), it does indeed fix the problem for us. At least the new timeout is larger than the minimal server-side timeout (30mins) specified in RFC2060, so the server has a chance to terminate eventually-outstanding things more gracefully.

 Could this patch please be applied to the next errata release (2.0.0.X), while the "proper" way of handling this in a later release is being discussed (notify via popup as in TBird-1.5, notify via status bar, silently abandon command ,..)? 

I hope that there is general agreement that the current behaviour indeed is plain wrong and a RFC violation to boot: TBird is not doing LOGOUT - where it would have to wait for the server confirmation - it just drops connection, so repeating a COPY results in a COPY+COPY sequence explicitly mentioned in RFC2060 as a no-no.. and TBird does this by design, so this is not an accidental/occasional violation.

Bug 332309 ("major") and bug 446411 ("critical") are about the same issue, btw.
Downstream (Red Hat) bug with yet another reproducer is at
https://bugzilla.redhat.com/show_bug.cgi?id=465152#c3
I will try to fix this for b2
Flags: blocking-thunderbird3.0b1?
Flags: blocking-thunderbird3.0b1-
Flags: blocking-thunderbird3+
Duplicate of this bug: 332309
Duplicate of this bug: 446411
Product: Core → MailNews Core
Target Milestone: --- → Thunderbird 3.0b3
I never intended that we retry a url more than once, but the state we were using to know that we'd already retried a url was stored in the protocol object, but we were using a new connection to retry, so that wasn't so helpful. This patch makes it so we also pay attention to the per-url flag, which should prevent endless retries.

This will help a bit, but I still need to deal with the underlying issue of operations on a large number of messages on slow servers.  I could simply not retry move/copies of multiple messages, but since they now tend to be stored as pseudo-offline operations, we will retry them at a higher level. I could increase the timeout when running a move/copy url on lots of messages - I'll probably do that since avoiding the initial timeout is really the only convenient way to avoid duplicates.
Attachment #368382 - Flags: superreview?(neil)
Attachment #368382 - Flags: review?(bugzilla)
(In reply to comment #24)
>
> This will help a bit, but I still need to deal with the underlying issue of
> operations on a large number of messages on slow servers. 

And slow connections?

Which raises an interesting test issue - would be helpful for fakeserver to have  "slow server mode" to simulate such conditions?
(In reply to comment #25)

> 
> And slow connections?
Not so much, because the timeout only waits for the first little bit of data to come back.
> 
> Which raises an interesting test issue - would be helpful for fakeserver to
> have  "slow server mode" to simulate such conditions?

It would be helpful to be able to simulate a slow server, yes.
Attachment #368382 - Flags: superreview?(neil) → superreview+
Attachment #368382 - Flags: review?(bugzilla) → review+
Comment on attachment 368382 [details] [diff] [review]
fix, part 1 - checked in.

Yep, let's give this a try (r=me).
Attachment #368382 - Attachment description: fix, part 1 → fix, part 1 - checked in.
Attached patch proposed fixSplinter Review
this simply checks if we're running an online imap move copy, and if we're copying a large number of messages, increases the timeout proportionally to the number of messages being moved/copied.
Attachment #371517 - Flags: superreview?(bugzilla)
Attachment #371517 - Flags: review?(bugzilla)
Status: NEW → ASSIGNED
Whiteboard: [has patch for review]
Comment on attachment 371517 [details] [diff] [review]
proposed fix

I've not explicitly tested the > 2400 case in this patch, but it looks to be the right thing to do. r/sr=me.
Attachment #371517 - Flags: superreview?(bugzilla)
Attachment #371517 - Flags: superreview+
Attachment #371517 - Flags: review?(bugzilla)
Attachment #371517 - Flags: review+
Whiteboard: [has patch for review] → [ready to land]
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [ready to land]
If I could make a suggestion, there is also the case of not just large numbers of messages on a slow server, but also fewer numbers of large messages running into this problem.  Could we modify this patch to increase the timeout proportional to the greater of both large numbers of messages and total MBs to move?  I could see 1000 2MB messages taking almost as long to move as 10,000 200KB messages (and this doesn't count the occasional 50-100MB powerpoints-in-email, customer databases, or solidworks models that sometimes float around).

Or would you prefer I open a new bug for that?

BTW, to all involved: THANK YOU SO MUCH FOR WORKING ON THIS!!!!!  ;-)
Just reporting into the _correct_ bug my experience with the patch using Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1b5pre) Gecko/20090509 Shredder/3.0b3pre - please let me know if the patch actually landed in a different build, and I'll retest:

I did a large move of over 1k messages.  I see the "moving messages" status at the bottom for _exactly_ 60 seconds and then it goes away.  Visiting the destination folder for the move shows me only one copy of (as far as I can tell) each message to be moved, so it looks successful.  Reselecting the source folder restarts the move command and a duplicate set of messages appears in the destination folder, and I need to reindex the source folder to remove the move command.  So it appears that although it runs only once, it is still considered a failed command.

I think this is the right build, because this behavior is different than prior behavior - before, it would continually repeat, now it is only running once per visit to a folder so this is an improvement!

However, if I understand the patch correctly, this is not the full expected behavior for the patch.  If I have misunderstood, I'll put my comments into a new bug.

Thanks!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It is the expected behavior of the patch - the patch only kicks in if you have over 2400 messages, since we already have a 60 second timeout, and we figured 40 messages per second was fairly slow on average.  It sounds like your server is *really* slow. Looking at the total message size might deal with your issue - perhaps file a new bug on that...
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Why are you repeating the commonds again at all?  The commands were already submitted and the server is acting on them.  If you hit the time out sending the command, what good it resubmitting the exact same commands again?  Logically won't they fail again?  It is not the time out that is the main problem, it is the repeat of the commands.  I did some testing with Exchange 2007 in our test environment.  I had a folder with 58782 messages in it.  I deleted all of them at once.  **Before you say that is an unreasonable number of messages in a folder and an unreasonable number of message to delete at once, yes it is a rare scenario, but we have users with that number of messages in a folder (and much much more) and users often do a select all delete, copy or move.**  I had already set the timeout to 1500 (this is more then your code would add - 1469.55) and the entoer process completed in 22 minutes, just under the 25 minutes that 1500 second represents.  So there was no repeat, because the timeout was not reached.  Just for fun I preformed the same 58000 message dleete on two users at the same time on the same Exchange 2007 Client Acess Server (CAS).  The repeat bug hit, but it seems to have happened some time during the purge process, since the actions eventually did complete, but now look at the message counts for hte users:

DisplayName   Mailbox Size(    ItemCount LastLogonTim LastLogoffTi
LastLoggedOn
                        MB)              e            me
UserAccount
-----------   -------------    --------- ------------ ------------
------------
Jool Xythos           10616       174270 6/4/2009 ... 6/4/2009 ...
TESTEXAD\...
Stark Xythos           9706       161063 6/4/2009 ... 6/4/2009 ...
TESTEXAD\...


The starting point for both was - 3470 MB 58782 messages.

This bug cannot be fixed by modifying the time out, you need to stop the command repeat.


---
 Chris Andrews
 Boston College
 Collaboration Team
Another option is why not have Thuderbird break unreasonable groups of messages into reasoble ones.  Why not break massive move or copy commands into groups of 1000 or 100?
Breaking it up sure would solve my slow mail server problem - since each command would be under the 1 minute timeout.  Even better would be if the "block" of moves/deletes were configurable, you could default to something like 10k for most servers, but I could set it to 100 if someone were on a particularly overburdened system.
I tried to use b3 against my Exchange server (my INBOX contains about 120000 messages) and Thunderbird just gives up trying to fetch the mails.

In fact none of the mail clients seem to work except for Alpine and KMail.  I'll try to post Thunderbird debug log soon.
Duplicate of this bug: 476338
Unfortunately this issue does not seem to be fixed in 3.0.4. Please see my comment to bug 403603.
It is quite a serious issue: I have to fight with it when archiving a month's messages almost every month. 
Which release version contains the patch?
(In reply to comment #40)
> Unfortunately this issue does not seem to be fixed in 3.0.4. Please see my
> comment to bug 403603.
> It is quite a serious issue: I have to fight with it when archiving a month's
> messages almost every month. 
> Which release version contains the patch?

This particular bug was fixed for Thunderbird 3 beta 3. If you are seeing a similar bug please file a new bug rather than commenting on closed bugs - comments on closed bugs are likely to be lost - getting an imap log of the timeouts would also be useful: https://wiki.mozilla.org/MailNews:Logging
(In reply to Mark Banner (:standard8) from comment #41)
> (In reply to comment #40)
> > Unfortunately this issue does not seem to be fixed in 3.0.4. Please see my
> > comment to bug 403603.
> > It is quite a serious issue: I have to fight with it when archiving a month's
> > messages almost every month. 
> > Which release version contains the patch?
> 
> This particular bug was fixed for Thunderbird 3 beta 3. If you are seeing a
> similar bug please file a new bug rather than commenting on closed bugs -
> comments on closed bugs are likely to be lost - getting an imap log of the
> timeouts would also be useful: https://wiki.mozilla.org/MailNews:Logging

I disagree with the notion that "this particular bug" was fixed for any version of Thunderbird.  A workaround for a restricted set of circumstances in which it was exhibited was committed, but that only solved a small number of instances of the problem described in the original problem report, which is this: the IMAP protocol does not allow client timeouts, IMAP clients *should not timeout* but should wait for the server to complete (preferably while providing feedback for what is happening and allowing the user an option to cancel), as described in the description and comment #3.  The original report gives COPY operations on large numbers of messages *as an example*, but this is the only part of the problem that was actually fixed.  However, there has never been a solution for SELECT commands taking a long time to respond (as they may well do with large mailboxes stored in formats where the server must scan them to build an index during the SELECT operation... I can show you a log of that happening, if you like), or for *any other operation that may take a long time to complete* except copy operations.

Bug should be reopened, as it has never been fixed, only a patch for one of the more common symptoms applied.
Blocks: 492672
You need to log in before you can comment on or make changes to this bug.