Closed Bug 392889 Opened 17 years ago Closed 16 years ago

Race condition in interval-checking of large mailserver mailboxes

Categories

(MailNews Core :: Networking: IMAP, defect)

1.8 Branch
x86
Windows 2000
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: ken2006, Unassigned)

Details

(Keywords: hang, regression, Whiteboard: closeme 2008-10-25)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
Build Identifier: 2.0.6

Mail server with many, large email boxes, and an inbox-check interval of 1 minute, causes an apparent race condition, whereby CPU spins and blocks the UI, while the unread-message count (folder pane) on one of the folders can be seen re-counting all the messages at one minute intervals.

-Only tested with 1 minute checking interval - did not try longer ones.
-Only occurred after upgrading from 1.5 to 2.0
-Only observed with IMAP, did not test POP
-I am not certain that the described message-recounting is the cause of the race condition; however, it occurs in sync with that process.

I am able to provide stack and network traces upon request.

Reproducible: Always

Steps to Reproduce:
1. Connect to an IMAP server with ~20 mailboxes and an average size of >300MB per mailbox.
2. Set inbox check interval to 1 minute.
3. Start TB, symptom occurs immediately (upon stat of inboxes)

It may be easier for me to provide stack/net traces to virtually reproduce this.
Actual Results:  
CPU spin, UI blocks input. Presumably the inbox checking layer is in a race condition while the last check is completing after the interval, or the UI is racing while waiting on these conditions to complete.


ONLY occurs on TB 2.x; I had to roll back to 1.5 to avoid this issue (I am able to reload 2 as-needed to perform diagnostics)
(In reply to comment #0)
> Mail server with many, large email boxes, and an inbox-check interval of 1
> minute, causes an apparent race condition, whereby CPU spins and blocks the UI,

"blocks the UI" part involves similar issue to Bug 384360. 
> Bug 384360 Message move blocks whole app, due to long re-sync time (move to not-opened large IMAP folder) 

I think an "interval of 1 minute" for large IMAP mail folders is similar to scheduling a "Defrag request of HDD" every minute. Still, I believe the CPU spin (thrashing-like) and the blocked UI should be avoided, even if the user has requested more work than can be completed in time, which causes the "apparent race condition".
I don't know whether rejecting or queuing the request is better in this case.

To Ken Johanson(bug opener):
Are you simply reporting a bug that occurs in a special situation?
Or do you want a solution to your problem, or help to solve/bypass it?
If the latter, read through at least the following MozillaZine Knowledge Base articles:
  http://kb.mozillazine.org/Keep_it_working_%28Thunderbird%29
  http://kb.mozillazine.org/Performance_%28Thunderbird%29
And take action to avoid unwanted problems.
WADA, I am "reporting bug".

The analogy to defrag is incorrect, being an extreme exaggeration. Mail servers (and mail clients) cache state (and OSes cache file stats), and use callbacks to detect when folder state has changed. They don't scan every block of their hard drive. Moreover, that analogy implies that the mail server or the size of the messages is the (albeit innocent) cause of this, BUT this problem does NOT happen on TB 1.5 or any of several other mail clients I checked.

The only suggestion in the FAQ that could be considered helpful in this case is 'keep inbox empty'. I do that, and my other folders are where the large message counts are. However, it would be incorrect to assume or imply that one should keep other folders small (or even the inbox, FTM):

-In memory state engines, callbacks/listeners, and hashtables obviate it.
-With any quality mail server (and client) one can EASILY have inbox sizes that exceed 1 GB and, except for searches and initial startup, barely notice a performance slowdown (And my TB 1.5 DOES work this well).
-Many people want to (and do) keep a running archive of communications they have made, for legal and referential (etc) purposes. (I have OVER 10GB total of email boxes for various business and legal accounts)
-Good mail clients won't corrupt the inbox file even during a crash (all modern filesystems provide journaling, and IO APIs allow transactions/recovery)
(In reply to comment #2)
> WADA, I am "reporting bug".

I see.

> inbox-check interval of 1 minute, causes an apparent race condition

I think there is at least one problem in the scheduling of the periodic download, and it is one of the main reasons for contention (and then for other unwanted problems).
  Scheduled like setInterval(Download action,N minutes);
I think this is probably improved by:
  Schedule like setTimeout(Download action,N minutes);
  At end of Download action, schedule next by setTimeout(...,N minutes);
  (Long download time won't cause contention with next download)
There is no need to reject or enqueue requests when it is done this way.
I think the above is also applicable to other interval scheduling, such as Junk Purge.
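
The chained-timeout idea above can be sketched in JavaScript. This is only an illustration of the scheduling pattern, not Thunderbird's actual mail-check code: `scheduleChained`, `task`, `done`, and the injectable `timer` parameter are hypothetical names, and the task is assumed to call `done` once its download has finished.

```javascript
// Minimal sketch, assuming a callback-style task (hypothetical API).
// The next run is armed only after the current run reports completion,
// so a slow run can never overlap with the following one.
function scheduleChained(task, delayMs, timer = setTimeout) {
  let stopped = false;

  function run() {
    if (stopped) return;
    let finished = false; // guard against `done` being called twice
    task(function done() {
      if (finished || stopped) return;
      finished = true;
      timer(run, delayMs); // re-arm at the END of the task
    });
  }

  timer(run, delayMs); // first run after one delay
  return {
    stop() {
      stopped = true;
    },
  };
}

// Usage (hypothetical): check mail one minute after the previous check ends.
// const handle = scheduleChained((done) => checkAllFolders(done), 60 * 1000);
// handle.stop();
```

With a fixed `setInterval`, a check that takes longer than the interval would be started again while the previous one is still in flight. With the chained form, the effective period becomes the check duration plus the delay, which trades freshness for the absence of overlapping checks.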

> CPU spins and blocks the UI

As written in Bug 384360, I think the main cause is the long re-sync time for a big mailbox. And, if only "Inbox" is involved in the mail check, "keeping the Inbox folder open" can be a practical workaround.
Are many large mail folders checked on each interval in your case?
Do you have filters which move a mail to a huge mail folder? (similar to Bug 384360)

The "blocks the UI" part is a different problem (a task dispatching problem), although it may not occur when opening a large mail folder if Bug 384360 is resolved.
> ONLY occurs on TB 2.x
The much more frequent mail db close (and then re-open and re-sync for IMAP), due to a design/logic change in Tb 2.0, is probably one of the main reasons.
Read Bug 347837 Comment #2 for the above, please.
Another possible workaround for the "race condition when 1 minute interval".
Does your IMAP server support IDLE?
http://email.about.com/od/emailbehindthescenes/g/imap_idle.htm
> Only tested with 1 minute checking interval - did not try longer ones.

What is minimum interval (larger than 1 min) at which it functions properly?
Keywords: hang
Version: unspecified → 2.0
(In reply to comment #7)
> 
> What is minimum interval (larger than 1 min) at which it functions properly?
> 

I am not sure*, but I think that measurement (minimum interval) will depend wholly on the speed of the client/server and the number and size of mailboxes. I CAN say that I have about 20 accounts (mailing lists etc) and I do not delete messages from them (for archival/reference). Each account contains many thousands of messages on average.

I am not sure yet but I don't think my IMAP server supports IDLE.

I think it's safe to say that the design change above (apparently explicit closure of the connection instead of the old session caching) is the cause of this. I also think it can affect anyone with large mailboxes, a slow server or connection, or a slow client CPU, and that the symptom's threshold will be completely dependent on those factors, certainly if their server does not support IDLE.

Does anyone know why the old behavior (session-pooling, if you will) was intentionally replaced by this db+connection re-opening? Was it to better handle concurrent mailbox use by multiple clients?
I wouldn't presume that one of these bugs might be equivalent to your issue - hence my question of at what interval value the problem goes away - IF it even does go away.  And, given odd bug reports about 2.0.0.6, do you know that this doesn't happen in 2.0.0.5?  Lastly, I would test a trunk build to see if it isn't already fixed on trunk, and if it is we might identify the fix to get it moved into 2.0.

regardless, this smells of regression
Keywords: regression
The one other reason I hesitate to test by changing the check interval is that most of my ~20 mailboxes have an interval of 1 minute. I am not sure how I can test effectively so that my results are meaningful to you (given variances in mailbox size, server, and client). Suggest a strategy (given my 20 mailboxes @ 1 minute each) and I will do my best to test it.

Also by nightly trunk, does that suggest trying the nightly/3.x build? Would it suffice if I test with that and not a 2.x branch?
use nightly.
test at 1, 2, 5 minutes.
post results please

refresh my memory please, in a sentence - is cpu high, how high?
Ken, please post results from trying an early release. thanks. http://www.mozillamessaging.com/en-US/thunderbird/early_releases/
Component: General → Networking: IMAP
Product: Thunderbird → Core
QA Contact: general → networking.imap
Whiteboard: closeme 2008-10-25
Version: 2.0 → 1.8 Branch
RESO INCO per lack of response to last question. If you feel this change was made in error, please respond to this bug with your reasons why.
Status: UNCONFIRMED → RESOLVED
Closed: 16 years ago
Resolution: --- → INCOMPLETE
Product: Core → MailNews Core