Closed Bug 538283 Opened 15 years ago Closed 12 years ago

CPU went to 100% while sending message or while uploading messeage via TLS/SSL or StartTLS connection.

Categories

(MailNews Core :: Networking, defect)

1.9.0 Branch
x86
All
defect
Not set
major

Tracking

(Not tracked)

RESOLVED WORKSFORME
Thunderbird 11.0

People

(Reporter: nemanja.simovic, Unassigned)

References

()

Details

(Keywords: perf, Whiteboard: [fixed by bug 674147])

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; sr; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; sr; rv:1.9.1.5) Gecko/20091204 Thunderbird/3.0

When I finish composing message and press send button CPU utilization goes to 100%. It happens every time. I'm using Serbian version of Thunderbird. If I send a larger message it holds CPU on 100% at all time. For example, if it takes 5 minutes to send message he keep CPU at 100%.

Reproducible: Always
Nemanjan cdoes this happens in -safe-mode (http://kb.mozillazine.org/Safe_mode) ?
Keywords: perf
(In reply to comment #1)
> Nemanjan cdoes this happens in -safe-mode (http://kb.mozillazine.org/Safe_mode)
> ?

This is happening in safe mode to. 

I've recorded it with Screen2Exe, and you can see it at this address: http://download.cacafaca.gotdns.com/temp/thunderbird_ocupies_cpu_while_sending.exe
(In reply to comment #2)

> I've recorded it with Screen2Exe, and you can see it at this address:
> http://download.cacafaca.gotdns.com/temp/thunderbird_ocupies_cpu_while_sending.exe

Can't read that :-(. Are you using an antivirus ? and did you by any chance set it to scan outgoing emails ?
(In reply to comment #3)
> (In reply to comment #2)
> 
> > I've recorded it with Screen2Exe, and you can see it at this address:
> > http://download.cacafaca.gotdns.com/temp/thunderbird_ocupies_cpu_while_sending.exe
> 
> Can't read that :-(. Are you using an antivirus ? and did you by any chance set
> it to scan outgoing emails ?

I've tested it both with and without antivirus and it's the same result. Anyway I didn't set antivirus to scan outgoing mail. In that case in process list antivirus should take most of the time. But that's not the case.

Here is the a same video, but on another location: http://www.kvartnekretnine.co.yu/down/thunderbird_ocupies_cpu_while_sending.exe (~800KB).
This should work.

Regard
I's easier on YouTube, for people to view it: http://www.youtube.com/watch?v=8VIn98cGAJ8
(In reply to comment #5)
> I's easier on YouTube, for people to view it:
> http://www.youtube.com/watch?v=8VIn98cGAJ8

Yes the issue is clearly visible. Mark any idea what could be going on here, or shall I ask david ?
Component: Message Compose Window → Mail Window Front End
QA Contact: message-compose → front-end
Only thing that is coming to my mind in that something wrong with thread when Thunderbird is sending data to server. I don't believe that SSL/TSL type of connection that I'm using has some something to do with it.
Version: unspecified → 3.0
Component: Mail Window Front End → Message Compose Window
QA Contact: front-end → message-compose
This occurs for me too, using English Thunderbird 3.0.1 with an SSL/TLS SMTP server.
Mail sending roughly consists of;
1-a. (Generate mail data stream in %temp% and save in Drafts, if requested)
1-b. Generate final mail data stream in %temp%
2-a. Pass mail data in %temp% to SMTP server
2-b. Copy sent mail data to Sent folder
At which step is CPU 100% observed?
(1) Compose a mail, "Send Later"
(2) At Outbox of Local Folders, "Send unsent messages"
Just now, I tested again, and it seams to be working normally. I can't remember if I upgrade TB, but current version is 3.0.1. I was using GMail as my POP3 server before and now, and didn't change anything.
The problems only seems to occur for me when sending large messages with attachments to the SMTP server, so probably strep 2b. I'm also using GMail as my SMTP, but a different serer for POP3.
I have got the same problem with TB 3.0.3:
When sending large e-mails (with large attachments, e. g. 5 MB), the CPU goes up to 100 % and freezes Thunderbird.
Normal (small) e-mails (without attachments) don't cause this problem.
Same problem for me with Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.8) Gecko/20100227 Lightning/1.0b1 Thunderbird/3.0.3.

Is there a reason this bug has not yet been confirmed? The comments suggest lots of people are having this problem.
I can confirm this behaviour. It is (in my case, that needs to be tested by someone else) restricted to sending e-mails with SSL/TLS or STARTTLS encryption method.

I demonstrate this behaviour in the video below (links to blip.tv a free video site, no login needed):

http://blip.tv/file/3395250

TB3 is 3.03, OS is WinXP SP3 and CPU is AMD Athlon64 3500+, RAM is 1,5GB.
I have the same problem with TB 3.0.3 when I try to send email over an OpenVPN link to SMTP server. However accessing the same server over a local LAN link causes no problems. Previously I had been using TB 2.x for over a year with the same network setups with no problems.
Exactly the same behavior as 3r3b0s is seeing. TB 3.0.4, Win XP SP3. Come on guys, this bug is now 4 months old and *still* unconfirmed!
Ho can set it to CONFIRMED? :)
When we can easily reproduce and thus fix it.
Same here on Win XP SP3 TB 3.0.4, problem exists since upgrade from TB2 to TB3.

Please fix this, this is really annoying! Just google "thunderbird attachment cpu" and you'll see that this is quite common!
(In reply to comment #19)
> google "thunderbird attachment cpu" 

so this happens only if there is an attachment?
or, does it happen without an attachment and happens only on large messages?
(In reply to comment #20)
> so this happens only if there is an attachment?

yes.

> or, does it happen without an attachment and happens only on large messages?

only on attachments. No troubles without attachments, but as soon as I attach something, CPU goes up for minutes until the message is finally sent. As soon as the mail is delivered, CPU usage is down again.
"With attachment or not" itself is irrelvant for Mail client(Tb) who simply passes whole mail data to SMTP server and simply passes copy of the sent mail to IMAP server. Mail client(Tb) simply passes whole/large mail data to SMTP server or IMAP server.
If "existence of large attachment in mail" is rellevant to your problem, suspected culprit is usually;
 - Server side delay simly because of large mail data.
 - Mail scanning by mail scanner of Anti-virus software for outbound mail takes
   very long.
 - Mail server side(SMTP or IMAP) process simply takes long, if mail data is
   large.
 - Mail server side(SMTP or IMAP) process which checks received mail data takes
   long for attached data check.
Any of them can produce "long time to process mail data at server" and it easily causes timeout due to the "long time to process large mail data".
And, as I wrote in comment #9, composed mail sending consists of several steps.
Please check next before adding comment/complaint about phenomenon of "CPU went to 100% when sending message".
  - At which step of comment #9 your problem of "CPU 100%" occurs.

To bug opener and all "me too" posters:
If your problem occurs while generating large mail data, get Process Monitor log(with timestamp) for file I/O, if you are MS Win user.
If your problem occurs during large mail data sending to SMTP server, get SMTP log with timestamp.
If your problem occurs during large mail data sending to SMTP server, get SMTP log(NSPR log) with timestamp.
If your problem occurs during large mail data sending to IMAP server for sent mail copy, get IMAP log(NSPR log) with timestamp.

According to some comments, especially comment referes to VPN, problem of "CPU 100%" sounds to occur while mail data passing to SMTP server. However, it's still not so clear. Please provide data(evidence) which clearly states that "CPU 100%" occurs during large mail data sending to SMTP server, please.
(In reply to comment #22)
> suspected culprit is usually;
>  [...]

thunderbird.exe takes 100% CPU. Anti-virus is at 0-1% CPU, no idling for IP from mail server, but plain 100% CPU on thunderbird.exe. 

> Any of them can produce "long time to process mail data at server" and it
> [...]

Bug does not state "long time to process" but 100% CPU. 

> Please check next before adding comment/complaint about phenomenon of "CPU went
> to 100% when sending message".
>   - At which step of comment #9 your problem of "CPU 100%" occurs.

Ok, how to do that? How can I find out in which step TB hogs the CPU? Any debugging switch or log file available?

> To bug opener and all "me too" posters:
> If your problem occurs while generating large mail data, get Process Monitor
> log(with timestamp) for file I/O, if you are MS Win user.
> If your problem occurs during large mail data sending to SMTP server, get SMTP
> log with timestamp.
> If your problem occurs during large mail data sending to SMTP server, get SMTP
> log(NSPR log) with timestamp.
> If your problem occurs during large mail data sending to IMAP server for sent
> mail copy, get IMAP log(NSPR log) with timestamp.

How to get these IMAP, SMPT, NSPR logs? Please advice.
 
> According to some comments, especially comment referes to VPN, problem of "CPU
> 100%" sounds to occur while mail data passing to SMTP server. However, it's
> still not so clear. Please provide data(evidence) which clearly states that
> "CPU 100%" occurs during large mail data sending to SMTP server, please.

No VPN involved in my case, but I do suspect STMP phase. Please advice how to get these logs from TB.
(In reply to comment #23)
> Ok, how to do that? How can I find out in which step TB hogs the CPU?

Read comment #9, please.
(0) Disable "save sent mail copy to Sent" option.
(1) Compose a mail of large attachments, and "Send Later".
(2) Copy mail in Outbox of "Local Folders" to a local mail folder(say XX).
(3) Send mail by "Send Unsent messages". (no copy of sent mail is saced)
(4) Copy mail in XX to Sent folder you usually uses as sent folder.
Ordinal mail seding is roughly (1) + (3) + (4).  

> How to get these IMAP, SMPT, NSPR logs?

See next documents.
> https://wiki.mozilla.org/MailNews:Logging
> http://www.mozilla.org/projects/nspr/reference/html/prlog.html#25328
> SET NSPR_LOG_MODULES=timestamp,smtp:5,imap:5
As mail data is written in log, check log file content by yourself before open log data to public, if log analysis by developers will be required. As log data is large, never paste long data, please.

For CPU utilization data, "Performance Log(Counter Log)" of MS Win is available if you use MS Win.
> http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/nt_smlogcfg_files.mspx?mfr=true
> http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/nt_command_logman.mspx?mfr=true
If problem can be recreated repeatedly, reliably, get data with relatively large sampling interval first and check log. And, get data with small sampling interval(minimum 1 sec) for specific step.
Please never paste nor attach log, unless data check by yourself is completed and you remove irrelevant log lines in log data file.
(In reply to comment #23)
> thunderbird.exe takes 100% CPU. Anti-virus is at 0-1% CPU, no idling for IP
> from mail server, but plain 100% CPU on thunderbird.exe. 

Virus check upon file writing is executed under process of Thunderbird instead of AV software's main task for UI to control AV software's behaviour. Mail scanner of outgoing mail may be executed under Thunderbird task, with hoooking Thunderbird's activity. So, "Anti-virus is at 0-1% CPU" only can not be sufficient evidence that Anti-virus is irrelevant.
Hello to all,
I can confirm this bug on:
 Thunderbird 3.0.4 US English
 Windows XP SP3 English

I use:
 - clean non-admin (restricted) account under windows XP
 - new thunderbird profile (thunderbird never run before under this account)
 - my gmail.com account with SMTP server using SSL/TLS
 - no antivirus

Thunderbird doesn't copy sent mail to sent-mail folder.
Gmail does it automatically.

Thanks forward for repairing bug.

Vaclav

PS. I decrease priority of thunderbird process using procexp to make system responsive during sending large emails.
(In reply to comment #24)
ok, logging results:

started TB with:
SET NSPR_LOG_MODULES=timestamp,smtp:5,imap:5
thunderbird.exe -safe-mode

results:
High CPU usage in steps 3+4, 99.x% by thunderbird.exe
in step 3: procmon shows ReadFile access to file %TMP%\nsqmail.tmp
in step 4: procmon shows WriteFile access to log file set in NSPR_LOG_FILE so I've tried again without imap logging: now ReadFile access to nscpmsg.txt

Logfile excerpts:
2010-05-01 14:43:14.687000 UTC - 0[1824140]: SMTP entering state: 6
2010-05-01 14:43:14.687000 UTC - 0[1824140]: SMTP Send: DATA
2010-05-01 14:43:14.718000 UTC - 0[1824140]: SMTP entering state: 0
2010-05-01 14:43:14.718000 UTC - 0[1824140]: SMTP Response: 354 mail.gmx.net Go ahead {mp063}
2010-05-01 14:43:14.718000 UTC - 0[1824140]: SMTP entering state: 7
2010-05-01 14:43:14.718000 UTC - 0[1824140]: SMTP entering state: 8

[now CPU goes up, procmon shows ReadFile access to file %TMP%\nsqmail.tmp]

2010-05-01 14:45:10.906000 UTC - 0[1824140]: SMTP Send: .
2010-05-01 14:45:15.406000 UTC - 0[1824140]: SMTP entering state: 0
2010-05-01 14:45:15.406000 UTC - 0[1824140]: SMTP Response: 250 2.6.0 Message accepted {mp063}
2010-05-01 14:45:15.406000 UTC - 0[1824140]: SMTP entering state: 9
2010-05-01 14:45:15.406000 UTC - 0[1824140]: SMTP Send: QUIT

IMAP log is difficult to show because it seems to copy message payload.
Relevant steps as far as I can see:

2010-05-01 14:45:35.031000 UTC - 2336[585f440]: 48b5c00:imap.gmx.net:A:SendData: 9 append "Gesendet" (\Seen) "01-May-2010 16:42:39 +0200" {6188620}
2010-05-01 14:45:35.078000 UTC - 2336[585f440]: ReadNextLine [stream=680c0c8 nb=33 needmore=0]
2010-05-01 14:45:35.078000 UTC - 2336[585f440]: 48b5c00:imap.gmx.net:A:CreateNewLineFromSocket: + Ready to read 6188620 octets.
2010-05-01 14:45:35.078000 UTC - 2336[585f440]: 48b5c00:imap.gmx.net:A:SendData: 

[message payload follows]
[now CPU goes up, procmon shows ReadFile access to file %TMP%\nscpmsg.txt]

2010-05-01 14:47:29.578000 UTC - 2336[585f440]: --------------010004020308040603050400--
2010-05-01 14:47:30.250000 UTC - 2336[585f440]: 48b5c00:imap.gmx.net:A:SendData: 
2010-05-01 14:47:33.703000 UTC - 2336[585f440]: ReadNextLine [stream=680c0c8 nb=16 needmore=0]
2010-05-01 14:47:33.703000 UTC - 2336[585f440]: 48b5c00:imap.gmx.net:A:CreateNewLineFromSocket: 9 OK Completed
2010-05-01 14:47:51.921000 UTC - 2580[4782040]: 48b1c00:imap.gmx.net:S-INBOX:SendData: DONE

Seems CPU load occurs during sending mail or reading temp file from disk.

SMTP+IMAP connection security was set to SSL/TLS.

If I set SMTP connection security to "None", CPU load is ok (<5%), for STARTTLS CPU also goes up.

So to me it seems CPU load is connected to message payload encryption.
(In reply to comment #27)
> SMTP+IMAP connection security was set to SSL/TLS.
> If I set SMTP connection security to "None", CPU load is ok (<5%),
> for STARTTLS CPU also goes up.
> So to me it seems CPU load is connected to message payload encryption.

Nice analysis.
Similar issue to Bug 559676?
> Bug 559676 nsSocketTRansportServuce::DoPollIteration hogging the CPU
> while uploading a large file to an SSL web server

Can you do quick check of NSPR log with nsSocketTransport:5?
> https://developer.mozilla.org/en/HTTP_Logging
Note: Log file becomes huge, and log analysis is hard for people other professional of TCP/IP, Socket, Necko implementation and/or Security. Please don't attach native log file unless requested by professonals.
(In reply to comment #28)
> Similar issue to Bug 559676?
> > Bug 559676 nsSocketTRansportServuce::DoPollIteration hogging the CPU
> > while uploading a large file to an SSL web server
> 
> Can you do quick check of NSPR log with nsSocketTransport:5?

I've got the log file (>1GB for 4MB attachment), but what now? 

Sending file seems to consist of repeating blocks of something like
[...]
1136[1824640]: nsSocketInputStream::Read [this=5464d18 count=4096]
1136[1824640]:   calling PR_Read [count=4096]
1136[1824640]:   PR_Read returned [n=-1]
1136[1824640]: nsSocketInputStream::AsyncWait [this=5464d18]
1136[1824640]: STS poll iter [0]
1136[1824640]:   active [2] { handler=5464c80 condition=0 pollflags=7 }
1136[1824640]:   active [1] { handler=37bfbe0 condition=0 pollflags=5 }
1136[1824640]:   active [0] { handler=5464ac0 condition=0 pollflags=5 }
1136[1824640]:   idle [1] { handler=37bff60 condition=0 pollflags=0 }
1136[1824640]:   idle [0] { handler=37c09e0 condition=0 pollflags=0 }
1136[1824640]:   calling PR_Poll [active=3 idle=2]
1136[1824640]: poll timeout: 100
1136[1824640]:     timeout = 0 milliseconds
1136[1824640]:     ...returned after 0 milliseconds
1136[1824640]: STS poll iter [0]
1136[1824640]:   active [2] { handler=5464c80 condition=0 pollflags=7 }
1136[1824640]:   active [1] { handler=37bfbe0 condition=0 pollflags=5 }
1136[1824640]:   active [0] { handler=5464ac0 condition=0 pollflags=5 }
1136[1824640]:   idle [1] { handler=37bff60 condition=0 pollflags=0 }
1136[1824640]:   idle [0] { handler=37c09e0 condition=0 pollflags=0 }
1136[1824640]:   calling PR_Poll [active=3 idle=2]
1136[1824640]: poll timeout: 100
1136[1824640]:     timeout = 0 milliseconds
1136[1824640]:     ...returned after 0 milliseconds
1136[1824640]: STS poll iter [1]
1136[1824640]:   active [2] { handler=5464c80 condition=0 pollflags=7 }
1136[1824640]:   active [1] { handler=37bfbe0 condition=0 pollflags=5 }
1136[1824640]:   active [0] { handler=5464ac0 condition=0 pollflags=5 }
1136[1824640]:   idle [1] { handler=37bff60 condition=0 pollflags=0 }
1136[1824640]:   idle [0] { handler=37c09e0 condition=0 pollflags=0 }
1136[1824640]:   calling PR_Poll [active=3 idle=2]
1136[1824640]: poll timeout: 100
1136[1824640]:     timeout = 100000 milliseconds
1136[1824640]:     ...returned after 0 milliseconds
1136[1824640]: STS poll iter [1]
1136[1824640]:   active [2] { handler=5464c80 condition=0 pollflags=7 }
1136[1824640]:   active [1] { handler=37bfbe0 condition=0 pollflags=5 }
1136[1824640]:   active [0] { handler=5464ac0 condition=0 pollflags=5 }
1136[1824640]:   idle [1] { handler=37bff60 condition=0 pollflags=0 }
1136[1824640]:   idle [0] { handler=37c09e0 condition=0 pollflags=0 }
1136[1824640]:   calling PR_Poll [active=3 idle=2]
1136[1824640]: poll timeout: 100
1136[1824640]:     timeout = 100000 milliseconds
1136[1824640]:     ...returned after 0 milliseconds
1136[1824640]: nsSocketTransport::OnSocketReady [this=5464c80 outFlags=3]
1136[1824640]: nsSocketOutputStream::OnSocketReady [this=5464d38 cond=0]
1136[1824640]: STS dispatch [389b348]
1136[1824640]: nsSocketInputStream::OnSocketReady [this=5464d18 cond=0]
[... more to follow until it repeats...]

Any tool at hand to analyse the logfile? Shall I extract some parts, zip and upload?
(In reply to comment #29)

As we are suspecting "DoPollIteration hogging the CPU", timestamp is required to see poling is executed with reasonable time interval or poling is executed as if loop.  
Can you check Socket log with timestamp during above three minutes?

You use MS Win. So you can get/check NSPR log for specific step only(by Start/Stop Capture) and discard huge log data immediately after ad-hoc checking(by Clear display) with DebugView of Sysinternal. See bug 402793 comment #6 for it. As DebugView adds timestamp(any of time and time-offset), remove "timestamp" parameter if you use DebugView for getting/checking NSPR log. 

> Any tool at hand to analyse the logfile?

I don't know such tools. I usually create simple Rexx or PHP script to remove irrelevant log lines after rough manual checking of characteristics of required log lines and characteristics of irrelevant log lines. And, repeat removing of irrelevant lines until manual checking of extracted data becomes possible.
(In reply to comment #30) 
> As we are suspecting "DoPollIteration hogging the CPU", timestamp is required
> to see poling is executed with reasonable time interval or poling is executed
> as if loop.  
> Can you check Socket log with timestamp during above three minutes?

ok, got log with DebugView and timestamps. 
SET NSPR_LOG_MODULES=nsSocketTransport:5
SET NSPR_LOG_FILE=WinDebug
thunderbird.exe -safe-mode

But what shall I check for, I can't find "DoPollIteration" in log? Should I upload log from DebugView (~1MB compressed).

> > Any tool at hand to analyse the logfile?
> 
> I don't know such tools. I usually create simple Rexx or PHP script to remove
> irrelevant log lines after rough manual checking of characteristics of required
> log lines and characteristics of irrelevant log lines. And, repeat removing of

Please give definition of "relevant" for our case, see above question: what should I look for (or upload log file)?
(In reply to comment #32)
> I can't find "DoPollIteration" in log?

I think "STS poll iter [0]" corresponds to it.	

Following is a set of log for sending single data segment. Simply send data segment, and waits for ACK. (As log line number is added by DebugView, you can know that no line is removed when I pasted.)
> 00147887	8:27:33.744	[268] 2464[182c3c0]: nsSocketOutputStream::Write [this=4a82e58 count=2999]	
> 00147888	8:27:33.744	[268] 2464[182c3c0]:   calling PR_Write [count=2999]	
> 00147889	8:27:33.744	[268] 2464[182c3c0]:   PR_Write returned [n=-1]	
> 00147890	8:27:33.744	[268] 2464[182c3c0]: nsSocketOutputStream::AsyncWait [this=4a82e58]	
> 00147891	8:27:33.744	[268] 2464[182c3c0]: STS poll iter [0]	
> 00147892	8:27:33.744	[268] 2464[182c3c0]:   active [1] { handler=4a82da0 condition=0 pollflags=6 }	
> 00147893	8:27:33.744	[268] 2464[182c3c0]:   active [0] { handler=1838200 condition=0 pollflags=5 }	
> 00147894	8:27:33.744	[268] 2464[182c3c0]:   calling PR_Poll [active=2 idle=0]	
> 00147895	8:27:33.744	[268] 2464[182c3c0]: poll timeout: 20	
> 00147896	8:27:33.744	[268] 2464[182c3c0]:     timeout = 0 milliseconds	
> 00147897	8:27:33.744	[268] 2464[182c3c0]:     ...returned after 0 milliseconds	
> 00147898	8:27:33.744	[268] 2464[182c3c0]: nsSocketTransport::OnSocketReady [this=4a82da0 outFlags=2]	
> 00147899	8:27:33.744	[268] 2464[182c3c0]: nsSocketOutputStream::OnSocketReady [this=4a82e58 cond=0]	
> 00147900	8:27:33.744	[268] 2464[182c3c0]: STS dispatch [3a72fc8]	
> 00147901	8:27:33.744	[268] 2464[182c3c0]: nsSocketInputStream::Read [this=4a82e38 count=4096]	
> 00147902	8:27:33.744	[268] 2464[182c3c0]:   calling PR_Read [count=4096]	
> 00147903	8:27:33.744	[268] 2464[182c3c0]:   PR_Read returned [n=-1]	
> 00147904	8:27:33.744	[268] 2464[182c3c0]: nsSocketInputStream::AsyncWait [this=4a82e38]	
> 00147905	8:27:33.744	[268] 2464[182c3c0]: STS poll iter [0]	
> 00147906	8:27:33.744	[268] 2464[182c3c0]:   active [1] { handler=4a82da0 condition=0 pollflags=5 }	
> 00147907	8:27:33.744	[268] 2464[182c3c0]:   active [0] { handler=1838200 condition=0 pollflags=5 }	
> 00147908	8:27:33.744	[268] 2464[182c3c0]:   calling PR_Poll [active=2 idle=0]	
> 00147909	8:27:33.744	[268] 2464[182c3c0]: poll timeout: 20	
> 00147910	8:27:33.744	[268] 2464[182c3c0]:     timeout = 0 milliseconds	
> 00147911	8:27:33.744	[268] 2464[182c3c0]:     ...returned after 0 milliseconds	
> 00147912	8:27:33.744	[268] 2464[182c3c0]: nsSocketTransport::OnSocketReady [this=4a82da0 outFlags=1]	
> 00147913	8:27:33.744	[268] 2464[182c3c0]: nsSocketInputStream::OnSocketReady [this=4a82e38 cond=0]	
> 00147914	8:27:33.744	[268] 2464[182c3c0]: STS dispatch [3a72f88]
> (next nsSocketOutputStream::Write [this=4a82e58 count=2999] for next data segment)

I couldn't see "flood of DoPollIteration" nor "loop like DoPollIteration".
Log I attached looks for me very normal flow. I can't find something wrong at IMAP flow level and Socket flow level. (Sorry but I didn't get CPU usage data while uploading to Gmail IMAP which uses SSL) 

Can you see something wrong or funny in your log?

> Please give definition of "relevant" for our case, (snip)

There is not clear definition. "relevant" is merely "not apparently irrelevant" at first analysis step.

When imap:5,nsSocketTransport:5, IMAP log for other IMAP server or other IMAP folder is an example of "apparently irrelevant". Many polling logs while no activity after end of mail upload comletion is also an example of "apparently irrelevant". IMAP log lines for the server and the folder in question is example of "apparently relevant". 
Set of nsSocketTransport logs for "the server and the folder in question" is "relevant". Set of nsSocketTransport logs for other than "the server and the folder in question" is "irrelevant". But it's very very hard to distinguish them manually. And, if logs for different folder or different server is also included, it's very tough work to determine "a set of logs for single send of a data segment to the folder".
(In reply to comment #33)
> Following is a set of log for sending single data segment. Simply send data
> segment, and waits for ACK. 

corresponding part from my log:

00002082	8.08654213	[832] 2636[1824640]: nsSocketOutputStream::Write [this=37c08d8 count=4096]	
00002083	8.08657455	[832] 2636[1824640]:   calling PR_Write [count=4096]	
00002084	8.08661175	[832] 2636[1824640]:   PR_Write returned [n=-1]	
00002085	8.08664322	[832] 2636[1824640]: nsSocketOutputStream::AsyncWait [this=37c08d8]	
00002086	8.08667850	[832] 2636[1824640]: STS poll iter [0]	
00002087	8.08671093	[832] 2636[1824640]:   active [2] { handler=37c0820 condition=0 pollflags=6 }	
00002088	8.08675194	[832] 2636[1824640]:   active [1] { handler=37bfb00 condition=0 pollflags=5 }	
00002089	8.08678436	[832] 2636[1824640]:   active [0] { handler=545ed00 condition=0 pollflags=5 }	
00002090	8.08681583	[832] 2636[1824640]:   calling PR_Poll [active=3 idle=0]	
00002091	8.08684635	[832] 2636[1824640]: poll timeout: 100	
00002092	8.08687782	[832] 2636[1824640]:     timeout = 0 milliseconds	
00002093	8.08692646	[832] 2636[1824640]:     ...returned after 0 milliseconds	
00002094	8.08695984	[832] 2636[1824640]: nsSocketInputStream::Read [this=37c08b8 count=4096]	
00002095	8.08699131	[832] 2636[1824640]:   calling PR_Read [count=4096]	
00002096	8.08707428	[832] 2636[1824640]:   PR_Read returned [n=-1]	
00002097	8.08709908	[832] 2636[1824640]: nsSocketInputStream::AsyncWait [this=37c08b8]	
00002098	8.08713531	[832] 2636[1824640]: STS poll iter [0]	
00002099	8.08716774	[832] 2636[1824640]:   active [2] { handler=37c0820 condition=0 pollflags=7 }	
00002100	8.08720016	[832] 2636[1824640]:   active [1] { handler=37bfb00 condition=0 pollflags=5 }	
00002101	8.08723164	[832] 2636[1824640]:   active [0] { handler=545ed00 condition=0 pollflags=5 }	
00002102	8.08726311	[832] 2636[1824640]:   calling PR_Poll [active=3 idle=0]	
00002103	8.08729458	[832] 2636[1824640]: poll timeout: 100	
00002104	8.08732510	[832] 2636[1824640]:     timeout = 0 milliseconds	
00002105	8.08736038	[832] 2636[1824640]:     ...returned after 0 milliseconds	
00002106	8.08739185	[832] 2636[1824640]: nsSocketTransport::OnSocketReady [this=37c0820 outFlags=1]	
00002107	8.08742428	[832] 2636[1824640]: nsSocketInputStream::OnSocketReady [this=37c08b8 cond=0]	
00002108	8.08745575	[832] 2636[1824640]: STS dispatch [38cc708]	


> I couldn't see "flood of DoPollIteration" nor "loop like DoPollIteration".
> Log I attached looks for me very normal flow. I can't find something wrong at
> IMAP flow level and Socket flow level. (Sorry but I didn't get CPU usage data
> while uploading to Gmail IMAP which uses SSL) 

When I did the test with logging to DebugView, CPU usage was more or less normal, but sending took about 2 times longer. Maybe CPU cycles just got spread over 4 minutes instead of 2 minutes due to logging overhead.

> Can you see something wrong or funny in your log?

Nothing obvious, but I don't have anything to compare to. Step "calling PR_Read" seems to take longest timespan.
 
Shall I upload compressed log for further analysis?
(In reply to comment #34)
> Shall I upload compressed log for further analysis?

As we, you and me, are not professional of networking, we probably can't do correct and sufficient log analysis. If you can see obvious difference from my log data in your log, and if it's relevant to your problem, and if you believe the defference is evidence of this bug, attach your log file data for specific part to compare my log data and your log data, with attaching CPU utilization data corresponds to your log which is evidence of this bug, please.

By the way, CPU 100% and large virtual/real memory use is evidence of efficient code in some cases.
 (a) File write of several Giga bytes using 4KB write buffer
 (b) File write of several Giga bytes using 1MB write buffer
 (c) File write of several Giga bytes using 8MB write buffer
CPU utilization and virtual/real memory usage : (a) < (b) < (c)
Time to take to write several Giga bytes data : (a) > (b) > (c)
If CPU utilization of 100% in your case is evidence of efficient use of CPU and large virtual/real memory, problem is "no responsiveness of Tb while such efficient or too eficient file I/O".
ok, (In reply to comment #35)
> As we, you and me, are not professional of networking, [...]

ok, so is any "professional of networking" taking care of this or am I just wasting my time here?

I'm a little bit tired of putting tons of evidence together and this bug still sits at status unconfirmed...
(In reply to comment #36)
> this bug still sits at status unconfirmed...

UNCONFIRMED never mean your problem doesn't exist. I couldn't reproduce "CPU 100% and unresponsive while the CPU 100%" yet, so I can't confirm. 
By your great effort, phenomenon became clearer than initial, and by your check with socket log, TCP/IP level problem seems to have been successfully ruled out. As you suspect, I also suspect that CPU load is connected to message payload encryption. And, bug 564366 reported on 2010-05-06 sounds same problem as yours.
I'll try to find what kind of data is required for diagnosis and what kind of tools can be used for it, and wait for participation of professionals.
To bug opnener and all problem reporters:

Do you enable mail scanner for outgoing mail of anti-virus software?
Is there any difference between next 2 cases?
  - 100 attachments, 100 KB/file
  -   1 attachment,   10 MB/file
(In reply to comment #38)
> Do you enable mail scanner for outgoing mail of anti-virus software?

AV has been disabled during test, makes no difference.

> Is there any difference between next 2 cases?

No.
Following is NSS bug which has "perf" in keyword field.
> bug 196880 Excessive malloc() calls
> bug 265620 Use buffer to reduce C_DigestUpdate calls to Cryptoki during SSL handshake
> bug 266948 streaming bi-directional test mode for selfserv and strsclnt
> bug 283763 memsets in libssl affect performance ; SSL buffer also starts too small
> bug 285525 Inefficiency in SSL model socket feature for servers

GWu, is required CPU power O(data_size)? Or higer than linear like O(data_size**2)? Rough your external observation is sufficent.

Total of required CPU power to pass mail data to server :
  ( required CPU steps per a segment of data ) * ( number of segments in data )
is nearly equall to
  ( CPU utilization per sec ) * ( seconds required to pass data to server )
and, total seconds required to pass data to server is ordinally O(data_size) or O(number_of_IP_packets), because rate-determining step is "passing data to server in IP packet format via network" instead of pre/post process of IP packet sending.
Is "total seconds required to pass whole mail data to SMTP" different very much in next two cases? Nearly same? Or different but not so big difference?
  - SMTP without SSL/TLS or StartTLS
  - SMTP with    SSL/TLS or StartTLS
GWu, there is no need to answer questions in Comment #40 any more. 
This bug is probably a part of problem of Bug 559676, and perhaps cause is same.
Depends on: 559676
Confirming, per Bug 559676 Comment #5 and Bug 559676 Comment #10.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Severity: normal → major
Component: Message Compose Window → Networking
OS: Windows XP → All
Product: Thunderbird → MailNews Core
QA Contact: message-compose → networking
Hardware: x86 → All
Version: 3.0 → 1.9.0 Branch
Summary: CPU went to 100% when sending message. → CPU went to 100% while sending message or while uploading messeage via TLS/SSL or StartTLS connection.
Copy of Bug 559676 Comment #2.

> http://mxr.mozilla.org/mozilla-central/source/security/nss/lib/ssl/ssl3con.c#2313
> 2313 ssl3_SendApplicationData(sslSocket *ss, const unsigned char *in,
> 2314                          PRInt32 len, PRInt32 flags)
>(snip)
> 2342     while (len > totalSent) {
> 2343         PRInt32   sent, toSend;
> 2344 
> 2345         if (totalSent > 0) {
> 2346             /*
> 2347              * The thread yield is intended to give the reader thread a
> 2348              * chance to get some cycles while the writer thread is in
> 2349              * the middle of a large application data write.  (See
> 2350              * Bugzilla bug 127740, comment #1.)
> 2351              */
> 2352             ssl_ReleaseXmitBufLock(ss);
> 2353             PR_Sleep(PR_INTERVAL_NO_WAIT);      /* PR_Yield(); */
> 2354             ssl_GetXmitBufLock(ss);
> 2355         }
> 2356         toSend = PR_MIN(len - totalSent, MAX_FRAGMENT_LENGTH);
> 2357         sent = ssl3_SendRecord(ss, content_application_data, 
> 2358                                in + totalSent, toSend, flags);
>(snip) 
> 2373     }

Set of ssl_ReleaseXmitBufLock, PR_Sleep(PR_INTERVAL_NO_WAIT), and ssl_GetXmitBufLock is executed for each data fragment of MAX_FRAGMENT_LENGTH.
The unlock/lock upon each send of a data_fragment(4K?) can efficiently consume whole CPU power, because network is vary fast than before and because OS code, network code, NSPR code, ..., are efficient than before.
ssl_ReleaseXmitBufLock/PR_Sleep(PR_INTERVAL_NO_WAIT)/ssl_GetXmitBufLock(ss) looks irrelevant to CPU 100%, CPU hog, unresponsiveness of Thunderbird and other application.

According to bug 559676 #19, following part seems relevant to problem.
> http://mxr.mozilla.org/mozilla1.9.2/source/netwerk/base/src/nsSocketTransportService2.cpp#524
> See lines 538-539.
And following code is seen source.
> 535 DoPollIteration(PR_FALSE);
> 538 while (mayWait && NS_SUCCEEDED(thread->HasPendingEvents(&val)) && !val)
> 539 DoPollIteration(PR_TRUE);
It looks;
  #535 corresponds to STS poll iter [0]   
  #539 corresponds to STS poll iter [1]
In log file I attached, I can see only two separated "STS poll iter [1]". But there is two concecutive "STS poll iter [1]" in log you pasted.
Are there many "STS poll iter [1]" in your socket log?

Following comment is seen in source. Do you use proxy for SMTP server or IMAP server? 
> 528 // (snip) This case
> 529 // should only happen when someone (e.g., PSM) is issuing a synchronous
530     // proxy call from this thread to the main thread.
Possibly similar issue to bug 541367 comment #17 or bug 566548 comment #10.
  bug 541367 comment #17 :
    No ACK for Tb sooner or later, then session lost sooner or later. 
    No ACK for Tb was found router's bug.
  bug 566548 comment #10 :
    No ACK for Tb started to occur after a while. After that, retransmission
    and no ACK continued for 60 seconds, then session was killed
    after 60 seconds based on max retry count or max time of no ACK.
  This bug seems :
    Retransmission happens frequently due to no ACK, but retransmission
    is successfull, then CPU 100% happens due to retransmission, 
    but session is not killed because retransmission is successful.
Bug 566548 comment #10 is socket log of "send error and retry for 60 seconds, then timeout at TCP/IP level".
Is similar socket log like next seen in your log at many places?
  write 4096 bytes => OK, write 4096 bytes => error, write 4096 bytes(retry)
  => error, retry/error continues for a while, write 4096 bytes(retry) => OK
Following is a log of "write 4096 bytes => error" in your data.
> 00002082    8.08654213    [832] 2636[1824640]: nsSocketOutputStream::Write [this=37c08d8 count=4096]    
> 00002083    8.08657455    [832] 2636[1824640]:   calling PR_Write [count=4096]  
> 00002084    8.08661175    [832] 2636[1824640]:   PR_Write returned [n=-1]
xref to bug 566548, for ease of tracking, ease of problem analysis.
Depends on: 566548
I could successfully reduce data size that we have to see manually by simple script of; 
(1) Extract "calling PR_Write [count=xxxx]" / "PR_Write returned [n=yyyy] or [-zzzz]" in order to see network write operation only.
(2) Merge repeatedly appear concecutive log lines of (1), by simple/quick log viewing, and several tries to reduce log size that we have to view.
For example, 12MB original socket log data was reduced to 500KB by above (1)/(2) of simple script. It makes log analysis like bug 566548 comment #10 easier. I can do log analysis like bug 566548 comment #10 easier than before.

GWu, can you attach your log to this bug?
(old log is sufficient, because of first/quick log analysis before asking help to professionals of TCP/IP).
GWu, can you check with next Tb's prefs value? (Read thru bug 566548, please)
> network.tcp.sendbuffer=65536 or 32768 (default is 131072)
Tested "network.tcp.sendbuffer=32768", but it has not solved the issue for me - see Comment 15 above.
in version 3.0.5 the problem was gone for me.
In version 3.1 this appeared again, but now not thunderbird eats all 100%, thunderbird eats 60% system eats 40%.

Can somebody confirm this?

As workaround the avast antivirus free edition can take over handling SSL connection, and then thunderbird uses only 1 % in the same situation and avast max 2% so a big difference...
(In reply to comment #52)
Avast Free Antivirus version 5 (version 4 does not have this feature)
I can confirm that on my system the CPU load problem persists with version 3.1 and the 60%/40% ratio for thunderbird and system load is similar to what I am seeing.

Whether the problem was gone in 3.0.5 I am not sure: it is easy to miss when a problem disappears...
(In reply to comment #50)
Tested "network.tcp.sendbuffer=32768" in TB 3.1, but it has not solved the issue for me, too. bug 566548 is about a hang/time out, which isn't exactly the problem here.

I did some other test: sending messages with and without SSL takes about the same time (+/- 2%), just sending with SSL hogs the CPU completely.
(In reply to comment #55)
> I did some other test: sending messages with and without SSL takes about the
> same time (+/- 2%), just sending with SSL hogs the CPU completely.

At network level, sending of data looks to be executed samely for both non-SSL connection and SSL connection, and network level issue such as delay/re-transmission is not problem.
Do you enable "mail scanner for outbound mail" of anti-virus software? If yes, port scan type? Or SMTP proxy type?
On the computer in question Symmantec Endpoint Protection is running. However, on its configuration page it says explicitly:

Note: Messages sent over encrypted connections cannot be scanned.
(In reply to comment #57)
> on its configuration page it says explicitly:
> Note: Messages sent over encrypted connections cannot be scanned.

"Cannot be scanned" does not always mean "nothing is done even if mail scaner is enabled for SSL port or connection". Please confirm that "mail scanner for outboud mail" is never enabled for the SMTP port or connection in your environment, to surely rule out "interfere by mail scaner" case.
(In reply to comment #56)

By Avast it is a new feature, not seen by other avirus software.
It detects the encrypted connection, makes an entry in the config, after that the connection must be set from smpts/imaps/pop3s to non encrypted and standard port (smtp/imap/pop3) in thunderbird, and than avast doing the ssl tunneling. 
It also cannot scan the connection if the SSL is done by TB.

But with this it can be verified, that this is a TB SSL issue.

Note: for testing use a very slow connection to send messages! ( eg. 128 kbit)
(In reply to comment #59)
> But with this it can be verified, that this is a TB SSL issue.

Thanks for checking. (I upgrade to avast! 5 today after reading your comment :-)
 
> Note: for testing use a very slow connection to send messages! ( eg. 128 kbit)

I'm suspecting slow line speed relevant isuse too.
  - Send completion of a buffer doesn't occur shortly because of slow speed.
    Send completion check for each buffer is executed too quickly/feequently.  
  - Free TCP buffer is not available for long time because of slow speed.
    Retry due to no free TCP buffer is executed too quickly, too freequently.
See data attached to bug 577545, please.

If such problem, very small network.tcp.sendbuffer may be a workaround.
What will happen with network.tcp.sendbuffer=4096 or 8192?
(Restart Tb after setting change to avoid unwanted peoblem, please)
(In reply to comment #56)
> At network level, sending of data looks to be executed samely for both non-SSL
> connection and SSL

Well, something is definetely different, on whatever level. I can reproduce this consistently: with and without SSL takes about the same time, sending with SSL hogs the CPU. 

> Do you enable "mail scanner for outbound mail" of anti-virus software? If yes,
> port scan type? Or SMTP proxy type?

No AV on outbound mail. I even disabled AV for these tests, but it makes no difference.

By the way, this does not only account for SMTP but also for IMAP "sending", if SSL is enabled. Copying the sent message to IMAP "Sent folder" hogs CPU with SSL and behaves normally without SSL. 

Strange thing is: getting messages via IMAP behaves totally normal, independent of SSL setting.
I wanted to say "same SMTP send time of non-SSL and SSL for same data size"=="same transmission speed of IP packets". i.e. At TCP/IP level, there is no difference between non-SSL case and SSL case in your environment.

From Socket and TCP/IP view, SMTP, IMAP upload, HTTP upload, are same;
  Server is data receiver(decides TCP Window size),  Client(Tb) is data sender.
From Socket and TCP/IP view, POP3, IMAP download, HTTP GET, are same;
  Client(Tb) is data receiver(decides TCP Window size), Server is data sender.
As Dynamic TCP Window Scaling of client(as receiver) is enabled by default from Win Vista, download is usually not affected by Dynamic TCP Window Scaling.
As IMAP can produce logs for other folders via other cached connections while getting log for uploaing, and imap:5 produces log for fetched/appended mail data, it's hard to analyze log or trace data when IMAP. So, I focused on simple SMTP case first.

We are checking difference of SMTP sending to same SMTP server(Gmail's SMTP, SSL is used) between PC of opener of bug 577545(says CPU 100% occurs) and my PC(can't observe high CPU utilization). See bug 577545, please.
Can you check with Gmail's SMTP(Gmail's SMTP requests SSL/TLS)?
Can you see CPU 100% with Gmail's SMTP?
Can you see CPU 100% with network.tcp.sendbuffer=4096?

Following is time to take send mail of same size to Gmail's SMTP in my environment.
(1) network.tcp.sendbuffer=4096, 50 seconds.
> 00000467 15:24:50.247 [1240] 0[192c140]: SMTP Send: MAIL FROM:<yatter.one@gmail.com> SIZE=1436482  
> 00000759 15:24:50.981 [1240] 0[192c140]: SMTP Response: 354  Go ahead 23sm6052832wfa.22 
> 00037899 15:25:39.997 [1240] 0[192c140]: SMTP Response: 250 2.0.0 OK 1279520740 23sm6052832wfa.22 
(2) network.tcp.sendbuffer=65536, 11 seconds.
> 00000471 15:45:31.544 [528] 0[192c140]: SMTP Send: MAIL FROM:<yatter.one@gmail.com> SIZE=1436482  
> 00000763 15:45:32.544 [528] 0[192c140]: SMTP Response: 354  Go ahead b1sm16646840rvn.2 
> 00015627 15:45:43.684 [528] 0[192c140]: SMTP Response: 250 2.0.0 OK 1279521943 b1sm16646840rvn.2 
(3) network.tcp.sendbuffer=131072, 11 seconds.
> 00000464 15:47:14.825 [880] 0[192c140]: SMTP Send: MAIL FROM:<yatter.one@gmail.com> SIZE=1436482  
> 00000756 15:47:15.528 [880] 0[192c140]: SMTP Response: 354  Go ahead r27sm16646506rvq.9 
> 00013033 15:47:24.215 [880] 0[192c140]: SMTP Response: 250 2.0.0 OK 1279522044 r27sm16646506rvq.9 
(4) network.tcp.sendbuffer=262144, 8 seconds.
> 00000474 15:49:59.747 [1908] 0[192c140]: SMTP Send: MAIL FROM:<yatter.one@gmail.com> SIZE=1436482  
> 00000737 15:50:00.497 [1908] 0[192c140]: SMTP Response: 354  Go ahead 23sm6074710wfa.22 
> 00011540 15:50:08.575 [1908] 0[192c140]: SMTP Response: 250 2.0.0 OK 1279522208 23sm6074710wfa.22 
(5) network.tcp.sendbuffer=524288, 7 seconds.
> 00000458 15:52:49.559 [2316] 0[192c140]: SMTP Send: MAIL FROM:<yatter.one@gmail.com> SIZE=1436482  
> 00000750 15:52:50.247 [2316] 0[192c140]: SMTP Response: 354  Go ahead y16sm6078200wff.14 
> 00010577 15:52:57.450 [2316] 0[192c140]: SMTP Response: 250 2.0.0 OK 1279522377 y16sm6078200wff.14

Gmail's SMTP probably uses TCP Window Size=128KB. So, if network.tcp.sendbuffer<=128KB, TCP/IP probably can't fully utilize 128KB TCP Window Size, and far larger network.tcp.sendbuffer than 128KB doesn't increase transmission speed.
As I read through this bugreport, I become afraid of misunderstanding. 100% CPU usage is present only when you try to send an attachment (DOC, PDF, XLS etc.) Mails without attachments aren't affected by this CPU hogging at all!

>Can you check with Gmail's SMTP(Gmail's SMTP requests SSL/TLS)?
-gmail SSL/TLS hogs the CPU 100%

>Can you see CPU 100% with Gma il's SMTP?
YES of course, I see it with all smtp servers I try (100% CPU with SSL/TLS, no hogging without encryption)

>Can you see CPU 100% with network.tcp.sendbuffer=4096?
-doesn't resolve the problem, nothing changed after applying (and after restart of TB) also 32768 setting has no effect :-(

Tested on brand new TB 3.1.1...

I couldn't believe you don't see the CPU hogging, because in our company it occurs on every machine without any exception regardless the anti-virus is running or not. Maybe your PC is very powerful so you can see this so clearly, try to use it on the slowest machine etc.

On my home computer is the CPU hogging 60% of core1 and 0% of core2. Because of this the system is not so unresponsive as on single core processor couple of years older is. But at this dualcore machine the hogging is still observed, system responsiveness is slightly decreased anyway.

Mbiba
-of course hogging is also observed when copying message to send folder.

On the other hand manually copying the messages between different IMAP folders doesn't hog the CPU...

Mbiba
(In reply to comment #63)
> >Can you see CPU 100% with Gma il's SMTP?
> YES of course, I see it with all smtp servers I try (100% CPU with SSL/TLS, no
> hogging without encryption)
> >Can you see CPU 100% with network.tcp.sendbuffer=4096?
> -doesn't resolve the problem, nothing changed after applying (and after restart of TB)

In my environment, as previously wrote, it merely took 50 seconds to send 1.5MD mail(text mail with a PDF) to Gmai's SMTP with network.tcp.sendbuffer=4096, and CPU utilizaion was several %(value shown by Task Manager).
I use Core2 Duo, T5500@1.66GHz, 1GB Note PC, MS Win-XP SP3. Not slow, but no so fast nowadays. Big difference of my PC is probably line speed. PC <-100Mbps LAN-> Router Type Hub/VDSL modem <-70Mbps-> distributer <-1Gbps Fiber cable-> Provider.

Can you get SMTP+Socket log for sending to Gmail's SMTP with network.tcp.sendbuffer=4096? There is no need of log for full sendig. After CPU hog started, kill Tb, save log, please.
> SET NSPR_LOG_MODULES=timestamp,smtp:5,nsSocketTransport:5,nsHostResolver:5
If sendbuffer=4096, Tb(SSL/Socket) & TCP/IP repeats next.
 (a) Tb passes 4KB to TCP, (b) Tb waits or retries till send completion of the
 4KB data because of sendbuffer=4096, (c) TCP/IP sends packests for the 4KB.
And, (d) Tb(SSL/Socket) continuously executes "Receive" operation at same time, if SSL is used(SSL looks to support both send and receive at same time on a connection or a socket).
If (c) takes long, and if (b) is retry type with small sleep time, high CPU consumtion by (b) may occur, even with sendbuffer=4096. Contension between (b) and (d) may produce high CPU usage.
>Big difference of my PC is probably line speed. PC <-100Mbps
> LAN-> Router Type Hub/VDSL modem <-70Mbps-> distributer <-1Gbps Fiber cable->
> Provider.

Yes at home, I have also 100 MBit down and up and as I wrote I observed the little hogging of core1 60%. I know, when the upload of the connection is about 80 MB/s this bug is not so devastating. But in many cases (like in work etc. the upload is limited to something like 256-512 kbps and that makes TB totally unusable :-( Colleague have to use Windows Live Mail or so... I am also using TB at home only because of this unusability of "slow" upload...


> 
> Can you get SMTP+Socket log for sending to Gmail's SMTP with
> network.tcp.sendbuffer=4096? There is no need of log for full sendig. After CPU
> hog started, kill Tb, save log, please.
> > SET NSPR_LOG_MODULES=timestamp,smtp:5,nsSocketTransport:5,nsHostResolver:5
> If sendbuffer=4096, Tb(SSL/Socket) & TCP/IP repeats next.
>  (a) Tb passes 4KB to TCP, (b) Tb waits or retries till send completion of the
>  4KB data because of sendbuffer=4096, (c) TCP/IP sends packests for the 4KB.
> And, (d) Tb(SSL/Socket) continuously executes "Receive" operation at same time,
> if SSL is used(SSL looks to support both send and receive at same time on a
> connection or a socket).
> If (c) takes long, and if (b) is retry type with small sleep time, high CPU
> consumtion by (b) may occur, even with sendbuffer=4096. Contension between (b)
> and (d) may produce high CPU usage.

I can do this but unfortunatelly don't know how. Can you give me a little help how to prepare the log for you, which program to use for it etc.? Could the Ethereal or something similar be used for this?

Mbiba
(In reply to comment #66)
> I can do this but unfortunatelly don't know how.

Documents for "How to get NSPR log" is already pointed by Comment #24. Which steps in the documents to get next log data are difficult for you?
> SET NSPR_LOG_MODULES=timestamp,smtp:5,nsSocketTransport:5,nsHostResolver:5
Have you read thru this bug and other bugs pointed in this bug?
I have the same issue and am happy to provide more information a novice as myself is able to provide, if it is deemed useful, let me know.

Win XP
TB 3.1.3
Dell M90 (2 CPUs)
IMAP set up for gmail access

Sending emails sends TB up to ~50% CPU usage, System to ~50%. Hangs after 7-70% sent. Issue narrowed down to sending emails with attachments (+10Mb?) while using secure (SSL) SMTP server (works ok without encryption, but that is to a similar but different SMTP server). Anti-virus and Zone Alarm turned off (although, before being turned off, Zone Alarm intriguingly used ~25% CPU through the send,and system goes to ~25%). Sent emails are COPIED TO LOCAL ONLY, not to gmail via IMAP. Drafts also saved locally only. 

Workaround: don't use SSL.
As I wrote before, my ISP uses Router Type Hub/modem. It means ordinal home LAN, "TCP on Ethernet", is used in my network environment. And, I can't reproduce problem.
> PC <-100Mbps LAN-> Router Type Hub/VDSL modem <-70Mbps-> ... 
To all problem reporters:
What kind of network do you use?
  TCP on Ethernet
  TCP on PPP(PPP on Ethernet)
  TCP on PPP(PPP on Dialup Line)
(In reply to comment #69)
> To all problem reporters:
> What kind of network do you use?

I use TCP on Ethernet.
TB -> TCP on ethernet (100baseTX-FD LAN) -> router -> ADSL/MPLS -> router -> Gigabit TCP Lan -> postfix with TLS

It's a 60.000 hosts intranet with many LANs connected via ADSL with MPLS transport. All TB with a less or equal 512kb/s ADSL can't send 1Mb attachement. It was OK with TB2.

Sending without TLS is OK and we get about 50 KB/s throughput. But we do need TLS to protect credentials on this quite big intranet.

This bug is a drama.
> To all problem reporters:
> What kind of network do you use?

Ethernet into an ADSL router. My traffic and SMTP is handled by Internode.on.net
(In reply to comment #71)
> TB -> TCP on ethernet (100baseTX-FD LAN) -> router -> ADSL/MPLS -> router ->
> Gigabit TCP Lan -> postfix with TLS
> 
> All TB with a less or equal 512kb/s ADSL can't send 1Mb attachement.
> It was OK with TB2.

This bug is never for "can't send". This bug is for next;
- CPU hog while mail sending/uploading via SMTP/IMAP with SSL.
- unresponsiveness of Tb while the CPU hog.
- unresponsiveness of any other application while the CPU hog
  which is generated by Tb's mail sending/uploading with SSL".

> Sending without TLS is OK and we get about 50 KB/s throughput.
> But we do need TLS to protect credentials on this quite big intranet.

samsam: 
Same server(and same route) with TLS and without TLS?
Which Tb's setting do you call TLS? "SSL/TLS"? "StartTLS"?
If different server, did you check with small network.tcp.sendbuffer size in order to rule out problem due to router's bug such as bug 566548?

(In reply to comment #72)
> Ethernet into an ADSL router.
> My traffic and SMTP is handled by Internode.on.net

I'm not asking physical equipment/physical connection type only.
Following is a support document by Internode.on.net.
http://www.internode.on.net/support/guides/general_settings/#adsl
> Broadband ADSL
> ADSL Username: username@internode.on.net
> Encapsulation: PPPoA / VC MUX - Standard ADSL
>                PPPoE / LLC - ADSL2+
sebling@gmail.com, TCP over PPP over Ethernet type, isn't it?
  Router/ADSL model is connected via LAN cable. 
  Dialar is invoked before internet connection use.
(In reply to comment #70)
> I use TCP on Ethernet.

As bug 572397 was problem report with TCP over PPPoE, I guessed PPPoE is relevant. But bug opener's case was not with TCP over PPPoE. PPPoE seems irrelevant to problem.
What is difference between PC/Tb problem occurs and PC/Tb problem never occurs?
Number of bits of cypher key? Modules used for cyphering is different? 
Why "Tb 1% + avast 2%" if avast! 5 takes over SSL? (comment #52)
Apologies WADA, I'm a little over my head but keen to try and help with this bug.
Yes my router is connected via an (Ethernet) LAN cable to my router which (theoretically) is running ADSL2+ on the Internode "extreme plan" so my best guess is that I am running as per the info you found on that internode page.

My router reports 
"Network Type      PPPoE
DownStream Connection Speed  	5715 kbps
UpStream Connection Speed 	894 kbps
VPI 	8
VCI 	35"

I usually use securemail.internode.on.net which provides the SSL connection, but now I am using the server listed on that page due to this bug.

The router remains active 24/7. It had not occurred to me to reset my router and try again. Would that be helpful? Would you like me to try TB with SSL on another computer on the network? or even, would it be helpful if I tried it on my neighbours network with the same PC?
(In reply to comment #75)
> Apologies WADA, (snip)

No need of aplogy. As objective of my question is to know "TCP over PPP" is relevant to issue or not, I merely corfirmed about it in your case.
Thanks for supported Connection Speed information. Because of performance related problem, such data of each environment is important.

I never doubt "no problem without SSL, problem only with SSL" by you, but please rule out problem like bug 566548(router's bug. CPU hog can occur by frequesnt re-transmission of large volume of data) with small network.tcp.sendbuffer value.

According to bug opener of bug 577545, "CPU 100%" seems next phenomenon.
  "CPU utilization of Tb" + "CPU utilization of process of System" (on MS Win)
  is always 100%, during Tb is sending/uploading mail data with SSL.
We are tring to find which tasks of Thunderbird.exe/System consume CPU and what priority is assigned to tasks who eat CPU power in that bug. Watch that bug, and provide required data if you can obtain data such as Counter log of Win, please.

By the way, my ISP started to support SSL and IMAP recently. So I can check with next servers, like you.
  (incoming) pop.ops.dti.ne.jp(POP3,110,non SSL),
     imap.cm.dream.jp(IMAP,143,non SSL), imap.cm.dream.jp(IMAPS,993,SSL)
  (outgoing) smtp.ops.dti.ne.jp(SMTP,587,non SSL),
     vsmtp.cm.dream.jp(SMTP,587,non SSL), vsmtp.cm.dream.jp(SMTP,465,SSL)
Problem of my environment and my servers;
  Problem of this bug can't be reproduced :-)
According to next document, avast! 5's SSL support by Mail Shield sounds;
> http://support.avast.com/index.php?_m=knowledgebase&_a=viewarticle&kbarticleid=458
(1) Mailer's account setting : non SSL(POP3=110,IMAP=143,SMTP=25)
(2) Register hostname/username requires SSL to avast! 5's Mail Shield
(3) Mail proxy of avast! 5 intercepts connection request to host without SSL,
    and comunicates with mail client via non-SSL connection,
    and comunicates with real host using SSL.
Because of this mechanism, "no problem with avast! 5's SSL support of Mail Shield" can not be sure evidence of "CPU hog due to re-transmission caused by bug of router doesn't exist".
Please surely rule out "CPU hog due to re-tranmission caused by router's bug" by small network.tcp.sendbuffer value in any environment.
CCing to Michael A. Pasek, who analyzed bug 541367 and bug 566548.
How can we easily/surely rule out "CPU hog due to re-transmission caused by router's bug" in this bug(bug 538283), bug 572397, and bug 577545?
(In reply to comment #73)
> This bug is never for "can't send". This bug is for next;
> - CPU hog while mail sending/uploading via SMTP/IMAP with SSL.

Yes, it's the case. It's not a "can't send" bug, but TB frozzen sending big attachement with 100% CPU

> Same server(and same route) with TLS and without TLS?

Yes, exactly the same.

> Which Tb's setting do you call TLS? "SSL/TLS"? "StartTLS"?

StartTLS

Regards,
(In reply to comment #78)
> CCing to Michael A. Pasek, who analyzed bug 541367 and bug 566548.
> How can we easily/surely rule out "CPU hog due to re-transmission caused by
> router's bug" in this bug(bug 538283), bug 572397, and bug 577545?

The "router bug" would cause eventual session failure (connection timed out).  Retransmission seen in those cases is likely not due to router bug, but due to congested network (retransmissions at end, which causes the retransmission timeout, are due to router bug).  

I would suspect this problem to be at the SSL layer, looping while waiting to send/receive more data, instead of "giving up control" to other processes.  OR, it could be due to encryption, although I would then expect the slow-speed links (e.g., 50Kb) to NOT exhibit the 100% CPU utilization.  I say this because of two observations:
  1) Problem does not occur when "Connection Security" = "None";
  2) Problem does not occur when Avast does SSL to server.
(In reply to comment #80)
> I would suspect this problem to be at the SSL layer, looping while waiting to
> send/receive more data, instead of "giving up control" to other processes.  
>(snip)

Problem like Bug 559676?
(In reply to comment #81)
> Problem like Bug 559676?

Yes.  The "shark profile" data attached to Bug 559676 does seem to implicate the "PR_Poll" and/or "_ZL16nsSSLIOLayerPollP10PRFileDescsPs" processes.
Mixed blessing: I have lost the ability to replicate the issue that last week was so predictable. I've turned SSL back on and things are working fine now. TB is taking about 10-25% CPU as it sends a large attachment, but this time System is virtually idle. It does lock up for just 5 seconds under high disk activity as, i assume, it's creating the file to send, but then works as per TB2

What's changed:
- rebooted my machine several times (although I had done that between issues last time)
- Uninstalled Adobe CS4 and installed CS5
- defragged
- de-selected "crash while sending" (joke)
- freed up more HD space (went from 5% space to 14%)

Go figure. Will let you know if it comes back.
(In reply to comment #80)

Michael A. Pasek, thanks for your analysis of "shark profile" of Bug 559676.

> http://mxr.mozilla.org/comm-central/source/mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp#2043
> 2043 nsSSLIOLayerPoll(PRFileDesc *fd, PRInt16 in_flags, PRInt16 *out_flags)
nsSSLIOLayerPoll calls nsSSLThread::requestPoll. 
> 2065   return nsSSLThread::requestPoll(socketInfo, in_flags, out_flags)

nsSSLThread::requestPoll is;
> Defined as a function in:
>  * mozilla/security/manager/ssl/src/nsSSLThread.cpp (View Hg log or Hg annotations)
>    o line 225, as member of class nsSSLThread --
>      PRInt16 nsSSLThread::requestPoll(nsNSSSocketInfo *si, PRInt16 in_flags, PRInt16 *out_flags)
> http://mxr.mozilla.org/comm-central/source/mozilla/security/manager/ssl/src/nsSSLThread.cpp#225
Comment in source.
> 286             else
> 287             {
> 288               // Unfortunately we do not have a pollable event
> 289               // that we could use to wake up the caller, as soon
> 290               // as the previously requested I/O operation has completed.
> 291               // Therefore we must use a kind of busy wait,
> 292               // we want the caller to check again, whenever any
> 293               // activity is detected on the associated socket.
> 294               // Unfortunately this could mean, the caller will detect
> 295               // activity very often, until we are finally done with
> 296               // the previously requested action and are able to
> 297               // return the buffered result.
> 298               // As our real I/O activity is happening on the other thread
> 299               // let's sleep some cycles, in order to not waste all CPU
> 300               // resources.
> 301               // But let's make sure we do not hold our shared mutex
> 302               // while waiting, so let's leave this block first.
> 303 
> 304               want_sleep_and_wakeup_on_any_socket_activity = PR_TRUE;
> 305               break;
> 306             }

> http://mxr.mozilla.org/comm-central/source/mozilla/security/manager/ssl/src/nsSSLThread.cpp#375
> 375   if (want_sleep_and_wakeup_on_any_socket_activity)
> 376   {
> 377     // This is where we wait for any socket activity,
> 378     // because we do not have a pollable event.
> 379     // XXX Will this really cause us to wake up
> 380     //     whatever happens?
> 381 
> 382     PR_Sleep( PR_MillisecondsToInterval(1) );
> 383     return PR_POLL_READ | PR_POLL_WRITE | PR_POLL_EXCEPT;
> 384   }

Stopping logic tracing...

It looks BUSYWAIT what Nelson Bolyard(:MisterSSL) referred in Bug 559676 Comment #19.
I thought such BUSYWAIGHT occurs oly when "STS poll iter [1]" is logged, because almost all "STS poll iter [?]" was "STS poll iter [1]" in NSPR log attached to Bug 559676. But it seems wrong. BUSYWAIT probably can happen also when "STS poll iter [0]" is logged.
Note: "STS poll iter [1]" is less than 1/3 in NSPR log attached to Bug 577545.
Michael A. Pasek, where can we get shark enabled Thundrbird build?
(In reply to comment #84)
> > 288               // Unfortunately we do not have a pollable event
> > 289               // that we could use to wake up the caller, as soon
> > 290               // as the previously requested I/O operation has completed.

Well, that explains a lot.  It IS (effectively) a "busy wait" loop....

(In reply to comment #85)
> Michael A. Pasek, where can we get shark enabled Thundrbird build?

I don't know the answer to that question.  Best to ask Ehsan Akhgari [:ehsan].
CCing to Nelson Bolyard.
How can we know this kind of "BUSY WAITING" happens or not in this bug(bug 538283), bug 572397, and bug 577545?
(In reply to comment #34)
> Shall I upload compressed log for further analysis?

GWu:
Do you still have log data? 
As I wrote comment #49, I wrote Rexx script for quick checking of log. See attache data to bug 577545.
If you still have log data, attach to this bug, please.
CCing to Ehsan Akhgari [:ehsan], who is opener of Bug 559676 and Bug 588952.
Can we get and test wit shark build of Thunderbird?
(In reply to comment #89)
> CCing to Ehsan Akhgari [:ehsan], who is opener of Bug 559676 and Bug 588952.
> Can we get and test wit shark build of Thunderbird?

I spoke to gozer last month about shark builds, because we could use them for the likes of 
 Bug 470001 - Thunderbird consumes very low CPU though being idle and no accounts [Mac] and
 Bug 488275 – CPU races to about 50% when online (0 when offline)
and many others.

the upshot is we stopped daily shark enabled builds earlier this year, but they could be built again. perhaps now is a good time to do that, and start hitting some of our performance problems http://bit.ly/dv9efE in the nose?
To all problem reporters:
Is there any user of OS other than MS Win XP?

To all problem reporters with MS Win XP:
Will "forcing single CPU use of Tb" by CPU affinity setting affect on problem?
Hardware: All → x86
I am using tb 3.1.x at home on the machine with a little bit slow dualcore E5200 with Win 7 64-bit and the problrms occured, buat because of dualcore, system is much more responsible during sending of attachment, it looks it hogs only core1...

On XP with the old singlecore CPU is the PC unusable while sending and I have to lower the process priority of TB to make the system very little more responsive...

Mbiba
I've seen this on Mac.  And since this problem cannot be reproduced reliably, I'm not sure how useful a shark build would be.

On a related note, AFAIK a shark build only gives you the shark API (connectShark, et al.) so that you can profile a bunch of code by instrumenting it.  I'm not sure how it could be useful in this case.
(In reply to comment #92)
> I am using tb 3.1.x at home on the machine with a little bit slow dualcore
> E5200 with Win 7 64-bit and the problrms occured, (snip)
> On XP with the old singlecore CPU is the PC unusable while sending and I have
> to lower the process priority of TB to make the system very little more
> responsive...
(In reply to comment #93)
> I've seen this on Mac.

Oh, thanks to Mbiba and Ehsan Akhgari for used OS information.
I suspected problem on Win-XP only by "TSC skew on multi CPU with ACPI", because I'm not sure that halmacpi.dll update for the problem was shipped by XP SP3, because problem seems "many busy waiting with sleep(1 msec)" and such problem is relevant to scheduling.
But it's irrelevant, because such problem of Win is already resolved in Win7, and because problem occurs with Win-XP on single CPU, and because such problem didn't occur on Mac OS X versions(and Linux verions) usually used by Tb users.
When I tried to use ubuntu linux (actual) I found this behaviour of thunderbird is the same here as in windows! So I think TB running on linux is also affected by this. But I used the same profile folder as in windows xp is used for this test, cause I didn't have enough time to try it with a zero start with clean profile.

Could it be helpful to try it on some linux without "imported" windows profile?
Maybe I could try it during next week if it will be reasonable?

Mbiba
Changing back to Platform All, per comments on used OS. Thanks for reports.
(In reply to comment #50)
> GWu, can you check with next Tb's prefs value? (Read thru bug 566548, please)
> > network.tcp.sendbuffer=65536 or 32768 (default is 131072)

network.tcp.sendbuffer set to 32768 is much better with many hosts. TB can send attachements with this value without CPU to 100%, and quite fast, on single core Celeron CPU.
(In reply to comment #97)
> network.tcp.sendbuffer set to 32768 is much better with many hosts. TB can send
> attachements with this value without CPU to 100%, and quite fast, on single
> core Celeron CPU.

samsam, if network.tcp.sendbuffer<=64KB is effective workaround, your problem with many hosts is not this bug and friends of ths bug. It's bug 541367 comment and/or bug 566548.
Please report peoblem of "CPU hog + SSL + Uploading(SMTP or IMAP) even ater network.tcp.sendbuffer<=64KB" only to this bug.
I have the same issue as the OP. The thing is, I kind of remember when it started happening. It was after some kind of bug fix around Thunderbird 2.0.0.16 or so. There is simply something wrong with the code that polls the socket and it goes into a 100% CPU loop until the file hits 100% sent.

I have 2 connections for my PC, 100MBit fiber at home and 64kbps wireless when on the road. I can send a huge file (say 100Mb attachment) from home and it will only take about 10 secs (at 100% CPU), so it's not like it is doing a whole lot of necessary processing. On the 64 kbps connection, however, it will take 40 minutes to send with the CPU at 100% the entire time.
+1 in having this problem with every PC (more than 5 different) I ever used on slow (upload < 128kBit) internet connections and TB 3.x with SSL encryption enabled and attachments to an email.

Is there any forecast in solving this issue? Will there be any progress?
As a temporary solution, would it be possible to set the priority of the mail sending thread to below normal, so that the 100% CPU utilization does not interfere with performing other tasks. At the moment, I can't do anything with the PC while Thunderbird is sending (and sometimes while it is receiving, but that is a bit harder to isolate because it receives asynchronously of any user interaction). I have tried setting the entire Thunderbird process to below normal priority (which works, but is a pain in the ass to do everytime), but even then I can't read anything in my Inbox while TB is sending because the sending thread still has the same priority as the UI thread. So please, could you set the sending/SSL communication thread to a lower priority than the other threads.
(In reply to comment #101)
> As a temporary solution, would it be possible to set the priority of the mail
> sending thread to below normal,..

not feasible/wise to do from the application. And although you can do it manually, that's not a good situation either. 

The only real solution is to fix bug(s).
Can someone check with large mailnews.tcptimeout value like 100000==100*1000, 600000=600*1000?

See bug 570293 for reason why large mailnews.tcptimeout value.
If mailnews.tcptimeout=100 means "100 milliseconds", and if final timeout error never occurs but needless "internal timeout + retry, re-scheduling etc." frequntly happens by the request of "100 milliseconds", it merely increases overhead in TCP networking and wastes CPU power by non-productive works.
WADA, I just tried changing mailnews.tcptimeout to a large value (10000) and it makes no difference. The CPU still goes to 100%.

The problem seems to be that TB is using non-blocking calls to the socket functions and is not calling select() to wait for the socket to be ready, but is instead entering a non-yielding polling loop. I tried downloading the source code (because really the problem is somewhere near the select()) but the source is so complicated I didn't even know where to start looking.
(In reply to comment #104)
> but is instead entering a non-yielding polling loop.

Matt, phenomenon like next in Bug 559676 comment #19?
> 63.5% of the time is being spent BUSY WAITING using PR_Poll, and only
> 26.7% of the time is being spent productively working on sockets.
How did you see "non-yielding polling loop"?
Hi WADA, thanks for the links. I was able to see the source code, but it is quite difficult to understand where it is going wrong. Anyway, I ran some tests with NSPR_LOG_MODULES=nsSocketTransport:5 and I can reproduce a non-yielding polling loop. The sequence runs something like this:

1) PR_Write continually succeeds until the kernel write buffer is full.

2) The following loop repeats several thousands times:

10712[1a29280]: nsSocketOutputStream::AsyncWait [this=86bdc98]
10712[1a29280]: STS poll iter [0]
10712[1a29280]:   active [0] { handler=86bdbe0 condition=0 pollflags=6 }
10712[1a29280]:   calling PR_Poll [active=1 idle=0]
10712[1a29280]: poll timeout: 100
10712[1a29280]:     timeout = 0 milliseconds
10712[1a29280]:     ...returned after 0 milliseconds
10712[1a29280]: nsSocketInputStream::Read [this=86bdc78 count=4096]
10712[1a29280]:   calling PR_Read [count=4096]
10712[1a29280]:   PR_Read returned [n=-1]
10712[1a29280]: nsSocketInputStream::AsyncWait [this=86bdc78]
10712[1a29280]: STS poll iter [0]
10712[1a29280]:   active [0] { handler=86bdbe0 condition=0 pollflags=7 }
10712[1a29280]:   calling PR_Poll [active=1 idle=0]
10712[1a29280]: poll timeout: 100
10712[1a29280]:     timeout = 0 milliseconds
10712[1a29280]:     ...returned after 0 milliseconds
10712[1a29280]: STS poll iter [0]
10712[1a29280]:   active [0] { handler=86bdbe0 condition=0 pollflags=7 }
10712[1a29280]:   calling PR_Poll [active=1 idle=0]
10712[1a29280]: poll timeout: 100
10712[1a29280]:     timeout = 0 milliseconds
10712[1a29280]:     ...returned after 0 milliseconds
10712[1a29280]: STS poll iter [1]
10712[1a29280]:   active [0] { handler=86bdbe0 condition=0 pollflags=7 }
10712[1a29280]:   calling PR_Poll [active=1 idle=0]
10712[1a29280]: poll timeout: 100
10712[1a29280]:     timeout = 100000 milliseconds
10712[1a29280]:     ...returned after 0 milliseconds
10712[1a29280]: nsSocketTransport::OnSocketReady [this=86bdbe0 outFlags=3]
10712[1a29280]: nsSocketOutputStream::OnSocketReady [this=86bdc98 cond=0]
10712[1a29280]: STS dispatch [8a886c8]
10712[1a29280]: nsSocketInputStream::OnSocketReady [this=86bdc78 cond=0]
10712[1a29280]: STS dispatch [8a88708]
10712[1a29280]: nsSocketOutputStream::Write [this=86bdc98 count=4096]
10712[1a29280]:   calling PR_Write [count=4096]
10712[1a29280]:   PR_Write returned [n=-1]

3) Eventually, the kernel sends enough packets to clear 4k of buffers and we get a successful send:

10712[1a29280]: STS poll iter [1]
10712[1a29280]:   active [0] { handler=86bdbe0 condition=0 pollflags=7 }
10712[1a29280]:   calling PR_Poll [active=1 idle=0]
10712[1a29280]: poll timeout: 100
10712[1a29280]:     timeout = 100000 milliseconds
10712[1a29280]:     ...returned after 0 milliseconds
10712[1a29280]: nsSocketTransport::OnSocketReady [this=86bdbe0 outFlags=3]
10712[1a29280]: nsSocketOutputStream::OnSocketReady [this=86bdc98 cond=0]
10712[1a29280]: STS dispatch [8a886c8]
10712[1a29280]: nsSocketInputStream::OnSocketReady [this=86bdc78 cond=0]
10712[1a29280]: STS dispatch [8a88708]
10712[1a29280]: nsSocketOutputStream::Write [this=86bdc98 count=4096]
10712[1a29280]:   calling PR_Write [count=4096]
10712[1a29280]:   PR_Write returned [n=4096]
10712[1a29280]: nsSocketTransport::SendStatus [this=86bdbe0 status=804b0005]
10712[1a29280]: nsSocketOutputStream::Write [this=86bdc98 count=4096]
10712[1a29280]:   calling PR_Write [count=4096]
10712[1a29280]:   PR_Write returned [n=-1]
10712[1a29280]: nsSocketOutputStream::AsyncWait [this=86bdc98]

Note that this is immediately followed by a failed send. TB then repeats the failed sending again another several thousand times before the next packet gets away.

The interesting thing in (2) is that you can see two distinct sequences for the call to DoPollIteration (http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/base/src/nsSocketTransportService2.cpp#617) depending on whether the single boolean parameter (wait) is true or false. When wait is false, we get this sequence:

10712[1a29280]: STS poll iter [0]
10712[1a29280]:   active [0] { handler=86bdbe0 condition=0 pollflags=7 }
10712[1a29280]:   calling PR_Poll [active=1 idle=0]
10712[1a29280]: poll timeout: 100
10712[1a29280]:     timeout = 0 milliseconds
10712[1a29280]:     ...returned after 0 milliseconds

When wait is true we get:

10712[1a29280]: STS poll iter [1]
10712[1a29280]:   active [0] { handler=86bdbe0 condition=0 pollflags=7 }
10712[1a29280]:   calling PR_Poll [active=1 idle=0]
10712[1a29280]: poll timeout: 100
10712[1a29280]:     timeout = 100000 milliseconds
10712[1a29280]:     ...returned after 0 milliseconds
10712[1a29280]: nsSocketTransport::OnSocketReady [this=86bdbe0 outFlags=3]

That is, the PR_Poll function is only returning a false-positive when wait=true.
Drilling down into the functions, wait=true/false controls whether timeout=100 or PR_INTERVAL_NO_WAIT (0) in the Poll function (http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/base/src/nsSocketTransportService2.cpp#329). Drilling down further, the only difference this makes in Win32 is in the _PR_MD_PR_POLL function in http://mxr.mozilla.org/comm-central/source/mozilla/nsprpub/pr/src/md/windows/w32poll.c#270.

Now, what I guess is happening here is that when select is called with a timeout of zero, it is behaving exactly as it should, but when it is called with a non-zero timeout, it is having an error and returning without doing anything (no waiting, and no modifying the FD lists, which is why outFlags is non-zero and nsSocketTransport::OnSocketReady is getting called despite the fact that the socket is not ok). I can't really see anything in the code that would cause this. The only things that spring to mind are:
1) Virus on the host machine that is modifying the behavior of select slightly.
2) Microsoft have modified some of their SDK headers as part of adding 64-bit support and somewhere along the line the prototype for select has gotten screwed up.
3) Something else I have no idea about has gone wrong.

Anyway, I hope this helps because it's killed my brain looking at the code.
(In reply to comment #107)

Matt, thanks for your analysis.

> with NSPR_LOG_MODULES=nsSocketTransport:5 and I can reproduce a non-yielding
> polling loop. The sequence runs something like this:
> 1) PR_Write continually succeeds until the kernel write buffer is full.
> 2) The following loop repeats several thousands times:
>(snip)
> 10712[1a29280]: STS poll iter [0]
>(snip)
> 10712[1a29280]: poll timeout: 100
> 10712[1a29280]:     timeout = 0 milliseconds
> 10712[1a29280]:     ...returned after 0 milliseconds

"...returned after 0 milliseconds" after "STS poll iter [0]" is possibly by "PR_Sleep(PR_INTERVAL_NO_WAIT);" in source code I pointed by my comment #44.

I suspected loop like execution of the "PR_Sleep(PR_INTERVAL_NO_WAIT);" is a cause of "CPU 100%", but such log sequence was seen only periodically in log(checked with log of timestamp by DebugView on Win).  
I couldn't find evidence of "CPU 100% by such events" by NSPR log, because such unit of log sequence appeared only periodically in log, and because I couldn't know CPU usage by a such unit of log sequence.
I couldn't know "which log sequence corresponds to BUSY WAITING using PR_Poll".

How can I obtain evidence of such events is cause of "CPU 100%" of this bug?

(Off-Topic)
> 2) Microsoft have modified some of their SDK headers as part of adding 64-bit
support
> and somewhere along the line the prototype for select has gotten screwed up.

I always suspect MS Win first when I meet funny problem because I know too many problems due to MS's stupid code, even though I know MS Win was not culprit in almost all funny problems I met :-)
(In reply to comment #107)
> That is, the PR_Poll function is only returning a false-positive when
> wait=true.

Matt, can it produce problem like bug 570293?
  To avoid timeout, mailnews.tcptimeout=100*1000 was required.
  mailnews.tcptimeout=10*1000 was insufficient.
Matt, timer routine of NSPR uses timeGetTime of Win(resolution is 10ms) instead of QueryPerformanceCountr of Win(uses TSC if multi-CPU, uses PMT if single CPU). See bug 563082 comment #4, please.
Can such thing be relevant to issue of "PR_Poll function is only returning a false-positive when wait=true" which you found?
Following is summarized log of Bug 437494, with non-SSL SMTP server.  
  Even though Tb logs "SMTP Send: . ", Tb sends a part of ".[CRLF]" only,
  so server waits for last part of mail data forever.  
> 00000374 11.51639557 [4748] 0[192c140]: SMTP Send: MAIL FROM:<soarex@ops.dti.ne.jp> SIZE=8191  
> 00000531 11.77045441 [4748] 0[192c140]: SMTP Send: .  
> 00000535 11.77111435 [4748] @type=11(2),{2x(Write[count=4096]->[n=4096])}
> 00000543 11.77172184 [4748] @type=50,STS_poll_iter[0],timeout...returned[0/0]
> 00000551 11.77216244 [4748] @type=50,STS_poll_iter[0],timeout...returned[0/0]
> 00000557 11.77243233 [4748] @type=50,STS_poll_iter[1],timeout...returned[2619/100000]
> 00000564 14.39095974 [4748] @type=50,STS_poll_iter[0],timeout...returned[0/0]
> 00000570 14.39163494 [4748] @type=50,STS_poll_iter[0],timeout...returned[0/0]
> 00000576 14.39218521 [4748] @type=50,STS_poll_iter[1],timeout...returned[14998/98000]
> 00000583 29.39087296 [4748] @type=50,STS_poll_iter[0],timeout...returned[0/0]
> 00000589 29.39126205 [4748] @type=50,STS_poll_iter[0],timeout...returned[0/0]
> 00000595 29.39163208 [4748] @type=51,STS_poll_iter[1]
> 00000599 29.39199257 [4748] 3820[192c280]:     timeout = 84000 milliseconds 

@type=11 corresponds to next log sequense:
>    calling PR_Write [count=4096]	
>    PR_Write returned [n=4096]	
@type=11(2) is two consecutive above logs.

@type=50 corresponds to next log sequence:
> STS poll iter [X]
>(other log is discarded)
>(calling PR_Poll [active=N idle=M] is omitted too) 	
>     timeout = MMM milliseconds	
>     ...returned after NNN milliseconds	
Format in summary is "timeout...returned[NNN/MMM]", Returned after NNN milliseconds for request with timeout of MMM milliseconds.

Logs for "STS poll iter [1]" are;
> 00000557 11.77243233 [4748] @type=50,STS_poll_iter[1],timeout...returned[2619/100000]
> 00000576 14.39218521 [4748] @type=50,STS_poll_iter[1],timeout...returned[14998/98000]
> 00000595 29.39163208 [4748] @type=51,STS_poll_iter[1]
> 00000599 29.39199257 [4748] 3820[192c280]:     timeout = 84000 milliseconds 

It looks for me next.
 1. 11.77243233 (timestamp by DebugView, so accurate/fine timetamp of event)
    Request with timeout=100000, and returned after 2619 millseconds.
 2. 14.39218521 (after 2.6 seconds)
    Request with timeout=100000-{(2619/1000)*1000}=98000 in integer,
    and returned after 14998 millseconds.
 3. 29.39199257 (after 15 seconds) 
    Request with timeout=98000-{(14498/1000)*1000}=84000 in integer,
    and not returned, because I killed never ending SMTP send.
This may be "correct use of 'BUSY WAITING using PR_Poll' in order to wait for specific time".
As similar "reducing of timeout value" was seen in SSL case too, and "...returned 0 milliseconds for timeout=100000" is also seen SSL case in my environment(CPU hog with SSL can't be reproduced), it's not non-SSL only phenomenon.   

If this is true, next timeout value is always same as before when response time is less than 1000 milliseconds, and 'BUSY WAITING using PR_Poll' never ends until main task(sending mail data) completes normally or abnormally, if such BUSY WAITING is used to sleep for TIMEOUT milliseconds by SSL too. 
If "...returned after 0 milliseconds", it may be worse than ordinal "less than 1000" case.   
I could see negative timeout value too, including log of my case(no CPU hog).
> timeout = -1 milliseconds

Pattern of log sequence itself is not so different between "SSL and CPU hog always occurs" case and "SSL but no CPU hog problem" case(my environment), although amount of log for "4KB write failure"(no free TCP buffer available) is larger than my case because "no free TCP buffer available" status keeps longer and occurs frequently than my environment due to slower connection than my environment.
And, I could see next log sequence in my log with SSL(CPU hog can't be reproduced).
> 10712[1a29280]: STS poll iter [1]
> 10712[1a29280]:   active [1] ...
> 10712[1a29280]:   calling PR_Poll [active=1 idle=0]
> 10712[1a29280]: poll timeout: 100
> 10712[1a29280]:     timeout = 100000 milliseconds
> 10712[1a29280]:     ...returned after 0 milliseconds
If this phenomenon only is cause of CPU hog, CPU hog should occur in my environment or CPU utilization should be higher in my environment.
As number of above log sequence is not so many(rather few) in my case(no CPU hog), "number of such requests executed concurrently/simultaneously" may be relevant to problem.

If SSL, many "4KB read failure followed by AsyncWait" are seen in log while sending 4KB data and retrying send due to send failure by no free TCP buffer.
> STS dispatch [41d2508]	
> nsSocketInputStream::Read [this=43e65b8 count=4096]	
> calling PR_Read [count=4096]	
> PR_Read returned [n=-1]	
> nsSocketInputStream::AsyncWait [this=43e65b8]	
> STS poll iter [X]
It may be relevant to problem, because this kind o log is not seen in non-SSL case.
It may produce "...returned after 0 milliseconds" too.

With any suspect, I still can't guess why CPU hog can't be reproduced in my environment. Why "slower connection than mime" can reproduce CPU hog and can always/reliably produce CPU.
WADA, a slow connection could be effectively simulated with a proxy 
program such as ssltap and a small window size.
Or use mod_bandwidth with Apache..?
(In reply to comment #112)
> WADA, a slow connection could be effectively simulated with a proxy 
> program such as ssltap and a small window size.

Oh, Nelson has come at last :-) Thanks for your participation to this bug. I'll try proxy to emulate slow connection.

To Nelson Bolyard (:MisterSSL):
Can you explain why CPU hog occurs only with SSL and slow connection?
As "STS poll iter[1]" followed by "...returned after 0 milliseconds" is seen in my log(CPU hog can't be reproduced), I still can't guess why "BUSY WAITING using PR_Poll" by it causes "CPU hog while sending via SSL". Scheduling priority related issue? Problem in TCP or NSPR?
NSPR log(all:5), with sequence number/timestamp by DebugView.
  SMTP server : smtp.gmail.com (SSL)
  Mail size : 1.5MB
  network.tcp.sendbuffer=4096. mailnews.tcptimeout=100(default)

As network.tcp.sendbuffer=4096, next pattern continues.
  PR_Write [count=4096] -> OK
  PR_Write [count=4096] -> fail
  PR_Read  [count=4096] -> fail
  PR_Write [count=4096] -> fail
  PR_Read  [count=4096] -> fail
  PR_Read  [count=4096] -> fail
  PR_Write [count=4096] -> fail
  PR_Read  [count=4096] -> fail
  (send of 4KB data by TCP ends, and buffer is free'ed)
  PR_Write [count=4096] -> OK
  PR_Write [count=4096] -> fail
  (similar pattern is repeated until all data is sent)

Above is same as "slow connection" case.
Difference from usual "slow connection" case is;
 "slow connection" case:
  Many successful "PR_Write [count=4096]" is queued at TCP level,
  because network.tcp.sendbuffer=large(default:131072),
  and server probably avertises large TCP Window size.
  (1f 128KB, 32*4KB_buffer is queued at TCP level)

I can't guess why network.tcp.sendbuffer=4096 can't be workaround of "CPU 100%" in "slow connection" environment. As network.tcp.sendbuffer=4096, Tb requests single 4KB only at once, and Tb & SSL merely repeats retry of send request and repeats 4KB request, with requesting STS poll iter[0 or 1], until the single 4KB buffer is freeed. If slow connection, I think merely "number of such retries per a 4KB" is increased.

Matt, can you check with network.tcp.sendbuffer=4096?
If poosible, check also with smtp.gmail.com, please.

What is difference of your case from my test?
Re-transmission of some packets happens and flow is disturbed like next?
  Packet loss happens and retransmission occurs, so only a part of 4KB buffer
  is freed(4KB-N). Then upon next 4KB send request, only 4KB-N is accepted
  by TCP. As N bytes is not sent yet, Tb retries N bytes send request.
  After N bytes is accepted, Tb requets next 4KB, and only 4KB-N is accepted.
  This disturbed pattern may continue until total bytes of packet loss and
  re-transmission reaches 4KB.
In my environment, this kind of pattern was seen several times but very few.
Even if such disturbed pattern was cause, I can't guess why CPU hog with SSL started to occur from Tb 3.0.
> Can you explain why CPU hog occurs only with SSL and slow connection?

Perhaps the application never experiences eWouldBlock on fast connections,
and so never busy-waits on those.  I have no other explanation at this time.
Hi WADA,

Thanks for your work on this bug.

I tried setting network.tcp.sendbuffer=4096 but it had no effect. I still got 100% cpu utilization.

I can only test on one computer, because it is when I am using a PCMCIA PHS card that I get the slow connection. I don't have compatible adapters on my other laptop or on my desktop, so I cannot even demonstrate that it is not simply a problem with my computer (maybe some kind of virus hooking into winsock that is disrupting some messages).

Anyway, when I have time I will try getting hold of the source and debugging on my PC. I have VisualStudio 2005, so I should be able to compile and see what is going on, but atm my day job is keeping me busy.

Anyway, the other piece of possibly useful info is that the computer I am using is a single-core, single-cpu, no hyper-threading.
(In reply to comment #117)
> I tried setting network.tcp.sendbuffer=4096 but it had no effect. I still got
> 100% cpu utilization.

It's same result as bug 577545. See that bug, please.
Matt, is similar logs to bug 577545 seen in your log?
If too large file to read for human being, attach your log file to that bug or separate bug for log data analysis of your case.
(In reply to comment #116)
> > Can you explain why CPU hog occurs only with SSL and slow connection?
> Perhaps the application never experiences eWouldBlock on fast connections,
> and so never busy-waits on those. I have no other explanation at this time.

Nelson Bolyard (:MisterSSL), can "PR_Read and returned data to it" while Mozilla is continuously sending 4KB data(successful/unsuccessful PR_Write for 4KB data) produce the eWouldBlock, if SSL via slow connection?
Such log sequence of PR_Read is seen at end of log file attached to bug 577545, even though "SMTP Send ." is not logged(indicator of last data will be sent soon and will wait for response from SMTP server). Such log sequence of PR_Read is seen at end of log file attached to bug 559676 too.
(Some corrections about comment #111)
Log sequence like next was seen even when SSL, after sending all data.
(no response from server due to Bug 437494, then timeout occurs)
  1. STS poll iter[1] -> calling PR_Poll -> poll timeout: 100
     timeout 100000 milliseconds -> ...returned after 15000 milliseconds
  2. STS poll iter[1] -> calling PR_Poll -> poll timeout: 85 (100-15)
     timeout 85000 milliseconds -> ...returned after 15000 milliseconds  
  3. STS poll iter[1] -> calling PR_Poll -> poll timeout: 70 (85-15)
     timeout 70000 milliseconds -> ...returned after 15000 milliseconds 
  4. timeout value is reduced at each step, and finally timeout occurs.

Log sequence of "STS poll iter[1] -> timeout 100000 milliseconds -> ...returned after 0 milliseconds" was seen in non-SSL case after 4KB send error(no free send buffer, calling PR_Write count[4096] -> PR_Write returned [-1]). However, after a while, "STS poll iter[1] -> timeout 100000 milliseconds -> ...returned after XX milliseconds" appeared if non-SSL case.
  ISP's no-SSL SMTP: XX was 1 to 5. It looks that XX is increased each time.  
  Gmail's SMTP via SSL tunnel by Avast! 5(port 25 in Tb's definition):
     XX was around 50 in some logs, XX was around 100 in other some logs.

I guess, if SSL, No-Wait is always requested upon request which invokes "STS poll iter[1]" after "PR_Write count[4096] -> PR_Write returned [-1]".
If not, frequent receive operation(calling PR_Read count[4096] -> PR_Read returned [-1]) after send error may be a cause of always "...returned ater 0 milliseconds", because such log for "4KB read" is not seen in log of non-SSL case. "Receive" may reset counter like "total wait time during sending in the past".
This kind of phenomenon can produce relatively higher CPU utilization than non-SSL, but I think it can't produce "CPU hog" by itself, although it ay be a triger o problem which causes "CPU hog".
I checked socket log attached to bug 577545 again, and non-zero "...returned after XX miliisoconds" for "STS poll iter[1]" were found in log. XX was initially ZERO, but was increased to "1 to 5" during "send 4KB OK -> send 4KB fail -> retries of send 4KB" is excecuted reapeatedly.

Matt, "...returned after 0 miliisoconds" for "STS poll iter[1]" looks "returned after less than 1 milliseconds".
In reply to comment 120, the discovery that the caller of PR_Poll is steadily decreasing the timeout time by 15 until it goes to zero, at which point it becomes a busy-wait, it s very significant finding.  Can you find that caller
and fix it?
(In reply to comment #122)
> the caller of PR_Poll is steadily decreasing the timeout time by 15 until it goes to zero,
> at which point it becomes a busy-wait, (snip)

It's not steadily decreasing. It's decrement of timeout value by XX/1000 for "...returned after XX milliseconds". And, when reached ZERO, timeout by mailnews.tcptimeout is detected.

My comment was;
Such sequence was seen in both non-SSL case and SSL case after "SMTP Send: ." and after send of whole mail data, because server never returns response(test case of Bug 437494).
I wanted to say;
- Such sequence is probably correct sequence to wait for 100 sec(100*1000
  milliseconds, mailnews.tcptimeout), if PR_Poll is utilized for sleep.
  Above is evidence that such "sleep for mailnew.tcptimeout" works very well
  even in SSL case.
- Even if such sequence is used correctly for sleep, timeout value is probably
  not decremented, if XX in "...returned after XX milliseconds" is
  less than 1000 milliseconds.
  Next timeout(in second, at caller of NSPR, socket level) should be
  { previous_timeout - (XX+1000)/1000 }
  instead of { previous_timeout - XX/1000 }
  if PR_Poll of NSPR is utilized for sleep.
- If 'XX in "...returned after XX milliseconds" is less than 1000 milliseconds'
  happens while BUSY_WAIT type PR_sleep, 'BUSY_WAIT type PR_sleep' may be
  executed forever until main task ends(==whole mail data is sent).

Nelson Bolyard (:MisterSSL), can you look into socket log of bug 577545?
In that case, (a) "no NSPR logb by Tb" continues for three minutes is seen after 37 times of successful sending of 4KB data(144KB in total) followed by 4304 times of "retry of send 4KB", and (b) "receive data" happens just after the three minutes silence.
I guess (a) is "busy of TCP due to re-transmission of large data", but I'm not sure. (b) may be re-negotiation because SSL case.
Someone wrote that this bug has become a drama. I agree.

It is now 2011, TB5 is out and I still need to cope with this bug. 

I've watched the bug reports for this quite a while now and to be frank, I question the competence of the Mozilla team on this one. All my colleagues have/had this problem (my company switched mail clients because of this).

I'm not interested in techno-babble (no offense) but I see and understand that it hasn’t led to a solution yet. From what I’m getting is, that Mozilla still needs to understand the problem. However, I want a mail client that can send big emails without disabling a multi-core system. From my point of view that is a very simple thing and thus, it is mind-boggling to me, that this hasn’t been fixed for such a long time. 

I like TB and I'm thankful for all the work that was put into it. I want to use it for many years to come and hopefully will, but the history of the “100% cpu while sendig large emails” bug is becoming a very dark spot on TB’s history.
I want to echo everything in 124, but also confirm that as of Thunderbird 9.0.1 this is absolutely crippling my WinXP-SP3 system sending 10MB of attachments through our web host's SMTP server.
Some big changes were made in Thunderbird 11 regarding how SSL works, in particular bug 674147. It might be the case that this bug was fixed indirectly by the patches in bug 674147. I would love to hear from some users as to whether or not that is the case.

Further, if it isn't fixed, it is more likely now that we will be able to track down the cause as we've been doing work in this area recently.
You can download install packages or binaries with the patches mentioned in comment 127 at either of:

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-comm-aurora/
http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-comm-central/

comm-central is the most recent one, but is more for brave ones only.  Backup your profiles or test with clean profiles: http://support.mozilla.com/en-US/kb/Managing%20profiles

Thanks.
(In reply to Brian Smith (:bsmith) from comment #127)
> I would love to hear from some
> users as to whether or not that is the case.

It is quite challenging to maintain the initial situation. For myself the constraints that discovered the case are gone. By artificially limiting your upload and sending a large message via SSL'ed imap you should be able to generate a sufficient test case. From my point of view that was the communality of this case, regardless of the (Redmond driven) operating system.

Sadly, I am currently not in the position to review the once and current algorithms, OS functions and so forth that are used and implemented by this particular routine. Maybe you can provide a sufficient diff of this.

Just my 2 cents.

Kind regards
(In reply to Brian Smith (:bsmith) from comment #127)
> Some big changes were made in Thunderbird 11 regarding how SSL works,
> in particular bug 674147. It might be the case that this bug was fixed
> indirectly by the patches in bug 674147.

Thanks a lot for comment.
If problem disappears after change by that bug, it may be "problem of this bug was DIRECTLY resolved by the change".

(a) In Performance Couter log I attached to bug 559676 comment #14, higher priority of SSL thread than Tb's main thread is seen.
(b) Very many PR_Read logs for 4KB data(almost always returns 0 bytes, because sending from Tb to SMTP) is seen in any SMTP log and IMAP log which is related to this bug and I checked log data.
(c) Bug 559676 actually occurred.
> Bug 559676 nsSocketTransportService::DoPollIteration hogging the CPU while uploading a large file to an SSL web server
> Bug 559676 comment #19
> The details in the shark report attached to comment 0 clearly show that
> 63.5% of the time is being spent BUSY WAITING using PR_Poll, and only
> 26.7% of the time is being spent productively working on sockets. 
(d) Bugs like following actually exists. 
> Bug 711786 nsISocketTransport::IsAlive is documented as being thread safe, but > Bug 711786 comment #0 
> nsSocketTransport2::IsAlive() is not thread-safe
> However, the IsAlive() methods calls PR_Recv, and calling PR_Recv
> (at least on an SSL socket) is NOT thread-safe. (See bug 427948, at least.)
> Bug 427948 nsNSSSocketInfo is not thread safe
> Bug 255135 PSM loads string bundles on the socket thread [was: nsStandardURL not thread-safe]

I suspected SSL problem like Bug 559676 initially, but I couldn't find evidence of it in NSPR logs provided in this bug and some other bugs. Phenomenon of comment #123 may be relevant to "not thread safe", but I'm not sure.
I also suspected CPU consumption by "too many retransmission of lost packets", but I couldn't check it because TCP level trace is needed to know whether retransmission of packets happened or not in environment where this bug's problem occurred.

I still think some reports on same symptom of "CPU 100% while sending big mail" was caused by router's bug(in this case, problem vanishes if router's firmware update happens, or small tcp.sendbuffer is an effecive workaround.)
However, in many of problem reports, small tcp.sendbuffer was actually not effecive, and if "retransmission of lost packets" is cause of this bug, this bug should still occur even after change by bug 674147.
So, I now confidently guess that this bug was caused by threading related issues in SSL(+Socket+NSPR). I guess that "remained excess/non-productive works in SSL thread which has higher dispatching priority than main thread" is successfully killed by combination of new psm::InitializeSSLServerCertVerificationThreads(); and new psm::StopSSLServerCertVerificationThreads();
I can confirm that latest earlybird (12a2) solves the 100% CPU usage during send or copying to the sent folder with SSL/TLS on. Great, no problems with sending now. Now, it is only about waiting this version becomes stable and official.
In my environment no bug in firmware that causes this CPU hog is not probably the case, because we moved our company to different location and build the new network infrastructure with absolutely nothing same as an older place and nothing change - in TB 10 still 100%CPU hogging during SSL/TLS send.
Whiteboard: [fixed by bug 674147]
Target Milestone: --- → Thunderbird 11.0
No longer blocks: 577545
Going to mark this one as fixed. Please feel free to reopen or file new bugs if you still see this kinds of issues.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I can confirm that latest earlybird (12a2) solves the 100% CPU usage during send or copying to the sent folder with SSL/TLS on. Great, no problems with sending now. Now, it is only about waiting this version becomes stable and official.
Thanks for all that confirmed that the issue is fixed.

When a bug is fixed indirectly through the fix for another bug, the convention to use is to mark the bug WORKSFORME and put the other bug (with the fix) in the Depends field, like I have done along with this comment.
Depends on: 674147
Resolution: FIXED → WORKSFORME
Confirmed for TB 13.0.1, too. Two years after the bug was introduced :-/
No longer blocks: 572397
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: