100% CPU while sending (SMTP with SSL encryption)

RESOLVED DUPLICATE of bug 538283

Status

MailNews Core
Networking: SMTP
--
major
RESOLVED DUPLICATE of bug 538283
8 years ago
6 years ago

People

(Reporter: webforen, Unassigned)

Tracking

1.9.1 Branch
x86
Windows XP

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 7 obsolete attachments)

(Reporter)

Description

8 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.10) Gecko/20100512 Thunderbird/3.0.5

Thunderbird 3.0.5 consumes 100% CPU while sending a message with an attachment over SMTP with SSL encryption.

The 100% CPU consumption does NOT occur while preparing/attaching the attachment, nor while saving the message to the Sent folder, but only while the message is being transmitted to the SMTP server.

As a test I used a simple message with a 1 MB attachment.

I tried with a new profile as well as with safe mode.

When I temporarily switch off encryption (on my own SMTP server) the problem seems to be gone.

As far as I remember the problem did not occur with Thunderbird 2.

The SMTP server is exim4 on a Debian lenny (stable) system.

Reproducible: Always

Steps to Reproduce:
1. Create a new profile. Set up an SMTP server with encryption (STARTTLS).
2. Create a new message and attach a binary attachment of e.g. 4 MB.
3. Send the message.
Actual Results:  
CPU usage goes to 100%. With large attachments the message won't be transmitted at all. With a fast connection and small attachments, sending simply takes a long time.

Expected Results:  
Send the message WITHOUT consuming 100% CPU. A (single core) machine is almost unusable while sending messages with attachments with Thunderbird.

Whenever I need to send a message with an attachment I have to use webmail, rendering Thunderbird unusable.

On request I can also provide a test account on an SMTP server.
Severity: critical → major
Component: General → Networking: SMTP
Product: Thunderbird → MailNews Core
QA Contact: general → networking.smtp
Version: unspecified → 1.9.1 Branch
When did your problem start to occur? (After Tb's upgrade from which version to which, after installing an add-on, suddenly while you were using Tb 3.0.5, ...)

If it started to occur after the upgrade to Tb 3.0.x from Tb 2.0.x or earlier, it may be the same problem as bug 538283. We suspect an issue similar to bug 541367 and bug 566548 in bug 538283. Please read through bug 566548.
Can you check with network.tcp.sendbuffer=65536 (or 32768), followed by a restart of Tb?
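For reference, this pref can be changed in the Config Editor (about:config). A minimal sketch, assuming the standard Mozilla user.js pref syntax, would be to add a line like the following to the profile's user.js before starting Tb:

  user_pref("network.tcp.sendbuffer", 65536);  // or 32768; restart Tb afterwards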
(Reporter)

Comment 2

8 years ago
Hi Wada,

Thank you for your reply!

As far as I remember it started to occur after I upgraded to TB 3. But, on the other hand, the problem persists in a completely new profile (without any add-ons).

I read through the bugs. Unfortunately bug 566548 does not resolve my problem; when I change network.tcp.sendbuffer the problem stays the same. And, as already mentioned, the problem seems to be related only to TLS, which is not the case in bug 566548 (rather than bug 538283...).

However, in that bug I read something interesting: I had IPv6 enabled on my WinXP machine. I completely uninstalled it and the behaviour changed a little bit. Obviously it also has something to do with IPv6...

Before: 100% CPU while sending ... most of the time it started to hang at around 4% and the message could not be delivered.

Now (without IPv6): very high CPU load (~75% for TB) but not 100% any more. The message is actually delivered.

Now I tried the same with a new profile (i.e. with the default network.tcp.sendbuffer=131072) but it is the same. So I guess this value has nothing to do with this bug.

Two other pieces of information:
1.) The problem happens in different locations: (a) at home, directly connecting to my mail server, (b) at the university, (c) at different ISPs connecting to different servers (mobile over UMTS, DSL, ...). So I conclude the problem is not a router issue.

2.) The problem happens with different mail servers: my own (exim4 on Debian lenny), the one at the university (sendmail on SUN), and the GMX server (mail.gmx.net). All with SSL/TLS.

Is there anything else I could do? Is there any information I could provide?

Thank you!
Does network.dns.disableIPv6=true affect your problem? AFAIR, the default of this pref was changed to false starting with Tb 3.0.
(Reporter)

Comment 4

8 years ago
Dear Wada,

I have bad news: maybe the problem is not related to IPv6 after all. Today I tried again (with a new profile) and I have the same issue, even though IPv6 is still uninstalled on my system.

I also tried with network.dns.disableIPv6=true but no change.
As seen in the performance counter log I attached to bug 559676 (set in the dependency tree for bug 538283) and in comments by other people, CPU usage for productive upload work over SSL is ordinarily 10% to 20%.

Do you have the virus scanner of anti-virus software enabled for outgoing mail?

As current "virus scanners for outgoing mail" utilize port scanning, virus scanning may be done within Tb's process during data sending, and it may consume a lot of CPU power.

If it is an SMTP-proxy-type scanner, a problem in the scanner (buffer overrun etc.) may cause Tb to perform many re-transmissions continuously. That may increase Tb's CPU usage.
Can you check with Gmail's SMTP server? As Gmail's SMTP requires SSL, and as I can also test with Gmail's SMTP, a comparison of NSPR log data in two different environments is possible.
(Reporter)

Comment 7

8 years ago
Hi Wada,

No, I have no antivirus, proxy, etc. installed.

I have now tried with Gmail's SMTP server (smtp.gmail.com, isn't it?). I have the same issue there. Please just tell me how to produce the debugging output you need.

Regards
Please get a log with the following parameters:
> https://wiki.mozilla.org/MailNews:Logging
> https://developer.mozilla.org/en/HTTP_Logging
> SET NSPR_LOG_MODULES=timestamp,smtp:5,nsHostResolver:5,nsSocketTransport:5
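
For completeness, a full capture setup on Windows would also point NSPR_LOG_FILE at a file on a disk with enough free space; a sketch (the log file path and install path below are only examples, not prescribed values):

> SET NSPR_LOG_MODULES=timestamp,smtp:5,nsHostResolver:5,nsSocketTransport:5
> SET NSPR_LOG_FILE=C:\temp\tb-smtp.log
> "C:\Program Files\Mozilla Thunderbird\thunderbird.exe"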

As a 4K buffer is used, consecutive socket-level requests of "PR_Write of 4096 bytes" by Tb or SSL are seen in the log. If no IP-level or TCP-level delay/re-transmission happens, the consecutive 4096-byte send requests by Tb/SSL are smoothly processed by the TCP/IP layer and are logged in the NSPR log. If delay/re-transmission happens, the "send request of 4096" is rejected by the lower layer and a return code of -1 (send request rejected because there is no free buffer) is seen in the log.
As huge log data is generated by nsSocketTransport:5, checking the above by hand is hard. So I wrote a simple script to check the responses from the lower-level layer to the consecutive send (write) requests for 4096 bytes.
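The script itself is not attached; a minimal Python sketch of that kind of check (it assumes only the "calling PR_Write [count=N]" / "PR_Write returned [n=M]" line formats visible in this bug's log excerpts) could look like this:

import re, sys

CALL = re.compile(r"calling PR_Write \[count=(\d+)\]")
RET  = re.compile(r"PR_Write returned \[n=(-?\d+)\]")

pending = None           # size of the last PR_Write call not yet answered
accepted = rejected = 0  # 4096-byte requests accepted vs. rejected (n=-1)

for line in open(sys.argv[1], errors="replace"):
    m = CALL.search(line)
    if m:
        pending = int(m.group(1))
        continue
    m = RET.search(line)
    if m and pending == 4096:
        if int(m.group(1)) == -1:
            rejected += 1    # lower layer reported no free buffer
        else:
            accepted += 1
        pending = None

print("4096-byte writes accepted:", accepted, "rejected (n=-1):", rejected)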

Please attach the log file to this bug, together with the size of the attached file. I'll try to send a similar-size file using Gmail's SMTP and get a log for comparison.
(Reporter)

Comment 9

8 years ago
Hi,

Thank you. The log really is huge; it filled my hard disk within a few seconds. As far as I have seen, no raw message data or passwords are included. I hope so ;-)

You can ignore the first 5 MB. They are from starting up Thunderbird.

Afterwards I sent a message with an attachment of 3117606 bytes via Gmail. The CPU went to 100% and I cancelled at 7% or so.

The logfile is 70 MB. I compressed it with bzip2, so it is now about 400 kB; a more reasonable size. I hope this is OK for you.

If I have done anything wrong, please let me know.

Regards
(Reporter)

Comment 10

8 years ago
Created attachment 457286 [details]
bzip2 compressed logfile of sending mail+attachment via gmail
Created attachment 457494 [details]
Summary data of  calling PR_Write [count=N]/PR_Write returned [n=M] and STS poll iter [X]

Summary data of the following log patterns:
(A) Successful send request
  calling PR_Write [count=4096], PR_Write returned [n=4096] => Type=11
  calling PR_Write [count=N],    PR_Write returned [n=N]    => Type=21
(B) Error return code ([n=-1]) to a send request
  calling PR_Write [count=4096], PR_Write returned [n=-1]   => Type=13
  calling PR_Write [count=N],    PR_Write returned [n=-1]   => Type=23
(C) STS poll iter [0], or STS poll iter [1]                 => Type=51
    Consecutive "STS poll iter [X]" entries are merged into a single Type=51,
    shown like STS_poll_iter[0,0,1,0, ... ,1,1,0]
(D) Only the above logs after the first "calling PR_Write [count=4096]" are picked up.

Some patterns are merged into a single data line to reduce the number of lines.
(E) Consecutive Type=13 and Type=51 are merged into Type=13+51
(F) The pattern Type=11, {Type=13+51}*N (where N=1 to 3)
    is merged into Type=11+13+51+ ... +13+51


(1) At 10:34:36 : 18 * ( accepted send request for 4096 bytes )  
> 2010-07-14 10:34:36.326000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.336000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.336000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.346000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.346000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.346000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.356000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.356000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.366000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.366000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.366000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.366000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.376000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.376000 UTC - @type=11,[count=4096]->[n=4096]
> 2010-07-14 10:34:36.426000 UTC - @type=11+13+51+13+51+13+51,[count=4096]->[n=4096]->[count=4096]->[n=-1]->STS_poll_iter[0,0,0,1,1]->[count=4096]->[n=-1]->STS_poll_iter[0,0,0,1,1]->[count=4096]->[n=-1]->STS_poll_iter[0,0]

(2) At 10:34:44 : 1 * ( accepted send request for 4096 bytes )  
> 2010-07-14 10:34:44.107000 UTC - @type=11+13+51+13+51+13+51,[count=4096]->[n=4096]->[count=4096]->[n=-1]->STS_poll_iter[0,0,0,1,1]->[count=4096]->[n=-1]->STS_poll_iter[0,0,0,1,1]->[count=4096]->[n=-1]->STS_poll_iter[0,0]

(3) At 10:34:45 : 4 * ( accepted send request for 4096 bytes )  
> 2010-07-14 10:34:45.169000 UTC - @type=11+13+51+13+51+13+51,[count=4096]->[n=4096]->[count=4096]->[n=-1]->STS_poll_iter[0,0]->[count=4096]->[n=-1]->STS_poll_iter[0,0,0,0,1]->[count=4096]->[n=-1]->STS_poll_iter[0,0]
> 2010-07-14 10:34:45.559000 UTC - @type=11+13+51+13+51+13+51,[count=4096]->[n=4096]->[count=4096]->[n=-1]->STS_poll_iter[0,0]->[count=4096]->[n=-1]->STS_poll_iter[0,0,0,0,1]->[count=4096]->[n=-1]->STS_poll_iter[0,0]
> 2010-07-14 10:34:45.689000 UTC - @type=11+13+51+13+51+13+51,[count=4096]->[n=4096]->[count=4096]->[n=-1]->STS_poll_iter[0,0]->[count=4096]->[n=-1]->STS_poll_iter[0,0,0,0,1]->[count=4096]->[n=-1]->STS_poll_iter[0,0]
> 2010-07-14 10:34:45.910000 UTC - @type=11+13+51+13+51+13+51,[count=4096]->[n=4096]->[count=4096]->[n=-1]->STS_poll_iter[0,0,0,1]->[count=4096]->[n=-1]->STS_poll_iter[0,0,0,1]->[count=4096]->[n=-1]->STS_poll_iter[0,0]

(4) After that, Type=21 entries (calling PR_Write [count=N], PR_Write returned [n=N],
    where N=9, 11, 27 etc., i.e. accepted send requests for N bytes) are seen.

As only a few Type=21 lines appear per second, the TCP layer's send buffer is freed at a rate of a_few*(10 to 27 bytes)/sec. For the TCP socket user, that means the effective send speed is a_few*(10 to 27 bytes)/sec.
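(A rough order-of-magnitude estimate, assuming for illustration about 3 such accepted small writes per second at roughly 20 bytes each, i.e. around 60 bytes/sec: with most of the 3117606-byte message still unsent, finishing the transmission at that rate would take on the order of 50000 seconds, i.e. many hours, which matches "with large attachments the message won't be transmitted".)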

(5) Last part at 10:38:21.
> 2010-07-14 10:38:21.550000 UTC - @type=51,STS_poll_iter[0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,1,0,0]

For the TCP socket user (Thunderbird's SSL module), the behaviour in (4) above is similar to slow data transmission caused by frequent re-transmission of IP packets due to IP packet loss.
However, the required data transmission time is the same between non-SSL and SSL.

A problem in the OS's TCP modules, or in the TCP setup of the OS?
A problem in the OS's SSL-related modules?
Does Tb lose some notifications from the TCP layer?
Is the repeated STS_poll_iter[X] retry when there is no free TCP buffer the cause?
(no wait before the next iteration, hence the CPU hog?)
Created attachment 457498 [details]
Summary data of NSPR log for mail send via Gmail's SMTP from my PC
Created attachment 457500 [details]
Summary data of NSPR log for mail send via non-SSL SMTP from my PC

As seen in the summary data, consecutive Type=11 entries (consecutive accepted send requests of 4096 bytes) are seen with non-SSL SMTP.
The SSL code pointed to in bug 559676 comment #20 is one of the reasons why STS_poll_iter is always logged after an accepted 4096-byte send request when SSL is used. To avoid a long wait in other SSL-related threads, PR_Sleep(PR_INTERVAL_NO_WAIT); is issued after each 4096-byte send by the SSL code.
How long did it take to send the 3117606 bytes via Gmail?
Can you get a log with SET NSPR_LOG_MODULES=timestamp,smtp:5 only?
The timestamps of the following log lines are sufficient to estimate the transmission speed.
> SMTP Send: DATA  
> SMTP Response: 354  Go ahead l29sm5566874rvb.16 
> SMTP Send: .  
> SMTP Response: 250 2.0.0 OK 1276588432 l29sm5566874rvb.16 

AFAIR, I sent a 2 MB attachment when I got the following log.
> 00001214 16:53:31.327 [2552] 0[182c140]: SMTP Response: 354  Go ahead l29sm5566874rvb.16
> 00043724 16:53:44.077 [2552] 0[182c140]: SMTP Send: .  
> 00054524 16:53:52.327 [2552] 0[182c140]: SMTP Response: 250 2.0.0 OK 1276588432 l29sm5566874rvb.16
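(A worked example from these timestamps: the message body went out between the "354 Go ahead" at 16:53:31.327 and the "SMTP Send: ." at 16:53:44.077, i.e. in about 12.75 seconds; for a roughly 2 MB attachment that is on the order of 160 KB/s. The final "250 OK" arrives about 8 seconds later. The same two timestamps in your smtp:5 log give the corresponding figure for your environment.)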
Created attachment 457767 [details]
New summary data, this bug's case with Gmail's SMTP, PR_Write and PR_Read

To see the activity after n=-1 for a 4096-byte send request (during the many STS_poll_iter entries, Type=51 in the previous summary), I additionally checked the log for PR_Read.
  calling PR_Read [count=4096], PR_Read returned [n=-1] => Type=13RD

The summary data corresponds to:
(A) Type=11/Type=21, successful send request
    Type=11 : NNNN=4096, Type=21 : 0<NNNN<4096
      nsSocketOutputStream::Write [this=abcdefgh count=NNNN]
        calling PR_Write [count=NNNN]
        PR_Write returned [n=NNNN]
      nsSocketTransport::SendStatus [this=abcdefgh status=xxxxxxxx]
(B) Type=13/Type=23, error return code ([n=-1]) to a Write request
    Type=13 : NNNN=4096, Type=23 : 0<NNNN<4096
      nsSocketOutputStream::Write [this=abcdefgh count=NNNN]
        calling PR_Write [count=NNNN]
        PR_Write returned [n=-1]
      nsSocketInputStream::AsyncWait [this=abcdefgh]
(C) Type=13RD, error return code ([n=-1]) to a Read request
      nsSocketInputStream::Read [this=abcdefgh count=4096]
        calling PR_Read [count=4096]
        PR_Read returned [n=-1]
      nsSocketInputStream::AsyncWait [this=abcdefgh]
(D) Type=51, STS poll iter [0], or STS poll iter [1]
    Consecutive "STS poll iter [X]" entries are merged into a single Type=51,
    shown like STS_poll_iter[0,0,1,0, ... ,1,1,0]
(E) Only the above logs after the first "calling PR_Write [count=4096]" are picked up.
(F) Some patterns are merged into a single data line, such as type=13+51+13RD+51.

(1) Your log with Gmail's SMTP.
Continuous (10 to 20) send requests for 4096 bytes are accepted.
> 2010-07-14 10:34:36.396000 UTC - @type=11+13+51+13RD+51,[count=4096]->[n=4096]->[count=4096]->[n=-1]->STS_poll_iter[0]->
After that, the pattern type=13+51+13RD+51 is repeated until the next successful send request (9 to 27 bytes).
> 2010-07-14 10:34:36.436000 UTC - @type=13+51+13RD+51,[count=4096]->[n=-1]->STS_poll_iter[0]->Read[count=4096]->[n=-1]->STS_poll_iter[0,0,1,1]
> 2010-07-14 10:34:36.436000 UTC - @type=13+51+13RD+51,[count=4096]->[n=-1]->STS_poll_iter[0]->Read[count=4096]->[n=-1]->STS_poll_iter[0]

(2) My log with Gmail's SMTP.
The pattern 11+13+51+13RD+51 continues, although it is sometimes disturbed.
The 4 KB of buffer data is smoothly sent and free TCP buffer becomes available soon.

(3) My log with non-SSL SMTP.
The pattern 13RD (PR_Read for 4096, and n=-1) is not seen.

The flood of the pattern 13+51+13RD+51 is probably a reason for the high CPU utilization with SSL, and it may be the cause of the 100% CPU in your environment.
The many 13RD entries (PR_Read for 4096, n=-1, and AsyncWait) with SSL may be the "busy wait" mentioned in bug 559676.
Created attachment 457768 [details]
New summary data, my case with Gmail's SMTP, PR_Write and PR_Read
Created attachment 457769 [details]
New summary data, my case with non-SSL SMTP, PR_Write and PR_Read
Attachment #457498 - Attachment is obsolete: true
Attachment #457500 - Attachment is obsolete: true
If too many PR_Read calls are the cause (it looks like there is one PR_Read per accepted 4 KB of data or per failed send request), a very small network.tcp.sendbuffer may alter the phenomenon.
What will happen with network.tcp.sendbuffer=4096 or 8192?
(Please restart Tb after the setting change to avoid unwanted problems.)
Created attachment 457782 [details]
New summary data(last part), this bug's case with Gmail's SMTP, PR_Write and PR_Read

The previous attachment was the first 1000 lines of the summary data. This file is the last part of the summary data, in which a send request is seen again (for 9 bytes).
Next is the last type=13+51+13RD+51 record and the data that follows it.
> 2010-07-14 10:34:49.174000 UTC - @type=13+51+13RD+51,[count=4096]->[n=-1]->STS_poll_iter[0]->Read[count=4096]->[n=-1]->STS_poll_iter[0,0]
> 2010-07-14 10:38:10.975000 UTC - @type=13RD+51,Read[count=4096]->[n=-1]->STS_poll_iter[0,0,1]
> 2010-07-14 10:38:10.975000 UTC - @type=12RD,Read[count=4096]->[n=17]
> 2010-07-14 10:38:10.975000 UTC - @type=23RD,Read[count=4079]->[n=-1]
> 2010-07-14 10:38:10.975000 UTC - @type=51,STS_poll_iter[0,0,1]
> 2010-07-14 10:38:10.975000 UTC - @type=23RD,Read[count=4079]->[n=-1]
> 2010-07-14 10:38:10.975000 UTC - @type=51,STS_poll_iter[0,0,1,0,0,0,1]
> 2010-07-14 10:38:10.975000 UTC - @type=23+51,[count=6]->[n=-1]->STS_poll_iter[0,0,1]
> 2010-07-14 10:38:10.975000 UTC - @type=21,[count=6]->[n=6]

After the last type=13+51+13RD+51 at 10:34:49.174000, the next event is the one at 10:38:10.975000. There is no activity for 3 minutes and 20 seconds.
This "3 minutes and 20 seconds" corresponds to the following NSPR log lines.
> 2010-07-14 10:34:49.174000 UTC - 3544[1829780]: STS poll iter [0]
> 2010-07-14 10:34:49.174000 UTC - 3544[1829780]:   active [5] { handler=479b6e0 condition=0 pollflags=7 }
> 2010-07-14 10:34:49.174000 UTC - 3544[1829780]:   active [4] { handler=3759520 condition=0 pollflags=5 }
> 2010-07-14 10:34:49.174000 UTC - 3544[1829780]:   active [3] { handler=25f9ac0 condition=0 pollflags=5 }
> 2010-07-14 10:34:49.174000 UTC - 3544[1829780]:   active [2] { handler=479b520 condition=0 pollflags=5 }
> 2010-07-14 10:34:49.174000 UTC - 3544[1829780]:   active [1] { handler=5253a60 condition=0 pollflags=5 }
> 2010-07-14 10:34:49.174000 UTC - 3544[1829780]:   active [0] { handler=25afda0 condition=0 pollflags=5 }
> 2010-07-14 10:34:49.174000 UTC - 3544[1829780]:   calling PR_Poll [active=6 idle=1]
> 2010-07-14 10:38:10.965000 UTC - 3544[1829780]:     ...returned after 2607 milliseconds
> 2010-07-14 10:38:10.975000 UTC - 3544[1829780]: nsSocketTransport::OnSocketReady [this=479b520 outFlags=1]
> 2010-07-14 10:38:10.975000 UTC - 3544[1829780]: nsSocketInputStream::OnSocketReady [this=479b5b8 cond=0]
> 2010-07-14 10:38:10.975000 UTC - 3544[1829780]: STS dispatch [1dd3ac8]
> 2010-07-14 10:38:10.975000 UTC - 3544[1829780]: nsSocketInputStream::Read [this=479b5b8 count=4096]
> 2010-07-14 10:38:10.975000 UTC - 3544[1829780]:   calling PR_Read [count=4096]
> 2010-07-14 10:38:10.975000 UTC - 3544[1829780]:   PR_Read returned [n=-1]
> 2010-07-14 10:38:10.975000 UTC - 3544[1829780]: nsSocketInputStream::AsyncWait [this=479b5b8]

For Tb, the above log means that TCP/IP successfully sent only 6 bytes to the server during those 3 minutes and 20 seconds.
Did the 100% CPU occur even during this 3-minutes-and-20-seconds period of no Tb activity?
Does nsSocketTransport use spin-loop-like logic for waiting for data or an event?
(Reporter)

Comment 21

8 years ago
Dear Wada,

Sorry for my long absence, I was on holidays.

Concerning your first question: No, network.tcp.sendbuffer=4096 does not change anything (as already mentioned).

Concerning the second question: I do not fully understand. Yes, I think it occurred during the "no activity" period.

Now I again started with:
NSPR_LOG_MODULES=timestamp,smtp:5
NSPR_LOG_FILE=d:\tb.log

I tried to send a file of 108762 bytes. The first part up to approx. 80% went "fast", then the CPU started running at 100% and nothing worked any more. This is the last log entry:

> 2010-08-06 17:58:16.198000 UTC - 0[182a140]: SMTP entering state: 0
> 2010-08-06 17:58:16.198000 UTC - 0[182a140]: SMTP Response: 354  Go ahead a48sm2703456eei.12
> 2010-08-06 17:58:16.198000 UTC - 0[182a140]: SMTP entering state: 7
> 2010-08-06 17:58:16.198000 UTC - 0[182a140]: SMTP entering state: 8

As you can see, it is 17:58:16. I canceled the message at 18:03 at 83%.

And, by the way, I now encounter the same issue when the message is moved to the Drafts folder (IMAP/SSL).

Please tell me what I should do next. Create a full log again?
(Reporter)

Comment 22

8 years ago
OK, once again: I cannot upload this logfile because of the 6 MB limit. The logfile is about 2 GB (!), so it is compressed with bzip2 (8 MB).

http://dl.dropbox.com/u/9844683/tb.log.bzip2

The sending process starts at approximately 18:10:10. At this time the CPU started working at 100%. The sending over SMTP ended at about 18:10:40. Afterwards the mail was moved to the Sent folder. I terminated the process at 18:14:50. During these almost 5 minutes the CPU was always at 100% and the logfile was filled with 2 GB of data.
(In reply to comment #21)
> I tried to send a file with 108762 Bytes. The first part up to approx. 80% went "fast",
> then the CPU started consuming 100% and nothing worked any more.

It sounds like a variant of the phenomenon in bug 541367 or bug 566548.
  - As network.tcp.sendbuffer=128KB, almost all mail data is passed to the
    TCP layer when a TCP window size of 128KB is advertised by the server.
    So the TCP/IP layer transmits all packets for the 128KB of data.
  - Due to a router bug, transmission fails frequently.
    The TCP/IP layer (the network module of the OS on your PC) has to retransmit
    data, and 128KB/2 (on average) is retransmitted again, because the TCP layer
    already has 128KB of data to be transmitted.
    Resends of some packets are successful, so connection loss doesn't occur,
    but a CPU hog due to re-transmission by the TCP/IP layer does occur.

> Any, by the way, now I encountered the same issue when the message is moved to
> drafts folder (IMAP/SSL).

As IMAP logging records the mail data, the log becomes huge, which makes log analysis very hard. Let's focus on the simple SMTP case first.

> Concerning your first question:
> No, network.tcp.sendbuffer=4096 does not change anything (as already mentioned).

Did you restart Tb after changing the network.tcp.sendbuffer value?
If network.tcp.sendbuffer=4096, only 4096 bytes are passed to the TCP layer, and the next 4096 bytes are not passed to the TCP layer by Tb until transmission of the previous 4096 bytes by TCP (and IP) is completed.
So the phenomenon should merely be "it takes long", as I wrote in bug 538283 comment #62.
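(As an illustration only, assuming for example a round-trip time of about 100 ms: a strictly serialized 4096-byte send window limits throughput to roughly 4096 bytes per round trip, i.e. about 40 KB/s, so a 1.4MB mail would take on the order of 35 to 40 seconds. Slow, but it should not hog the CPU.)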

> The sending process starts approximately at 18:10:10. At this time the CPU
> started working at 100%.
> The sending over SMTP ended at about 18:10:40.

It seems that "send of 108762 Bytes attachment(150KB mail?)" took 30 seconds if no SMTP logging. 
Sent mail data size is written in SIZE=???? of "SMTP Send: MAIL FROM" log line.

Can you check with different mail data size(such as 4KB, 8KB, 16KB, and 128KB, 256KB), with network.tcp.sendbuffer=4096 after restart of Tb?
Transmission time is linear to mail data size? CPU 100% of Tb's process is seen with small mail dat size too?
(Reporter)

Comment 24

8 years ago
Dear Wada,

My logfile, as well as what I have written, was produced with network.tcp.sendbuffer=4096. I also restarted Thunderbird afterwards.

Also, the problem appears on *all* networks, including the university etc. with non-SOHO equipment and expensive Cisco routers. So it cannot be a bug in the router or the network.

Also, it does not only "take long". The CPU is at 100% all the time and nothing works any more.

One more thing: the transmission sometimes seems to become exponentially slower: the first part is fast, and afterwards there is no movement on the progress bar any more.

I am not at home right now but I will try with different payload sizes. Should I also record logs? At which level?
(In reply to comment #24)
> Also the problem appears with *all* networks including university etc.
> with non SOHO equipment but expensive Cisco Routers.
> So it can not be a bug in the router or the network.

Not only the router attached to your PC and/or the SMTP server: all routers between your PC and the SMTP server are relevant if it is a router-related problem. And a problem in only one router between your PC and the SMTP server can produce a problem like bug 541367 or bug 566548.

> Should I also record logs? Which level?

An NSPR log of timestamp,smtp:5 with network.tcp.sendbuffer=4096 is sufficient for a first check. Please watch the CPU utilization of thunderbird.exe in Task Manager.
As TCP/IP can only send packets for 4096 bytes continuously before receiving an ACK, the transmission time should be linear in the mail data size (4KB, 8KB, 16KB, ..., 128KB), if no severe network delay or packet loss exists.
Please check the log lines I described in bug 538283 comment #62 to see the mail data size and to estimate the transmission time.
(Note: in my environment, sending a 1.4MB mail to Gmail's SMTP took 50 seconds with network.tcp.sendbuffer=4096.)
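(For reference, 1.4MB in 50 seconds is roughly 28 KB/s; if the transmission time really is linear in the mail size with network.tcp.sendbuffer=4096, a ~150KB mail should take on the order of 5 to 6 seconds rather than minutes.)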
Setting a dependency on bug 538283 for ease of tracking and searching.
Depends on: 538283
Created attachment 498048 [details]
New summary: calling PR_Write&PR_Read [count=N]/PR_Write&PR_Read returned [n=M], STS poll iter [X]/calling PR_Poll [active=A idle=B]/timeout = MM milliseconds/...returned after NN milliseconds

Log line of "calling PR_Poll [active=A idle=B]", "timeout = MM milliseconds", "...returned after after NN milliseconds", are also checkled, and data of them are merged into sumary line for "STS poll iter [X]".
And, continuous summary lines for "4KB send request error(due to no free send buffer) and followed retries of it)" was merged into single summary line for eas of read.
A "send error and works after the error" corresponds to next summary lines.
  [  one type=13(4KB send is rejected due to no free send buffer)
   + several type=51 data(PR_Poll is utilized for sleep)
   + some { type=13RD(4KB receive error) + several type=51(sleep) } ]
If this pattern of "send error and works after the error" are continously logged, they are merged into single type=13XX line.
Attachment #457494 - Attachment is obsolete: true
Attachment #457767 - Attachment is obsolete: true
Attachment #457768 - Attachment is obsolete: true
Attachment #457769 - Attachment is obsolete: true
A brief summary of your NSPR log.

(1) The successful "4KB send" requests are only the following 37 (148KB).
    There is no send request for less than 4KB of data between these log lines.

> 2010-07-14 10:34:36.326000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.336000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.336000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.346000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.346000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.346000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.356000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.356000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.366000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.366000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.366000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.366000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.376000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.376000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.396000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.406000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.406000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.406000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.406000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.406000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.406000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.416000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.416000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.416000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.416000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.416000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.416000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.426000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.426000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.426000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.426000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:36.426000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:44.107000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:45.169000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:45.559000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:45.689000 UTC - @type=11,Write[count=4096]->[n=4096]
> 2010-07-14 10:34:45.910000 UTC - @type=11,Write[count=4096]->[n=4096]


(2) After the last successful "4KB send", the pattern is as follows.

> 2010-07-14 10:34:45.910000 UTC - @type=13,Write[count=4096]->[n=-1]
> 2010-07-14 10:34:45.910000 UTC - @type=51,STS_poll_iter[0],PR_Poll[active=6/idle=1],timeout[0],next_log=...returned[0]
> 2010-07-14 10:34:45.910000 UTC - @type=13RD,Read[count=4096]->[n=-1]
> 2010-07-14 10:34:45.910000 UTC - @type=51,STS_poll_iter[0],PR_Poll[active=6/idle=1],timeout[0],next_log=...returned[0]
> 2010-07-14 10:34:45.910000 UTC - @type=51,STS_poll_iter[0],PR_Poll[active=6/idle=1],timeout[0],next_log=...returned[0]
> 2010-07-14 10:34:45.910000 UTC - @type=51,STS_poll_iter[1],PR_Poll[active=6/idle=1],timeout[100000],next_log=...returned[1]

> 2010-07-14 10:34:45.910000 UTC - @type=13XX,<4302>-repeats-of-13/51/13RD/51-pattern,type=13(4302),type=51(12912),type=13RD(2870)

  (2-1) 4302 tries of "4KB send" (type=13) between 10:34:45.910000 and 10:34:49.174000 (3.264 seconds),
        i.e. roughly one retry of "4KB send" every 0.76 milliseconds (about 1300 per second).
        I think that is too aggressive; retrying about once per millisecond when there is "no free send buffer" is too frequent.
        I guess the "...returned after 0 milliseconds" in the following sequence is relevant.
          STS poll iter [1]
            active [0] { handler=25bfa20 condition=0 pollflags=7 }
            calling PR_Poll [active=1 idle=0]
          poll timeout: 100
              timeout = 100000 milliseconds
              ...returned after 0 milliseconds
        As this pattern is also seen in my log (no 100% CPU problem),
        this is not the cause of the 100% CPU, although it does increase CPU utilization.

> 2010-07-14 10:34:49.174000 UTC - @type=13,Write[count=4096]->[n=-1]
> 2010-07-14 10:34:49.174000 UTC - @type=51,STS_poll_iter[0],PR_Poll[active=6/idle=1],timeout[0],next_log=...returned[0]
> 2010-07-14 10:34:49.174000 UTC - @type=13RD,Read[count=4096]->[n=-1]
> 2010-07-14 10:34:49.174000 UTC - @type=51,STS_poll_iter[0],PR_Poll[active=6/idle=1],timeout[0],next_log=...returned[0]
> 2010-07-14 10:34:49.174000 UTC - @type=54,STS_poll_iter[0]
> 2010-07-14 10:34:49.174000 UTC - @type=55,PR_Poll[active=6/idle=1]

(3) After that, there is no NSPR log output by Tb for three minutes.

> 2010-07-14 10:38:10.965000 UTC - @type=57,...returned[2607]

  (3-1) no "timeout MM milliseconds" log. Why?

> 2010-07-14 10:38:10.975000 UTC - @type=13RD,Read[count=4096]->[n=-1]
> 2010-07-14 10:38:10.975000 UTC - @type=51,STS_poll_iter[0],PR_Poll[active=5/idle=0],timeout[0],next_log=...returned[0]
> 2010-07-14 10:38:10.975000 UTC - @type=51,STS_poll_iter[0],PR_Poll[active=5/idle=0],timeout[0],next_log=...returned[0]

> 2010-07-14 10:38:10.975000 UTC - @type=51,STS_poll_iter[1],PR_Poll[active=5/idle=0],timeout[65535000],next_log=...returned[1]

  (3-2) Why "timeout 65535000 milliseconds at here"?

(4) After the three minutes with no log output by Tb, a "4KB receive" is requested and 17 bytes are returned. A read of what, and what kind of data is returned?
    Other logs for "receive of less than 4KB" are seen.
    "SMTP Send: ." has not been written to the log by Tb, so unsent data still remains.
    Why a "receive operation"?

> 2010-07-14 10:38:10.975000 UTC - @type=12RD,Read[count=4096]->[n=17]
> 2010-07-14 10:38:10.975000 UTC - @type=23RD,Read[count=4079]->[n=-1]

(5) "Send request for less than 4KB" is also seen.
    Send buffer is partially freed?
    Or something like re-negotiation of SSL happened?

> 2010-07-14 10:38:11.185000 UTC - @type=23,Write[count=9]->[n=-1]
Attachment #457782 - Attachment is obsolete: true

Comment 29

7 years ago
 
I'm having the same 100% CPU usage problem on my system when sending large emails using SSL/TLS.
Thunderbird 3.1.10
Windows XP Pro (sp3)

The problem appeared after switching providers; the new provider uses an SSL/TLS connection for POP mail access.

I set up a Gmail account to test whether this is a provider-specific or application-specific problem. The 100% CPU usage problem also occurs with my Gmail account.


Thanks
Rob

Comment 30

7 years ago
In TB 5 the CPU hog is still present when using SSL/TLS, during sending and also while copying the message to the IMAP Sent folder... But there is no hog while copying messages manually between IMAP folders, or between a local folder and an IMAP folder...

Mbiba

Comment 31

7 years ago
And finally I can say that Windows 7 (64-bit) is not affected by this hog in the same network environment; it looks like this only happens in the Windows XP SP3 case... Unfortunately an upgrade to Win 7 is impossible in our company environment (and also in many other companies).

Mbiba
Some big changes were made in Thunderbird 11 regarding how SSL works, in particular bug 674147. It might be the case that this bug was fixed indirectly by the patches in bug 674147. I would love to hear from some users as to whether or not that is the case.

Further, if it isn't fixed, it is more likely now that we will be able to track down the cause as we've been doing work in this area recently.
(In reply to webforen from comment #24)
> My logfile as well what I have written was done with
> network.tcp.sendbuffer=4096. I also restarted Thunderbird afterwards.
> Also, it does not only "take long". The CPU is at 100% all the time and
> nothing works any more.

In bug 538283 comments #131 and #132, the following was reported:
- TB 10 still produces the problem.
- The latest Earlybird (12a2, with the patch from bug 674147 landed) solved the problem.
I think bug 538283 and your case are the same problem, so I'm closing this as a duplicate of bug 538283.
If your problem is not resolved by a recent trunk nightly build, please re-open this bug.
Status: UNCONFIRMED → RESOLVED
Last Resolved: 6 years ago
No longer depends on: 538283
Resolution: --- → DUPLICATE
Duplicate of bug: 538283