Failed to send because socket state is closed
Categories
(MailNews Core :: Networking: SMTP, defect)
Tracking
(thunderbird_esr128 affected, thunderbird_esr140 affected)
People
(Reporter: myaddons, Assigned: gds)
References
(Regression)
Details
(Keywords: regression)
Attachments
(1 file, 2 obsolete files)
A few weeks ago a user of my add-on Mail Merge reported an issue, that Mail Merge got stuck during the delivery of the individual messages. At the beginning the problem seemed to happen randomly.
Further analysis showed the problem is related to the fairly new reusing of the SMTP connection (Bug 136871) and the "5 second quit timer" (Bug 1864924). Here are the relevant SMTP log messages:
mailnews.smtp: Message sent successfully. SmtpClient.sys.mjs:1353:21
mailnews.smtp: Start 5 second QUIT timer SmtpClient.sys.mjs:1378:21
mailnews.smtp: C: QUIT SmtpClient.sys.mjs:677:19
mailnews.smtp: Reusing a connection SmtpClient.sys.mjs:127:19
mailnews.smtp: C: MAIL FROM:<XXX> BODY=8BITMIME SIZE=78691 SmtpClient.sys.mjs:677:19
mailnews.smtp: S: 221 sslproxy02.your-server.de closing connection
mailnews.smtp: Socket closed. SmtpClient.sys.mjs:573:17
mailnews.smtp: MAIL FROM successful, proceeding with 1 recipients SmtpClient.sys.mjs:1224:17
mailnews.smtp: Adding recipient... SmtpClient.sys.mjs:1229:17
mailnews.smtp: Failed to send "RCPT TO:<XXX>" because socket state is closed SmtpClient.sys.mjs:665:21
It looks to me like there is a race condition between closing the socket and reusing the connection. Although the socket is going to be closed, Thunderbird tries to reuse the connection without clearing the "5 second quit timer" or establishing a new connection.
On a successful delivery the "5 second QUIT timer" is either cleared:
mailnews.smtp: Message sent successfully. SmtpClient.sys.mjs:1353:21
mailnews.smtp: Start 5 second QUIT timer SmtpClient.sys.mjs:1378:21
mailnews.smtp: Reusing a connection SmtpClient.sys.mjs:127:19
mailnews.smtp: Clearing QUIT timer SmtpClient.sys.mjs:204:19
...
Or the socket is closed and a new connection is established:
mailnews.smtp: Message sent successfully. SmtpClient.sys.mjs:1353:21
mailnews.smtp: Start 5 second QUIT timer SmtpClient.sys.mjs:1378:21
mailnews.smtp: C: QUIT SmtpClient.sys.mjs:677:19
mailnews.smtp: S: 221 sslproxy02.your-server.de closing connection
mailnews.smtp: Socket closed. SmtpClient.sys.mjs:573:17
mailnews.smtp: Connection to mail.your-server.de closed SmtpClient.sys.mjs:183:19
mailnews.smtp: Connecting to smtp://mail.your-server.de:587 SmtpClient.sys.mjs:132:19
mailnews.smtp: Connected SmtpClient.sys.mjs:429:17
...
We tried to disable the reusing of the SMTP connection via the hidden preferences mail.smtpserver.default.max_cached_connections = 1 and mail.smtpserver.default.max_messages_per_connection = 1. This solves the problem!
| Reporter | ||
Comment 1•11 months ago
|
||
The problem seems to be triggered more easily and more reliably by a feature of Mail Merge: In the user interface Mail Merge offers the ability to add a delay between two consecutive messages; this delay can also be generated at random between 1 and X seconds.
Initially the user selected a random pause between 1 and 30 seconds in the user interface. So from time to time the delay between two consecutive messages will be approximately 5 seconds - just the right duration for the "5 second quit timer" to close the socket.
We tried a fixed pause of 30 seconds: This solves the problem!
| Assignee | ||
Comment 2•11 months ago
•
|
||
From comment 0:
We tried to disable the reusing of the SMTP connection via the hidden preferences mail.smtpserver.default.max_cached_connections = 1 and mail.smtpserver.default.max_messages_per_connection = 1. This solves the problem!
It's been over a year since I've looked at this code or the previous bug reports referenced above, but reading bug 1864924 comment 22, it looks like just setting max_cached_connections to 0 would also fix the problem for you. You can leave max_messages_per_connection at default of 10.
Edit: You are correct. But just setting max_messages_per_connection to 1 is enough; max_cached_connections is n/a.
This will cause each message sent to produce a connection, send the message and close the connection before the next message send is requested.
Of course, it should still work reliably without the pref change and I think I tried to anticipate the problem you report with the the property sendIsActive as described in bug 1864924 comment 21. I'll look closer at this again.
| Assignee | ||
Comment 3•11 months ago
|
||
WIP: I think something like this might fix it. Haven't yet tried running it.
Another simpler way might be to just immediately close the connection when the 5sec timer expires rather than sending QUIT, since QUIT does not immediately cause the connection to go away which allows the next message to start to be sent before the connection close is signaled by the server.
| Assignee | ||
Comment 4•10 months ago
|
||
Alexander,
Sorry for the long time to respond further to this.
Here's a try build (based on today's daily 5/18/25):
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=c5a612dff3afe47d05045645c9ecdb44bbfac56e
This should fix the case that you point out where the next messages is attempted to be sent right after the delayed by 5 second QUIT is sent but before the server responds to QUIT (which causes the connection to terminate). The problems was that the message being sent after QUIT was sent saw flags that the connection was still up (no QUIT response yet) so data was queued for sending. But the eventual QUIT response causes the connection to be terminated while the messages was being sent. The fix is to start a new connection when a new message is sent right after the delayed QUIT sent (when flag delayedQuitSent is seen in connect()).
Here's more explanation based on the failure sequence you provided (thanks!) in comment 0:
mailnews.smtp: Message sent successfully. SmtpClient.sys.mjs:1353:21
mailnews.smtp: Start 5 second QUIT timer SmtpClient.sys.mjs:1378:21
5 sec QUIT timer only started when reusing connection, which is true at this point
mailnews.smtp: C: QUIT SmtpClient.sys.mjs:677:19
No new message yet so QUIT sent after 5 second delay. flag delayedQuitSent set true
mailnews.smtp: Reusing a connection SmtpClient.sys.mjs:127:19
Now a new message is ready to be sent detected in connect() and socket is still open since QUIT response
has not yet occurred. Since delayedQuitSent is true, we immediately close the original connection and
open a new one.
mailnews.smtp: C: MAIL FROM:<XXX> BODY=8BITMIME SIZE=78691 SmtpClient.sys.mjs:677:19
Before this was sent with original (doomed) connection but is now sent with new connection
mailnews.smtp: S: 221 sslproxy02.your-server.de closing connection
mailnews.smtp: Socket closed. SmtpClient.sys.mjs:573:17
If this occurs now it is the original socket closed
mailnews.smtp: MAIL FROM successful, proceeding with 1 recipients SmtpClient.sys.mjs:1224:17
At some point, QUIT response occurred in the original code causing the socket to close and the
"Failed to send" error in _sendCommand().
mailnews.smtp: Adding recipient... SmtpClient.sys.mjs:1229:17
mailnews.smtp: Failed to send "RCPT TO:<XXX>" because socket state is closed SmtpClient.sys.mjs:665:21
Now "Failed to send" won't occur since we are sending on a new connection and a message should successfully
send.
If you can test the build with your Mail Merge test setup (with TB preference set back to default) it would be most helpful.
I'm not sure which platform(s) you are using so here are installers for the most common ones from the try build:
Win64: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/ap_Jz4z9QwKdfWBWdfpKqw/runs/0/artifacts/public/build/install/sea/target.installer.exe
Macosx64: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/eEsqRTWPQRGKExUKrTIn-Q/runs/0/artifacts/public/build/target.dmg
Linux64: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/XLokwdfDQ6O7UduSLrxZ-g/runs/0/artifacts/public/build/target.tar.xz
| Reporter | ||
Comment 5•10 months ago
|
||
Thank you Gene for taking the time to look into the problem! :-)
Is it possible to create another try build based on Thunderbird Release, i.e. Version 138? Unfortunately I haven't been able myself to reproduce the problem and the user I am in contact with uses Thunderbird 138 at the moment. If it is not too much trouble to create a try build I would like to avoid to switch him to Daily.
| Assignee | ||
Comment 6•10 months ago
|
||
Is it possible to create another try build based on Thunderbird Release, i.e. Version 138?
Yes, it is. I've started the build here but I guess the build system server is busy so it's taking a while:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=081ce99ec0619faf1d5aee62a539d1f30ccad1ed
| Assignee | ||
Comment 7•10 months ago
|
||
Here's direct links to the installers for patched 138esr:
138.0.1 Win64 installer: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/cuyHXJ_HS7GRsBi3SGcU0g/runs/0/artifacts/public/build/install/sea/target.installer.exe
138.0.1 Macosx64 installer: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VhQwyOIPTF-DheLUM2XXSw/runs/0/artifacts/public/build/target.dmg
138.0.1 Linux64: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/GijIg4M5Ra6pS-iyAc6fyg/runs/1/artifacts/public/build/target.tar.xz
Note: When this try build is run, Help | About will show this as "Daily" but the version will show as 138.0.1.
Updated•10 months ago
|
| Assignee | ||
Comment 8•10 months ago
|
||
In an email to me, reporter Alexander said:
I have done a test run with the Try Build based on 138.0.1 and stumbled over a lot of these new log messages:
Need new connection due to delayed QUIT timer
Actually I was expecting to never see this new kind of log message, because I wasn't able to reproduce the original problem in the first place. So I took a look at both your Patch and the original SmtpClient.sys.mjs. Here are a few observations:
- While I think the patch should do the trick - at the moment I am eagerly awaiting the test results by the user which may take a few days - I think the log message is shown far too often.
When the quit timer is activated (SmtpClient.sys.mjs#1357-1362) and the connection gets closed after the 5 second delay, then "delayedQuitSent" is still true - it apparently does not get reset to false immediately when the connection is closed.
When the next connection is opened via connect() (Patch#1.52) the "else" part will be executed (Patch#1.57), because the connection is "closed" and not "open". However, because "delayedQuitSent" has not been reset to false, the log message is always shown (Patch#1.65). And the closed connection gets closed again.
I added a check for "socketIsOpen" so the extraneous log message and close() won't occur when delayed QUIT actually closes the socket/connection (which is almost always the case when messages are composed and sent manually).
As I indicated I think "delayedQuitSent" needs to get reset to false, once the connection is closed - instead of reset to false at the end of connect() (Patch#1.87).
Since "delayedQuitSent" is only used in connect(), I think it is best to set it false there.
- I also wonder, whether the delay time, which is hardcoded to 5 seconds at the moment (Patch#1.153), can be configured via a (hidden) preference, e.g. "mail.smtpserver.default.delayedQuitTimer = 5". Like it is done for "mail.smtpserver.default.max_cached_connections " and "mail.smtpserver.default.max_messages_per_connection".
This would make it possible to experiment more easily in these kind of situations with yet another important setting affecting the connection lifetime.
I made this pref "quit_delay_ms" for now. As I mentioned in the email, Magnus preferred hard coding this, but said "not a command". Re: https://phabricator.services.mozilla.com/D196701?id=801611#inline-1092472
- I think the quit timer gets cleared too late in case the connection shall be reused - which still leaves the door open for yet another timing problem.
At the moment the quit timer gets cleared in useEnvelope() (SmtpClient.sys.mjs#202-207). However, the decision to reuse the connection has already been made in connect() (SmtpClient.sys.mjs#127). Therefore I think the quit timer should get cleared immediately afterwards.
Moved the clear of QUIT timer to start of connect(), no longer in useEnvelope(). (Wasn't sure where you meant by "immediately afterwards".)
- I think it would be better to handle "messagesPerConnection" (SmtpClient.sys.mjs#149-153) at the beginning of connect() instead of at the end. This would ensure it is handled, before onidle() is called. (Not sure whether "freed" should be moved at the top as well.)
I moved these to top of connect() before onIdle(). Moved the init of _freed too. Not sure that it matters but don't see any harm in doing that.
I need to do some testing and I'll make another "try" build.
| Assignee | ||
Comment 9•10 months ago
|
||
Here's the new "try" build based on comm-release repo: https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=36aa492ddea48ed6b199ba38a17a6fffba40bac9
This identifies as "daily/138.0.2" in Help | About
Direct links to
win64: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/bMtg7MbmSTCPDCkKc6parw/runs/0/artifacts/public/build/install/sea/target.installer.exe
osx64: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/cHQ13J4ZTrmi8Qe2B5yBOg/runs/0/artifacts/public/build/target.dmg
linux64: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UNDUC4r4S9CUtD4Bj2xoLA/runs/0/artifacts/public/build/target.tar.xz
| Assignee | ||
Comment 10•10 months ago
|
||
A sequence of messages can be sent on the same SMTP connection. After each is sent
a 5s timer is started and, if the next messages has not yet been started to be sent
when the timer expires, an smtp QUIT is sent telling the smtp server that we are done
sending on this connection and the server should terminate the connection. The problem
occurs when a new messages is sent after the QUIT is sent but before the
connection is completely closed. The message may at first be accepted by the socket
for sending but, while sending, the socket is closed causing the send to fail. Also,
any data that may be successfully sent to the server is rejected by the server
since the server has received the QUIT smtp command.
The fix is to set a flag right as QUIT is sent and for any new message seeing this
flag as true, a new SMTP connection is established for sending that message and
subsequent messages.
| Reporter | ||
Comment 11•10 months ago
|
||
I have done a quick test with 100 individual messages with a random pause between 0 and 10 seconds - directly sent via "Send Now" as "Deliver Mode" by my add-on Mail Merge: Everything still worked as expected and the (additional) log messages look just as they should. However, because I am still not able to reproduce the original problem myself, I never saw the new log message Need new connection due to delayed QUIT timer.
I have contacted the user again and asked him to run the test with the second try build. But I am still awaiting a reply and it may very well take a few more days. I will report back, whether the original problem is solved for him or not.
While thinking about the proposed patch a bit longer, I have one more suggestion:
I think the variable delayedQuitSent should be renamed to - for example - isQuitSent and setting delayedQuitSent = true (Line: 1394) should be moved to the beginning of the function quit() (Line: 190). This is the function where the command QUIT gets actually sent and the function is called from multiple places - including when the quit timer is fired (Line: 1395). This way the variable reflects the actual state of the command QUIT more precisely and works for all cases - including possible future use-cases.
| Assignee | ||
Comment 12•10 months ago
|
||
While thinking about the proposed patch a bit longer, I have one more suggestion:
I think the variable delayedQuitSent should be renamed to - for example - isQuitSent and setting delayedQuitSent = true (Line: 1394) should be moved to the beginning of the function quit() (Line: 190). This is the function where the command QUIT gets actually sent and the function is called from multiple places - including when the quit timer is fired (Line: 1395). This way the variable reflects the actual state of the command QUIT more precisely and works for all cases - including possible future use-cases.
I think your suggestion is good. I also think your other suggestion to set isQuitSent back to false when the socket is known to be closed is now needed since isQuitSent will become true even when not-delayed QUIT is sent.
I'll update the patch in comment 10 and I've started another "try" build. Again it is based on 138.0.2: https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=5ecf5d9f5f18a717e8849cc7d83ccd4305008e28
I'll provide the direct links when (and if) the build finishes.
| Assignee | ||
Comment 13•10 months ago
|
||
win64 installer: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/M9ZLj19qShixTsbHXdx8ZQ/runs/0/artifacts/public/build/install/sea/target.installer.exe
others:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MdYViv_DSgqKQyry4JnE2Q/runs/0/artifacts/public/build/target.dmg
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/aeyKvPHSTj-T4-gxcDisuQ/runs/0/artifacts/public/build/target.tar.xz
Comment 14•9 months ago
|
||
Gene, do you have time to drive this patch over the finishing line?
| Assignee | ||
Comment 15•9 months ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #14)
Gene, do you have time to drive this patch over the finishing line?
Yes, I've been working on it. After getting it to work (per my tests) it seemed too complex and maybe left open possibilities for problems. So I've changed my original approach from what is in the comment 10 WIP patch. I'll abandon that and start a new one.
So now I just let the quit timer start and just let quit occur (unless timer gets canceled by a new connect()) with less checking. If I see that the socket is closed or quit has occurred (onClose not yet signaled) when sending a command, I just open a new connection and "retry" the send from the start. This is similar to the current retry method for when smtp command failure occurs.
Also, the action on QUIT response (221 from server) was to close the connection. Per the smtp RFC, the connection close is to be done by the server. If the client (tb) does it, wireshark shows a lot of RST responses from the server since it's not expecting this. So now just doing nothing on the QUIT response.
In the patch you will see "gds" on some lines (mostly console.log calls) that are just temporary debug/info. They will be either removed or made into official log calls.
I also fixed a regression from my fix for bug 1864924 in "onFree" in SmtpServer. When not reusing, the client was not getting put back into the "free" list. I only saw a problem when multiple emails are composed and then quickly sent by clicking the "Send" button on each window. Didn't see this with "send later" of multiple messages because the sends are serial (next sends waits until previous send is finished before sending).
Also, improved some comments, made some vars private that didn't need to be global, and made quit delay timer time a pref.
Alexander, if you are reading this, I'm making a new "v9" try build here: https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=843ed3d2a0216918fa725ffd525412f17a3446f8
You will see I've included "debug" builds along with optimized builds. The reason is that I was unable to duplicate the problem you show in comment 0 with optimized but only with debug build. I can see the issue with new quit delay timer pref set to 100 ms and then put two message, using send later, into outbox and then send the unsent messages. The quit occurs after the first message is sent and the 2nd message starts but sees that the quit has been sent so it calls connect() to retry on a new connection. I've verified that all is working with wireshark and, of course, with console logging.
| Assignee | ||
Comment 16•9 months ago
|
||
Updated•9 months ago
|
| Assignee | ||
Updated•9 months ago
|
| Assignee | ||
Updated•9 months ago
|
Comment 17•9 months ago
|
||
Pushed by toby@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/85518b3d7c4b
Sometimes SMTP send fails due to socket state closed after delayed QUIT. r=mkmelin
Comment 18•9 months ago
|
||
I backed this out, it caused browser_attachmentReminder.js (and presumably other composition tests) to crash.
https://hg.mozilla.org/comm-central/rev/2affab834ebc86325cd4294a9198fa896f135853
| Assignee | ||
Comment 19•9 months ago
|
||
(In reply to Geoff Lankow (:darktrojan) from comment #18)
I backed this out, it caused browser_attachmentReminder.js (and presumably other composition tests) to crash.
https://hg.mozilla.org/comm-central/rev/2affab834ebc86325cd4294a9198fa896f135853
Saw this in my try build but was "crossed out" so thought it was n/a.
Anyhow, looking for the reason it fails. Probably something to do with retries, maybe.
| Assignee | ||
Comment 20•9 months ago
|
||
The test failures are now fixed (from what I can tell).
Here's the try build:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=43df1f1583422c2b6b93b83ca21aa7711e2cf788
Problem was caused by reconnect and resend when socket was closed when any QUIT was sent. Now I only do reconnect/resend when a delayed QUIT is sent (s/_isQuitSent/_isDelayedQuitSent). Rearranged the code at the start of _sendCommand().
Also, was able to make "reuseConnection" just a local variable instead of having file scope (no longer _reuseConnection). This required a parameter on function _free().
Tested sending many messages that trigger the original bug. I don't see any problems.
Update patch at comment 16.
| Assignee | ||
Comment 21•9 months ago
|
||
Alexander,
The try build above is for daily. If you want or need a try build based on the patch for a release, let me know. But we may want to wait until Magnus reviews my latest change. I held off submitting it since Magnus was out until today, June 30, it appears.
| Assignee | ||
Updated•9 months ago
|
Comment 22•9 months ago
|
||
Pushed by vineet@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/8451c7b23df5
Sometimes SMTP send fails due to socket state closed after delayed QUIT. r=mkmelin
Comment 23•5 months ago
|
||
(In reply to gene smith from comment #15)
(In reply to Magnus Melin [:mkmelin] from comment #14)
Gene, do you have time to drive this patch over the finishing line?
Yes, I've been working on it. After getting it to work (per my tests) it seemed too complex and maybe left open possibilities for problems. So I've changed my original approach from what is in the comment 10 WIP patch. I'll abandon that and start a new one.
So now I just let the quit timer start and just let quit occur (unless timer gets canceled by a new connect()) with less checking. If I see that the socket is closed or quit has occurred (onClose not yet signaled) when sending a command, I just open a new connection and "retry" the send from the start. This is similar to the current retry method for when smtp command failure occurs.
I've not yet fully checked for regression reports of bug 1864924, so I can't say to what extent version 140esr is impacted. One would hope that bug 1864924 resulted in a net positive - more good than harm.
Gene, do you have an opinion?
And, if it were needed is a simplified patch for esr even a possibility?
| Assignee | ||
Comment 24•5 months ago
|
||
Re: comment 23:
I've not yet fully checked for regression reports of bug 1864924, so I can't say to what extent version 140esr is impacted. One would hope that bug 1864924 resulted in a net positive - more good than harm.
Gene, do you have an opinion?
And, if it were needed is a simplified patch for esr even a possibility?
Bug 1864924 should only affect users of "mail merge" addon or heavy users of the "Send Later" feature. The patch for this bug 1965690 is already in the 143 and later releases and betas, so assuming there are significant users of these non-ESR releases, it should be OK to back-port this to 140esr. But since no complaints about bug 1864924 by 140esr users (AFAIK), it may not be necessary to back port.
I checked the patch for this bug and it applies cleanly as-is to the 140esr tree. So would not need a modified/simplified patch to include this in ESR, if you think it is necessary.
| Reporter | ||
Comment 25•5 months ago
|
||
The user, who originally reported the problem to me, is using the monthly releases of Thunderbird, i.e. version 143 - and not version 140 ESR. He has confirmed the patch solves the problem for him. I have not received any other similar problem reports. And I have not received any new regression reports due to the patch.
So back porting the patch seems not strictly necessary; but version 140 ESR is certainly affected.
| Assignee | ||
Updated•4 months ago
|
Description
•