Unable to re-connect to certain NNTP server.
Categories
(MailNews Core :: Networking: NNTP, defect)
Tracking
(thunderbird_esr115 wontfix, thunderbird_esr128 wontfix, thunderbird140 fixed, thunderbird141 fixed)
People
(Reporter: mozillabugz, Assigned: welpy-cw)
References
Details
Attachments
(3 files, 2 obsolete files)
|
173.63 KB,
image/png
|
Details | |
|
397.90 KB,
image/png
|
Details | |
|
48 bytes,
text/x-phabricator-request
|
wsmwk
:
approval-comm-beta+
|
Details | Review |
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:132.0) Gecko/20100101 Firefox/132.0
Steps to reproduce:
I have had an "uneasy feeling" since the C/C++ NNTP client was replaced by the JS client that all is not well with how it handles connections, difficult to give exact steps to reproduce, however once it gets into a "bad state" it requires switching to work offline, switching back to work online, then it starts working again.
If I can establish a method to accurately reproduce, I will come back and update, but I suspect it depends on TCP connection idle times, inactivity within TB and other "randomish" events.
Actual results:
I can end up in a state where there are messages for a usenet group that are unread with bodies not retrieved, clicking on a message claims to be "loading message" in the status bar, but it never loads, I can switch to another message but it similarly doesn't load, double-clicking the message to load in a new tab still doesn't load.
Once in this state I can perform a wirshark trace and see that what happens is
TB connects to server
server responds "200 welcome"
TB immediately sends "QUIT" and closes connection
server responds "205 bye"
TCP stack issues a RST/ACK rather than a FIN/ACK to server.
Expected results:
message should actually be loaded and displayed
| Reporter | ||
Comment 1•1 year ago
|
||
Screenshot of decoded wireshark
http://andyburns.uk/misc/tb-connect-quit.png
Comment 2•1 year ago
|
||
Without steps to reproduce, and nntp log (see https://wiki.mozilla.org/MailNews:Logging) there's not much to do here.
| Reporter | ||
Comment 3•1 year ago
|
||
Logs I can provide, but given that steps to triggering the issue likely depends on the state of the someone else's NNTP server, and my timing of getting new messages and retrieving them, it's effectively beyond my control, however I can say it occurs several times a day to me ... I'm aware of other users seeing the issue, and it occurs on multiple servers.
| Reporter | ||
Comment 4•1 year ago
|
||
| Reporter | ||
Comment 5•1 year ago
|
||
I turned on nntp logging and was able to see TB had got into the "connect/quit loop" within a few minutes
mailnews.nntp: Done with status=0 NntpClient.sys.mjs:957:18
mailnews.nntp: asyncOpen news://news.individual.net:119/lp4o5iFf960U1%40mid.individual.net?group=comp.mobile.android&key=139908 NntpChannel.sys.mjs:164:18
NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIMsgFolder.firstNewMessage]
onMessagesLoaded chrome://messenger/content/mailCommon.js:1077
_applyViewChanges resource:///modules/DBViewWrapper.sys.mjs:1548
_enterFolder resource:///modules/DBViewWrapper.sys.mjs:1282
_folderLoaded resource:///modules/DBViewWrapper.sys.mjs:1260
onFolderEvent resource:///modules/DBViewWrapper.sys.mjs:246
updateFolderAndNotifyOnLoad resource:///modules/DBViewWrapper.sys.mjs:113
open resource:///modules/DBViewWrapper.sys.mjs:884
_onSelect chrome://messenger/content/about3Pane.js:2756
handleEvent chrome://messenger/content/about3Pane.js:1725
updateRowClasses chrome://messenger/content/tree-listbox-mixin.mjs:717
updateSelection chrome://messenger/content/tree-listbox-mixin.mjs:599
_onClick chrome://messenger/content/tree-listbox-mixin.mjs:189
handleEvent chrome://messenger/content/tree-listbox-mixin.mjs:156
mailCommon.js:1085:19
2147942487 ContentDispatchChooser.sys.mjs:83:18
mailnews.nntp: S: 503 time out
NntpClient.sys.mjs:166:18
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIMsgFolder.firstNewMessage]
onMessagesLoaded chrome://messenger/content/mailCommon.js:1077
_applyViewChanges resource:///modules/DBViewWrapper.sys.mjs:1548
_enterFolder resource:///modules/DBViewWrapper.sys.mjs:1282
open resource:///modules/DBViewWrapper.sys.mjs:897
_onSelect chrome://messenger/content/about3Pane.js:2756
handleEvent chrome://messenger/content/about3Pane.js:1725
updateRowClasses chrome://messenger/content/tree-listbox-mixin.mjs:717
updateSelection chrome://messenger/content/tree-listbox-mixin.mjs:599
_onClick chrome://messenger/content/tree-listbox-mixin.mjs:189
handleEvent chrome://messenger/content/tree-listbox-mixin.mjs:156
mailCommon.js:1085:19
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com
NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374969 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/20241107205339.315cdf05a08b192f4fd7aa2a%40127.0.0.1?group=uk.rec.sheds&key=1374969 NntpChannel.sys.mjs:164:18
mailnews.nntp: Read from server NntpChannel.sys.mjs:281:18
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com
NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
This page is in Quirks Mode. Page layout may be impacted. For Standards Mode use “<!DOCTYPE html>”.
2 20241107205339.315cdf05a08b192f4fd7aa2a@127.0.0.1
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com
NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com
NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374965 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/997519690.752671787.656505.timdownieuk-yahoo.co.youkay%40news.individual.net?group=uk.rec.sheds&key=1374965 2 NntpChannel.sys.mjs:164:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374966 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/vgibfg%242l0hl%241%40dont-email.me?group=uk.rec.sheds&key=1374966 2 NntpChannel.sys.mjs:164:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374967 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/lp49v6FdhbdU1%40mid.individual.net?group=uk.rec.sheds&key=1374967 2 NntpChannel.sys.mjs:164:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374957 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/vghr5n%242iim4%241%40dont-email.me?group=uk.rec.sheds&key=1374957 NntpChannel.sys.mjs:164:18
mailnews.nntp: Read from server NntpChannel.sys.mjs:281:18
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com
NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
This page is in Quirks Mode. Page layout may be impacted. For Standards Mode use “<!DOCTYPE html>”.
2 vghr5n$2iim4$1@dont-email.me
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374958 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/09bf19dd-67af-181f-95fb-425501a6261a%40example.net?group=uk.rec.sheds&key=1374958 2 NntpChannel.sys.mjs:164:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374957 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/vghr5n%242iim4%241%40dont-email.me?group=uk.rec.sheds&key=1374957 NntpChannel.sys.mjs:164:18
mailnews.nntp: Read from server NntpChannel.sys.mjs:281:18
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com
NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
This page is in Quirks Mode. Page layout may be impacted. For Standards Mode use “<!DOCTYPE html>”.
vghr5n$2iim4$1@dont-email.me
This page is in Quirks Mode. Page layout may be impacted. For Standards Mode use “<!DOCTYPE html>”.
vghr5n$2iim4$1@dont-email.me
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
Comment 6•1 year ago
|
||
That would be more informative with timestamps. See the wiki for how to enable.
And also accompanying explanation for what you do and see for the steps.
| Reporter | ||
Comment 7•1 year ago
|
||
I already cleared the error console, but have restarted it and had discovered how to see timestamps.
I'm trying to home-in on a sequence of events to reproduce, for giganews I think it's related to a timeout that happens after several keepalives, for news.individual it seems to be a connection that doesn't use keepalives, but closes unexpectedly after a longer period of inactivity and then several retransmissions.
Comment 8•1 year ago
|
||
(In reply to Andy Burns from comment #4)
This line is noticeable:
mailnews.nntp: S: 503 time out
The log contains data on two connections to two different servers - "Individual.net" and "Plus.net" (or GigaNews.Com). It is therefore not possible to say for sure which one it comes from. However, I have not yet seen it at Individual. I am not familiar with Plus.net.
Can you say which one is causing the bug?
503 time out is not standardized in RFC 3977. TB therefore probably does not expect it. Which commands were sent to the server shortly before?
If TB does not expect this response, unusual effects may occur.
| Reporter | ||
Comment 9•1 year ago
|
||
The 503 timeout occurs on the plusnet (giganews) connection, I'm not saying that no errors occur on the individual.net server, but I'm concentrating on the plusnet server.
There are differences in the way the two servers behave...
plusnet appears to use TCP keepalive messages, after 7 or 8 such keepalive exchanges, the server seems to regard the connection as "idle" and then it responds with the 503, I assume that TB then marks the connection as ended, and next time it needs to retrieve a message, it makes a new TCP connection, but it then immediately sends a QUIT message and doesn't attempt to retrieve the article, but doesn't report any problem, it just leaves the message as not retrieved.
Maybe some sort of flag to indicate the connection is closed isn't being cleared and the nntpclient drops into some code that kills brand new connections? it feels that way, until I switch it to working offline and back online again, maybe that clears the flag?
The individual.net server doesn't seem to use keepalives, though I think after a longer period the connection gets closed, hence less likely that I see problems on that server as I will likely retrieve new headers or articles during the longer timeout period.
I've re-added an eternal sepetember account now, just to have something else to compare to.
Comment 10•1 year ago
|
||
(In reply to Alfred Peters [:infofrommozilla] from comment #8)
Can you say which one is causing the bug?
Which commands were sent to the server shortly before?
?
Comment 11•1 year ago
|
||
See/dupe of bug 1901338
| Reporter | ||
Comment 12•1 year ago
|
||
It's the plusnet/giganews server that returns the 503 timeout error
The following is a minimal sequence of events to reproduce the issue
I focus on a thread which has unread messages from the plusnet server
go offline (to kill all nntp connections)
start a fresh wireshark capture of all port 119 traffic
clear error console
get new messages (not for all servers, just for the plusnet server)
TB realises it needs a new connection, server sends 200 response
TB sends a GROUP command an receives a reply
[sometimes it also sends an XOVER command and receives a reply]
I then do nothing for several minutes,
during which keepalives are exchanged
until the 503 timeout status is send [not in response to any request from TB]
at this point there is no nntp connection,
confirmed by netstat -an | find ":119"
Then I click on one of the unread messages in the same thread
[***]
TB realises it needs another new connection
it makes a connection which is accepted by the server
TB immediately sends a QUIT request, server responds with 205 bye status
at this point there is no nntp connection,
confirmed by netstat command
TB is now "stuck" every action results in it looping back to [***]
the only way to end this loop, is to go offline/online again.
| Reporter | ||
Comment 13•1 year ago
|
||
| Reporter | ||
Comment 14•1 year ago
|
||
| Reporter | ||
Updated•1 year ago
|
| Reporter | ||
Updated•1 year ago
|
Comment 15•1 year ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #11)
See/dupe of bug 1901338
If it is due to the server terminating the connection, yes.
However, the log looks as if TB terminates the connection repeatedly for no reason after the 503 response. Perhaps this response is confusing the NNTP client. Someone should take a look at this...
Comment 16•1 year ago
|
||
I have a slightly different scenario but a similarly bad ending: no more NNTP exchanges, regardless of how hard I hit F5, until I go offline then back online.
The scenario is simply to use nntp.lore.kernel.org:119 (it is a public NNTP server dedicated to Linux kernel development; a pretty important use case I'd say ;-), get an update, look at a few messages ... then do nothing for 5 minutes.
Then, do a "Get Messages" on one of the newsgroups. Stalled.
Here is what can be seen from a tcpdump capture on port 119:
- during the 5-minute inactivity period, both endpoints (TB and the remote server) happily keep the connection. No packets exchanged (which is normal as TCP keepalives on this system are set to kick in only after 2 hours).
- when we awaken TB for a refresh, it issues a "GROUP" command on the already established TCP
- at this point, the
nntp.lore.kernel.orgserver seems not to be happy (applicative timeout ?) and closes with just a TCP FIN (no 503 or other payload). - the TB side replies with a FIN, and netstat show the connection is fully terminated
So, at this point "the TB side" (here, the Linux OS, Debian 11) knows very well that the connection is done. TB itself, I'm not sure. Because further refreshes never try a new connection, until I manually put TB offline then back online.
I think something is wrong with the network callbacks for NNTP connections' end-of-life.
Comment 17•1 year ago
|
||
Update: if I force the FIN during inactivity (i.e. not in response to a command like "GROUP"), then TB correctly reacts by reconnecting, and doing the sequel of its NNTP duties on the fresh connection.
So, it looks like the failed callback occurs only while TB is expecting an applicative (NNTP) response ; when TB is idle, the end-of-connection callback kicks in normally.
Hope this helps locate the faulty code.
Comment 18•1 year ago
|
||
An additional observation: when the connection ends with an RST instead of a FIN, the reaction is different. In that case, TB "fails once" to restore the NNTP connection, but a subsequent "Get Messages" properly starts afresh, and succeeds in reconnecting.
Putting breakpoints inside the NntpClient class shows that
- in the RST case, _onError() is called with errorCode == NS_ERROR_NET_RESET (of course), then _onClose() is called
- in the FIN case, only _onClose() is called
So it looks like _onError() has a "cleaning up" effect even in the context of expecting an NNTP response from the server, while _onClose() alone doesn't in that situation.
| Reporter | ||
Comment 19•1 year ago
|
||
Having given the steps to reproduce with the giganews server which sends status 503 after a timeout period, I can confirm that using a different news server which doesn't do that, I notice TB still gets into a state where repeatedly clicking "get messages" results in a status bar message "there are no new messages on server", yet switching offline and back online, the getting messages again, will retrieve new messages in multiple groups.
I will try to reproduce the steps using individual.net but it is more like a 2 hour sequence, that the 5 minute sequence with giganews.
Comment 20•1 year ago
|
||
In addition to the code-level hints above, please also note that the problem is easily missed if one has too short a refresh period. As the "Check for new messages every" value cannot be set to "infinite" by value zero (in which case it is close to 30 seconds), I just set it to 1440 (aka one day) to be on the safe side. With this, on nntp.lore.kernel.org:119, I get a 100% repro after 6 minutes of inactivity.
Comment 21•1 year ago
|
||
In case somebody is blocked by this and cannot wait until it is fixed (my case), here is a workaround: simply use nftables's set operator to map all NNTP FIN to RST, unconditionally:
# I'm doing NNTP over SSH port forwarding, that's why I get FIN instead of RST
# so, the offending FINs are over the `lo` device, with remote TCP port 1119
table netdev ta {
chain ch {
type filter hook ingress device "lo" priority filter; policy accept;
tcp sport 1119 tcp flags fin tcp flags set rst
}
}
The result is that when waking up from a long inactivity, TB receives an RST after writing an NNTP command to the dead connection, and posts a pop-up saying so. A second "Get Messages" then works properly. No more need to go offline/online.
Comment 22•8 months ago
|
||
From comment 16:
The scenario is simply to use nntp.lore.kernel.org:119 (it is a public NNTP server dedicated to Linux kernel development; a pretty important use case I'd say ;-), get an update, look at a few messages ... then do nothing for 5 minutes.
Then, do a "Get Messages" on one of the newsgroups. Stalled.
Set up account as described keeping "check for message every X min" and "check for messages at startup" both disabled. Messages come in on "get messages" click. Wait >5 minutes. Click "get messages" again and see popup saying "Connection to server nntp.lore.kernel.org was reset". Click again and new messages come in.
From comment 18:
An additional observation: when the connection ends with an RST instead of a FIN, the reaction is different. In that case, TB "fails once" to restore the NNTP connection, but a subsequent "Get Messages" properly starts afresh, and succeeds in reconnecting.
I suppose this is what I'm seeing.
Anyhow, need to check this with wireshark. I strongly suspect this is similar (or maybe the same) issue as bug 1901338.
| Assignee | ||
Comment 23•5 months ago
|
||
Fortunately very easy to reproduce, even without a Giganews subscription:
- Enable NNTP logging.
- Set up a newsgroups account for news.giganews.com.
- "Subscribe…"
- "Refresh"
- "Cancel"
- Wait for 5 minutes.
The following simple patch should fix this.
| Assignee | ||
Comment 24•5 months ago
|
||
Updated•5 months ago
|
| Reporter | ||
Comment 25•5 months ago
|
||
If this makes its way into a nightly build of some sort, I'll be happy to test it ...
Comment 26•5 months ago
|
||
@Hartmut: just tested by hand-patching the omni.ja in my 128.11.0esr => works like a charm !!!
You are awesome: thanks a lot !
| Assignee | ||
Updated•5 months ago
|
Comment 28•5 months ago
|
||
Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/66973eedb82b
Reset NntpClient._nextAction as well. r=mkmelin
| Assignee | ||
Comment 29•5 months ago
|
||
(In reply to Andy Burns from comment #25)
If this makes its way into a nightly build of some sort, I'll be happy to test it ...
For Windows 64-bit you can use target.installer.exe from here, while it should be in the next official Daily (available from the Thunderbird website) as well.
| Assignee | ||
Comment 30•5 months ago
|
||
(In reply to alexandre.ferrieux from comment #26)
@Hartmut: just tested by hand-patching the omni.ja in my 128.11.0esr => works like a charm !!!
You are awesome: thanks a lot !
That's really nice to hear. Thanks for your quick feedback!
| Reporter | ||
Comment 31•5 months ago
|
||
(In reply to Hartmut Welpmann [:welpy-cw] from comment #29)
For Windows 64-bit you can use
target.installer.exefrom here
I think I've run it long enough, and refreshed groups enough times, to say it would normally have gone wrong before now, but it's working fine.
Thanks.
One weird thing ,may be just an unrelated "daily" thing, but I have an RSS feeds account, and after the first time I do a "get messages" all four feeds within it start showing a "wig-wag" icon, which never stops wig-wagging?
| Assignee | ||
Comment 32•5 months ago
|
||
Comment on attachment 9492744 [details]
Bug 1929948 - Reset NntpClient._nextAction as well. r=#thunderbird-reviewers
Uplift Approval Request
- Please state case for uplift consideration and ensure bug severity is set: Very simple fix with considerable benefit for some users.
- User impact if declined: Reestablishing a connection to an NNTP server may not be possible.
- Is this code covered by automated tests?: No
- Has the fix been verified in Daily?: Yes
- Has the fix been verified in Beta?: No
- Needs manual test from QA?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The state of the NntpClient wasn't completely reset after an action had been finished, I don't expect any side-effects from fixing this.
- String changes made/needed: None.
| Assignee | ||
Updated•5 months ago
|
Comment 33•5 months ago
|
||
Comment on attachment 9492744 [details]
Bug 1929948 - Reset NntpClient._nextAction as well. r=#thunderbird-reviewers
[Triage Comment]
Approved for beta
Comment 34•5 months ago
|
||
| bugherder uplift | ||
Thunderbird 140.0b3:
https://hg.mozilla.org/releases/comm-beta/rev/6c79ff4aee37
Description
•