Closed Bug 1929948 Opened 1 year ago Closed 5 months ago

Unable to re-connect to certain NNTP server.

Categories

(MailNews Core :: Networking: NNTP, defect)

Thunderbird 128
defect

Tracking

(thunderbird_esr115 wontfix, thunderbird_esr128 wontfix, thunderbird140 fixed, thunderbird141 fixed)

RESOLVED FIXED
141 Branch
Tracking Status
thunderbird_esr115 --- wontfix
thunderbird_esr128 --- wontfix
thunderbird140 --- fixed
thunderbird141 --- fixed

People

(Reporter: mozillabugz, Assigned: welpy-cw)

References

Details

Attachments

(3 files, 2 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:132.0) Gecko/20100101 Firefox/132.0

Steps to reproduce:

I have had an "uneasy feeling" since the C/C++ NNTP client was replaced by the JS client that all is not well with how it handles connections, difficult to give exact steps to reproduce, however once it gets into a "bad state" it requires switching to work offline, switching back to work online, then it starts working again.

If I can establish a method to accurately reproduce, I will come back and update, but I suspect it depends on TCP connection idle times, inactivity within TB and other "randomish" events.

Actual results:

I can end up in a state where there are messages for a usenet group that are unread with bodies not retrieved, clicking on a message claims to be "loading message" in the status bar, but it never loads, I can switch to another message but it similarly doesn't load, double-clicking the message to load in a new tab still doesn't load.

Once in this state I can perform a wirshark trace and see that what happens is
TB connects to server
server responds "200 welcome"
TB immediately sends "QUIT" and closes connection
server responds "205 bye"
TCP stack issues a RST/ACK rather than a FIN/ACK to server.

Expected results:

message should actually be loaded and displayed

Screenshot of decoded wireshark
http://andyburns.uk/misc/tb-connect-quit.png

Without steps to reproduce, and nntp log (see https://wiki.mozilla.org/MailNews:Logging) there's not much to do here.

Component: Untriaged → Networking: NNTP
Product: Thunderbird → MailNews Core

Logs I can provide, but given that steps to triggering the issue likely depends on the state of the someone else's NNTP server, and my timing of getting new messages and retrieving them, it's effectively beyond my control, however I can say it occurs several times a day to me ... I'm aware of other users seeing the issue, and it occurs on multiple servers.

Attached file Error Console Output (obsolete) —
I turned on logging and was able to see it in the "connect/quit loop" within a few minutes

I turned on nntp logging and was able to see TB had got into the "connect/quit loop" within a few minutes

mailnews.nntp: Done with status=0 NntpClient.sys.mjs:957:18
mailnews.nntp: asyncOpen news://news.individual.net:119/lp4o5iFf960U1%40mid.individual.net?group=comp.mobile.android&key=139908 NntpChannel.sys.mjs:164:18
NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIMsgFolder.firstNewMessage]
    onMessagesLoaded chrome://messenger/content/mailCommon.js:1077
    _applyViewChanges resource:///modules/DBViewWrapper.sys.mjs:1548
    _enterFolder resource:///modules/DBViewWrapper.sys.mjs:1282
    _folderLoaded resource:///modules/DBViewWrapper.sys.mjs:1260
    onFolderEvent resource:///modules/DBViewWrapper.sys.mjs:246
    updateFolderAndNotifyOnLoad resource:///modules/DBViewWrapper.sys.mjs:113
    open resource:///modules/DBViewWrapper.sys.mjs:884
    _onSelect chrome://messenger/content/about3Pane.js:2756
    handleEvent chrome://messenger/content/about3Pane.js:1725
    updateRowClasses chrome://messenger/content/tree-listbox-mixin.mjs:717
    updateSelection chrome://messenger/content/tree-listbox-mixin.mjs:599
    _onClick chrome://messenger/content/tree-listbox-mixin.mjs:189
    handleEvent chrome://messenger/content/tree-listbox-mixin.mjs:156
mailCommon.js:1085:19
2147942487 ContentDispatchChooser.sys.mjs:83:18
mailnews.nntp: S: 503 time out

NntpClient.sys.mjs:166:18
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIMsgFolder.firstNewMessage]
    onMessagesLoaded chrome://messenger/content/mailCommon.js:1077
    _applyViewChanges resource:///modules/DBViewWrapper.sys.mjs:1548
    _enterFolder resource:///modules/DBViewWrapper.sys.mjs:1282
    open resource:///modules/DBViewWrapper.sys.mjs:897
    _onSelect chrome://messenger/content/about3Pane.js:2756
    handleEvent chrome://messenger/content/about3Pane.js:1725
    updateRowClasses chrome://messenger/content/tree-listbox-mixin.mjs:717
    updateSelection chrome://messenger/content/tree-listbox-mixin.mjs:599
    _onClick chrome://messenger/content/tree-listbox-mixin.mjs:189
    handleEvent chrome://messenger/content/tree-listbox-mixin.mjs:156
mailCommon.js:1085:19
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com

NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374969 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/20241107205339.315cdf05a08b192f4fd7aa2a%40127.0.0.1?group=uk.rec.sheds&key=1374969 NntpChannel.sys.mjs:164:18
mailnews.nntp: Read from server NntpChannel.sys.mjs:281:18
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com

NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
This page is in Quirks Mode. Page layout may be impacted. For Standards Mode use “<!DOCTYPE html>”.
2 20241107205339.315cdf05a08b192f4fd7aa2a@127.0.0.1
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com

NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com

NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374965 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/997519690.752671787.656505.timdownieuk-yahoo.co.youkay%40news.individual.net?group=uk.rec.sheds&key=1374965 2 NntpChannel.sys.mjs:164:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374966 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/vgibfg%242l0hl%241%40dont-email.me?group=uk.rec.sheds&key=1374966 2 NntpChannel.sys.mjs:164:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374967 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/lp49v6FdhbdU1%40mid.individual.net?group=uk.rec.sheds&key=1374967 2 NntpChannel.sys.mjs:164:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374957 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/vghr5n%242iim4%241%40dont-email.me?group=uk.rec.sheds&key=1374957 NntpChannel.sys.mjs:164:18
mailnews.nntp: Read from server NntpChannel.sys.mjs:281:18
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com

NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
This page is in Quirks Mode. Page layout may be impacted. For Standards Mode use “<!DOCTYPE html>”.
2 vghr5n$2iim4$1@dont-email.me
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374958 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/09bf19dd-67af-181f-95fb-425501a6261a%40example.net?group=uk.rec.sheds&key=1374958 2 NntpChannel.sys.mjs:164:18
mailnews.nntp: loadMessage news-message://news.plus.net/uk.rec.sheds#1374957 NntpMessageService.sys.mjs:29:18
mailnews.nntp: asyncOpen news://news.plus.net:119/vghr5n%242iim4%241%40dont-email.me?group=uk.rec.sheds&key=1374957 NntpChannel.sys.mjs:164:18
mailnews.nntp: Read from server NntpChannel.sys.mjs:281:18
mailnews.nntp: Connecting to news://news.plus.net:119 NntpClient.sys.mjs:96:20
mailnews.nntp: Connected NntpClient.sys.mjs:141:18
mailnews.nntp: S: 200 News.GigaNews.Com

NntpClient.sys.mjs:166:18
mailnews.nntp: C: QUIT NntpClient.sys.mjs:320:20
mailnews.nntp: Done with status=2147500037 NntpClient.sys.mjs:957:18
This page is in Quirks Mode. Page layout may be impacted. For Standards Mode use “<!DOCTYPE html>”.
vghr5n$2iim4$1@dont-email.me
This page is in Quirks Mode. Page layout may be impacted. For Standards Mode use “<!DOCTYPE html>”.
vghr5n$2iim4$1@dont-email.me
mailnews.nntp: Connection closed. NntpClient.sys.mjs:276:18

​

That would be more informative with timestamps. See the wiki for how to enable.
And also accompanying explanation for what you do and see for the steps.

I already cleared the error console, but have restarted it and had discovered how to see timestamps.

I'm trying to home-in on a sequence of events to reproduce, for giganews I think it's related to a timeout that happens after several keepalives, for news.individual it seems to be a connection that doesn't use keepalives, but closes unexpectedly after a longer period of inactivity and then several retransmissions.

(In reply to Andy Burns from comment #4)

This line is noticeable:

mailnews.nntp: S: 503 time out

The log contains data on two connections to two different servers - "Individual.net" and "Plus.net" (or GigaNews.Com). It is therefore not possible to say for sure which one it comes from. However, I have not yet seen it at Individual. I am not familiar with Plus.net.

Can you say which one is causing the bug?

503 time out is not standardized in RFC 3977. TB therefore probably does not expect it. Which commands were sent to the server shortly before?
If TB does not expect this response, unusual effects may occur.

The 503 timeout occurs on the plusnet (giganews) connection, I'm not saying that no errors occur on the individual.net server, but I'm concentrating on the plusnet server.

There are differences in the way the two servers behave...

plusnet appears to use TCP keepalive messages, after 7 or 8 such keepalive exchanges, the server seems to regard the connection as "idle" and then it responds with the 503, I assume that TB then marks the connection as ended, and next time it needs to retrieve a message, it makes a new TCP connection, but it then immediately sends a QUIT message and doesn't attempt to retrieve the article, but doesn't report any problem, it just leaves the message as not retrieved.

Maybe some sort of flag to indicate the connection is closed isn't being cleared and the nntpclient drops into some code that kills brand new connections? it feels that way, until I switch it to working offline and back online again, maybe that clears the flag?

The individual.net server doesn't seem to use keepalives, though I think after a longer period the connection gets closed, hence less likely that I see problems on that server as I will likely retrieve new headers or articles during the longer timeout period.

I've re-added an eternal sepetember account now, just to have something else to compare to.

(In reply to Alfred Peters [:infofrommozilla] from comment #8)

Can you say which one is causing the bug?

Which commands were sent to the server shortly before?

?

See/dupe of bug 1901338

It's the plusnet/giganews server that returns the 503 timeout error
The following is a minimal sequence of events to reproduce the issue

I focus on a thread which has unread messages from the plusnet server

go offline (to kill all nntp connections)

start a fresh wireshark capture of all port 119 traffic
clear error console

get new messages (not for all servers, just for the plusnet server)

TB realises it needs a new connection, server sends 200 response
TB sends a GROUP command an receives a reply
[sometimes it also sends an XOVER command and receives a reply]

I then do nothing for several minutes,
during which keepalives are exchanged
until the 503 timeout status is send [not in response to any request from TB]

at this point there is no nntp connection,
confirmed by netstat -an | find ":119"

Then I click on one of the unread messages in the same thread

[***]
TB realises it needs another new connection
it makes a connection which is accepted by the server
TB immediately sends a QUIT request, server responds with 205 bye status

at this point there is no nntp connection,
confirmed by netstat command

TB is now "stuck" every action results in it looping back to [***]
the only way to end this loop, is to go offline/online again.

Attached image Minimal console log
Attachment #9436226 - Attachment is obsolete: true
Attachment #9436254 - Attachment is obsolete: true

(In reply to Magnus Melin [:mkmelin] from comment #11)

See/dupe of bug 1901338

If it is due to the server terminating the connection, yes.

However, the log looks as if TB terminates the connection repeatedly for no reason after the 503 response. Perhaps this response is confusing the NNTP client. Someone should take a look at this...

I have a slightly different scenario but a similarly bad ending: no more NNTP exchanges, regardless of how hard I hit F5, until I go offline then back online.

The scenario is simply to use nntp.lore.kernel.org:119 (it is a public NNTP server dedicated to Linux kernel development; a pretty important use case I'd say ;-), get an update, look at a few messages ... then do nothing for 5 minutes.

Then, do a "Get Messages" on one of the newsgroups. Stalled.

Here is what can be seen from a tcpdump capture on port 119:

  • during the 5-minute inactivity period, both endpoints (TB and the remote server) happily keep the connection. No packets exchanged (which is normal as TCP keepalives on this system are set to kick in only after 2 hours).
  • when we awaken TB for a refresh, it issues a "GROUP" command on the already established TCP
  • at this point, the nntp.lore.kernel.org server seems not to be happy (applicative timeout ?) and closes with just a TCP FIN (no 503 or other payload).
  • the TB side replies with a FIN, and netstat show the connection is fully terminated

So, at this point "the TB side" (here, the Linux OS, Debian 11) knows very well that the connection is done. TB itself, I'm not sure. Because further refreshes never try a new connection, until I manually put TB offline then back online.

I think something is wrong with the network callbacks for NNTP connections' end-of-life.

Update: if I force the FIN during inactivity (i.e. not in response to a command like "GROUP"), then TB correctly reacts by reconnecting, and doing the sequel of its NNTP duties on the fresh connection.

So, it looks like the failed callback occurs only while TB is expecting an applicative (NNTP) response ; when TB is idle, the end-of-connection callback kicks in normally.

Hope this helps locate the faulty code.

An additional observation: when the connection ends with an RST instead of a FIN, the reaction is different. In that case, TB "fails once" to restore the NNTP connection, but a subsequent "Get Messages" properly starts afresh, and succeeds in reconnecting.

Putting breakpoints inside the NntpClient class shows that

  • in the RST case, _onError() is called with errorCode == NS_ERROR_NET_RESET (of course), then _onClose() is called
  • in the FIN case, only _onClose() is called

So it looks like _onError() has a "cleaning up" effect even in the context of expecting an NNTP response from the server, while _onClose() alone doesn't in that situation.

Having given the steps to reproduce with the giganews server which sends status 503 after a timeout period, I can confirm that using a different news server which doesn't do that, I notice TB still gets into a state where repeatedly clicking "get messages" results in a status bar message "there are no new messages on server", yet switching offline and back online, the getting messages again, will retrieve new messages in multiple groups.

I will try to reproduce the steps using individual.net but it is more like a 2 hour sequence, that the 5 minute sequence with giganews.

In addition to the code-level hints above, please also note that the problem is easily missed if one has too short a refresh period. As the "Check for new messages every" value cannot be set to "infinite" by value zero (in which case it is close to 30 seconds), I just set it to 1440 (aka one day) to be on the safe side. With this, on nntp.lore.kernel.org:119, I get a 100% repro after 6 minutes of inactivity.

In case somebody is blocked by this and cannot wait until it is fixed (my case), here is a workaround: simply use nftables's set operator to map all NNTP FIN to RST, unconditionally:

# I'm doing NNTP over SSH port forwarding, that's why I get FIN instead of RST
# so, the offending FINs are over the `lo` device, with remote TCP port 1119
table netdev ta {
        chain ch {
                type filter hook ingress device "lo" priority filter; policy accept;
                tcp sport 1119 tcp flags fin tcp flags set rst
        }
}

The result is that when waking up from a long inactivity, TB receives an RST after writing an NNTP command to the dead connection, and posts a pop-up saying so. A second "Get Messages" then works properly. No more need to go offline/online.

See Also: → 1901338

From comment 16:

The scenario is simply to use nntp.lore.kernel.org:119 (it is a public NNTP server dedicated to Linux kernel development; a pretty important use case I'd say ;-), get an update, look at a few messages ... then do nothing for 5 minutes.

Then, do a "Get Messages" on one of the newsgroups. Stalled.

Set up account as described keeping "check for message every X min" and "check for messages at startup" both disabled. Messages come in on "get messages" click. Wait >5 minutes. Click "get messages" again and see popup saying "Connection to server nntp.lore.kernel.org was reset". Click again and new messages come in.

From comment 18:

An additional observation: when the connection ends with an RST instead of a FIN, the reaction is different. In that case, TB "fails once" to restore the NNTP connection, but a subsequent "Get Messages" properly starts afresh, and succeeds in reconnecting.

I suppose this is what I'm seeing.
Anyhow, need to check this with wireshark. I strongly suspect this is similar (or maybe the same) issue as bug 1901338.

Fortunately very easy to reproduce, even without a Giganews subscription:

  1. Enable NNTP logging.
  2. Set up a newsgroups account for news.giganews.com.
  3. "Subscribe…"
  4. "Refresh"
  5. "Cancel"
  6. Wait for 5 minutes.

The following simple patch should fix this.

Severity: -- → S3
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: problems with NNTP sessions → Unable to re-connect to certain NNTP server.
Assignee: nobody → h.w.forms
Status: NEW → ASSIGNED

If this makes its way into a nightly build of some sort, I'll be happy to test it ...

@Hartmut: just tested by hand-patching the omni.ja in my 128.11.0esr => works like a charm !!!

You are awesome: thanks a lot !

Duplicate of this bug: 1901338
Target Milestone: --- → 141 Branch

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/66973eedb82b
Reset NntpClient._nextAction as well. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED

(In reply to Andy Burns from comment #25)

If this makes its way into a nightly build of some sort, I'll be happy to test it ...

For Windows 64-bit you can use target.installer.exe from here, while it should be in the next official Daily (available from the Thunderbird website) as well.

(In reply to alexandre.ferrieux from comment #26)

@Hartmut: just tested by hand-patching the omni.ja in my 128.11.0esr => works like a charm !!!

You are awesome: thanks a lot !

That's really nice to hear. Thanks for your quick feedback!

(In reply to Hartmut Welpmann [:welpy-cw] from comment #29)

For Windows 64-bit you can use target.installer.exe from here

I think I've run it long enough, and refreshed groups enough times, to say it would normally have gone wrong before now, but it's working fine.
Thanks.

One weird thing ,may be just an unrelated "daily" thing, but I have an RSS feeds account, and after the first time I do a "get messages" all four feeds within it start showing a "wig-wag" icon, which never stops wig-wagging?

Comment on attachment 9492744 [details]
Bug 1929948 - Reset NntpClient._nextAction as well. r=#thunderbird-reviewers

Uplift Approval Request

  • Please state case for uplift consideration and ensure bug severity is set: Very simple fix with considerable benefit for some users.
  • User impact if declined: Reestablishing a connection to an NNTP server may not be possible.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Daily?: Yes
  • Has the fix been verified in Beta?: No
  • Needs manual test from QA?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The state of the NntpClient wasn't completely reset after an action had been finished, I don't expect any side-effects from fixing this.
  • String changes made/needed: None.
Attachment #9492744 - Flags: approval-comm-beta?

Comment on attachment 9492744 [details]
Bug 1929948 - Reset NntpClient._nextAction as well. r=#thunderbird-reviewers

[Triage Comment]
Approved for beta

Attachment #9492744 - Flags: approval-comm-beta? → approval-comm-beta+
No longer duplicate of this bug: 1901338
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: