1901338 - When thunderbird loses connection to news.eternal-september.org:563 it can't reconnect.

Reporter

Description

•

9 months ago

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0

Steps to reproduce:

Configure a Usenet news account for news.eternal-september.org. Select port 563
and SSL/TLS, have it check for new messages once per hour. Leave it alone for an
hour or so, then try to get new messages. Results are easier to see in a high
volume newsgroup such as alt.atheism.

Actual results:

After an hour running thunderbird unattended, netstat showed that there were
no existing connections to the server. Clicking on Get New Messages did not
cause a new connection to be established. Netstat didn't show anything trying
to connect to the server. When I exited from thunderbird by clicking the x at the
right end of the title bar, a notification was displayed saying that Eternal September
had refused the connection.

The only way I could get to see new articles was to exit from thunderbird then
start it up again.

These observations were made on a computer running Fedora 40 Linux and
thunderbird 115.11.0. I haven't tried any NNTPS server besides Eternal September.
This problem doesn't happen to unencrypted connections to port 119.

Expected results:

After a connection has been dropped, thunderbird should have reconnected to
the server when I clicked check for new messages or at the scheduled hourly time
to check for new messages, and retrieved new articles.

Magnus Melin [:mkmelin]

Updated

•

9 months ago

Status: UNCONFIRMED → RESOLVED

Closed: 9 months ago

Component: Untriaged → Networking: NNTP

Duplicate of bug: 1876261

Product: Thunderbird → MailNews Core

Resolution: --- → DUPLICATE

gene smith

Assignee

Comment 2

•

9 months ago

•

Edited

(In reply to David Canzi from comment #0)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0

Note: all monitoring of network behavior is via wireshark.

Steps to reproduce:

Configure a Usenet news account for news.eternal-september.org. Select port 563
and SSL/TLS, have it check for new messages once per hour. Leave it alone for an
hour or so, then try to get new messages. Results are easier to see in a high
volume newsgroup such as alt.atheism.

Ok, did that. After 30 minutes the server produces a TLS alert that the connection is going down. Then TB initiates a close of the connection, as it should.

Actual results:

After an hour running thunderbird unattended, netstat showed that there were
no existing connections to the server.

After 30 minutes there will be no connection since the server wants to shut down the connection and TB shuts it down. Then after 30 more minutes, TB re-establishes the connection and brings in the new articles.

Clicking on Get New Messages did not
cause a new connection to be established. Netstat didn't show anything trying
to connect to the server.

If I click "Get messages" after 30m and the connection is still shut down, TB re-establishes it and TB fetches any new messages.

When I exited from thunderbird by clicking the x at the
right end of the title bar, a notification was displayed saying that Eternal September
had refused the connection.

I haven't tried that yet since everything seems to work as it should (for me).

The only way I could get to see new articles was to exit from thunderbird then
start it up again.

Haven't needed to do that.

These observations were made on a computer running Fedora 40 Linux and
thunderbird 115.11.0. I haven't tried any NNTPS server besides Eternal September.
This problem doesn't happen to unencrypted connections to port 119.

I'm on kubuntu 22.04 with latest self-built daily [128.0a1 (2024-06-10) (64-bit)] without the changes proposed in bug 1876261.

Expected results:

After a connection has been dropped, thunderbird should have reconnected to
the server when I clicked check for new messages or at the scheduled hourly time
to check for new messages, and retrieved new articles.

Seems to work for me.

gene smith

Assignee

Comment 3

•

9 months ago

I also tried it on fedora 40 with the same setup and seems to be working fine with TB at 115.9 there. (I haven't updated fedora 40 recently so not yet at 115.11.)
Were you not seeing the problem on earlier TB versions and just started seeing the problem at 115.11?

David Canzi

Reporter

Comment 4

•

9 months ago

I ran dnf upgrade late on Jun 5th, which updated thunderbird,
and I started noticing connection problems Jun 6th. I was able
to get another NNTPS server to work fine by using stunnel to
connect to it. I assumed that my connections were being timed
out by a firewall, and stunnel can be configured to time out
idle connections before the firewall does.
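
The workaround described above, giving up an idle connection before the firewall silently forgets it, can be sketched roughly as follows. This is only an illustration of the idea, not how stunnel or Thunderbird implement it; the class name `IdleGuard` and the 1500 second limit are assumptions chosen to sit below a 30 minute timeout.

```python
import time


class IdleGuard:
    """Hypothetical helper: decide when a connection has been idle too long.

    max_idle_s should be shorter than the shortest idle timeout on the
    network path (assumed here, not measured).
    """

    def __init__(self, max_idle_s, clock=time.monotonic):
        self.max_idle_s = max_idle_s
        self._clock = clock
        self._last_used = clock()

    def touch(self):
        # Call after every successful send or receive.
        self._last_used = self._clock()

    def should_reconnect(self):
        # True once the connection has been idle for max_idle_s or longer,
        # meaning it should be closed and reopened before the next command.
        return self._clock() - self._last_used >= self.max_idle_s


# Example: with a 30 minute firewall timeout, stop trusting the
# connection after 25 minutes of silence.
guard = IdleGuard(max_idle_s=1500)
```

A client would check `guard.should_reconnect()` before each command and open a fresh connection when it returns True, rather than sending on a socket that may already be dead.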

This morning my router failed for several hours and I rebooted
it. I can't get thunderbird to reconnect to Eternal September
now. I will shortly exit from thunderbird and restart it. I have
configured thunderbird to check for new messages every 35
minutes, intentionally longer than the server's timeout. I'll
get to observe how long after a timeout it becomes possible
to connect again.

David Canzi

Reporter

Comment 5

•

9 months ago

I started thunderbird anew just before 17:20 EDT. It was configured
to check for new messages in alt.atheism every 35 minutes. Netstat
showed two sockets connected to Eternal September. Those
sockets were closed between 17:55 and 18:00, so they were open,
or seemed to be open, for at least 35 minutes. These symptoms
resemble what happens when a firewall interrupts an idle
connection, leaving the client holding one end of a dead
connection and only discovering that the connection is dead
when it tries to use it.

An hour and a half has passed since the sockets were closed
and I still can't get thunderbird to reconnect to Eternal
September and download new messages.

gene smith

Assignee

Comment 6

•

9 months ago

I've tested with 115.11.0 provided by ubuntu and don't see a problem.

I have made a patched TB 115.11.1 version including the changes at bug 1876261 that would be good for you to try. If that fixes the problem for you it will tell for sure that what you see is the same problem as (duplicate of) bug 1876261. If not, then it is caused by something else like maybe the hypothetical firewall you mention and I have more work to do.
Here's the link to the .tar.bz2: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/dV9he__TS2CH3I8KmOEQnA/runs/0/artifacts/public/build/target.tar.bz2
You can unzip this anywhere and just run from command line the "thunderbird" executable inside. It will announce itself as "daily" but in Help/About you should see "115.11.1". You don't have to touch your fedora supplied thunderbird and can go back to using it if you want.

FYI, the full "try" build is here: https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=eb681a5ae00bb9257f845ad9a9cbf646758081bf

David Canzi

Reporter

Comment 7

•

9 months ago

Sorry about the delay.

I ran the version of thunderbird from target.tar.bz2, configured
it to access Eternal September and to check for new messages
every 35 minutes. I exited, started it anew at 14:00, and did
not interact with it. The socket it used to connect to ES
disappeared from netstat output about 35 minutes later, and
35 minutes after that it reconnected to ES and successfully
downloaded new articles.

I consider this bug to have been fixed.

My computer did not receive the notification from ES that
it should close the connection, so I still suspect a firewall
is part of the problem for me. It's not a firewall operated
by ES, but might be operated by my internet provider.

gene smith

Assignee

Comment 8

•

9 months ago

Ok, thanks! Good to hear that it is working OK with the patch and that it resolves this issue in the "real world". (My testing on the patch was done by simulating network errors by blocking ports with iptables.)

David Canzi

Reporter

Comment 9

•

9 months ago

I consider this bug to have been fixed.

I spoke too soon. Though this version (115.11.1) reconnected
35 minutes after closing the socket, it hasn't repeated this
feat. It last closed its socket at about 15:45, it is now 16:45,
and nothing I do can make thunderbird connect again to
the server.

gene smith

Assignee

Comment 10

•

9 months ago

I spoke too soon.

After I responded in comment 8 I decided to simulate what you describe again by blocking incoming port 563 using sudo iptables -A INPUT -p tcp --sport nntps -j DROP, doing this a bit before the server sends the alert to drop the connection after it has been idle for 30m. I can see in wireshark that the server sends the alert but TB doesn't see it due to the iptables rule. TB still thinks the connection is active and at the 35m point, TB requests nntp stuff from the server but sees no response (per wireshark, the response is sent by the server). Then 35 minutes later TB tries to request data from the nntp server again and again sees no response, and by then the server has reset/dropped the connection. So by now, both have dropped the connection. I then remove the iptables rule and for the next hour nothing is sent, and I thought TB had completely given up. But then a connection attempt is sent by TB, the server responds, and nntp stuff flows again. After 30 minutes, the nntp server drops the connection and after 5 more minutes TB reconnects and requests nntp data. This seems to be continuing and working, with the server TLS layer alerting TB that the connection is about to go down after 30m idle and TB requesting nntp data on a 35m interval.

So, for me, it looks mostly OK except for the 1 hour "gap" that I saw where nothing happened at all. Once I get past that, normal polling resumes with the iptables rule removed.

It last closed its socket at about 15:45, it is now 16:45,
and nothing I do can make thunderbird connect again to
the server.

Are you doing "Get messages" to request the messages or maybe just selecting messages in the list? I just let it sit and let TB's timed polling (called "biff" in the code) bring in the new messages on that high volume list.

If you could record and attach a wireshark file it might show better what is going on and I can compare it to what I see here. I just run with a simple filter tcp.port==563.

gene smith

Assignee

Comment 11

•

9 months ago

•

Edited

Ok, getting kind of late here but I did another thing to try to duplicate what you see. Instead of using my normal wifi via Charter isp, I disabled wifi on my phone and connected using the hotspot feature to, I think, AT&T internet. I noticed, with wireshark, it then started using ipv6 to connect to the ES nntp server. Then after idle for 30m, there was no notification that the connection was going down so at 35m TB just sent the nntp command to fetch messages, unaware that the connection was gone. Several retries are done with, of course, no success. Then after over an hour a new successful connection to the server is made.
So with the "biff" timer the time between server connections goes up to a bit over an hour. But I can still make immediate connections to the server by selecting the folder or clicking "get messages" with no problem while "biff" is waiting to do a connection.

Anyhow, not sure if this will help but I've added a "tcptimeout" to the patch. I had it in at one time but decided that it didn't help. However, with what I am seeing using the hotspot route, it may actually help. So I've made a new try build including the tcptimeout that you can test:
Edit: This is not correct, see comment 12:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/HVPjJA22QgWU0veaQf5qCA/runs/0/artifacts/public/build/target.tar.bz2
which is at try build https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=45fc63099e37d794d81e6fd592338b51b4de8420&selectedTaskRun=HVPjJA22QgWU0veaQf5qCA.0

The tcptimeout will signal that the connection is down if there is no network response after 25 seconds. So when the server drops the connection without a notification via TLS, tb will still know after 25 seconds that a new connection needs to be established and not just keep retrying in vain.
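
The behaviour described above (no response within the timeout means the connection is presumed dead, so reconnect and retry) can be sketched in Python as below. This is an illustration under stated assumptions, not the patch itself: `send_with_timeout`, the test server, the group name `alt.test`, and the 0.2 second timeout used in `demo()` in place of 25 seconds are all made up for the example.

```python
import socket
import threading


def send_with_timeout(addr, command, timeout_s, sock=None):
    """Send `command`; if nothing comes back within `timeout_s`, treat the
    connection as dead, open a new one, and retry once.

    Returns (reply, sock, reconnected).
    """
    for attempt in range(2):
        if sock is None:
            sock = socket.create_connection(addr, timeout=timeout_s)
        sock.settimeout(timeout_s)
        try:
            sock.sendall(command)
            reply = sock.recv(4096)
            if reply:
                return reply, sock, attempt == 1
        except socket.timeout:
            pass  # silence: assume the path dropped the connection
        sock.close()
        sock = None
    raise ConnectionError("no reply even on a fresh connection")


def silent_then_ok_server(listener):
    """Test double: never answer the first client, answer the second."""
    first, _ = listener.accept()   # kept open but ignored, like a dead path
    second, _ = listener.accept()
    second.recv(4096)
    second.sendall(b"211 0 0 0 alt.test\r\n")
    second.close()
    first.close()


def demo():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(2)
    addr = listener.getsockname()
    server = threading.Thread(target=silent_then_ok_server, args=(listener,))
    server.start()
    stale = socket.create_connection(addr)  # stands in for the idle connection
    reply, sock, reconnected = send_with_timeout(
        addr, b"GROUP alt.test\r\n", 0.2, sock=stale)
    sock.close()
    server.join()
    listener.close()
    return reply, reconnected
```

The cost is the same one noted for the patch: the client has to wait out the full timeout before it can conclude that the old connection is gone.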

If this still doesn't help, I'll probably still need to see a wireshark trace to know what's going on at your location.

gene smith

Assignee

Comment 12

•

9 months ago

David, I don't know if you have tried the change pointed to in the previous comment yet. I left out part of what I should have changed, which I discovered after testing it here this afternoon. I need to do some more testing and, if it looks ok, I'll re-do the patched 115.11.1 build.

David Canzi

Reporter

Comment 13

•

9 months ago

David, I don't know if you have tried the change pointed to in the previous comment yet.

I tried it at nearly the same time that you added comment 12. It opened a
connection to ES, downloaded new articles, and about half a minute later
closed the socket.

I've installed wireshark and experimented a little.

I have perl scripts that connect to NNTP and NNTPS servers independently
of thunderbird, and cron jobs that run these scripts. I'll have to figure out
when a wireshark capture won't be cluttered with irrelevant connections
and/or temporarily stop running some of these cron jobs.

I have done some extreme things. When I was puzzled by what was
happening I tried repairing my alt.atheism folder. I tried this twice.
When I noticed that thunderbird wasn't downloading new articles
even when I started it anew, I deleted the Eternal September
account, recreated it, subscribed to the newsgroups I had subscribed
to before, and reinstalled the filters.

Thunderbird seems now to be keeping up to date in alt.atheism,
but it downloads very few articles in my other two newsgroups,
and I need to check if the arrival of new articles is really that
infrequent. Need to observe a while longer.

gene smith

Assignee

Comment 14

•

9 months ago

Thunderbird seems now to be keeping up to date in alt.atheism, ...

Sounds like you are saying you don't see the original problem described on comment 0 now that you repaired the folder?

I was never able to exactly duplicate what you described (wouldn't connect or bring in new messages even if "new messages" clicked after idle for more than 30m). But, as I tried to describe in other comments, I do see another problem when I use an alternate network route (via cell phone provided internet). What I see with that is there is no indication that the server has timed out the idle connection. Sometimes I see this as soon as 3 minutes after the connection goes idle: when TB tries to send again on the idle connection, a timeout is seen.

In the NNTP rfc, in the paragraph right before here https://datatracker.ietf.org/doc/html/rfc3977#section-3.1.1, it talks about a 3 minute inactivity timer. It also says that when the timer expires, the server SHOULD close the connection without sending any response to the client. I think this means there is no NNTP protocol response. But the ES server (at least the one I connect to via cell network) doesn't appear to send any response even at the TCP/IP level. When I connect via my usual Charter/Spectrum ISP, on a connection that has been idle a long time there is first a TLS level alert that the connection is going down, which TB sees, and it sends a TCP/IP "FIN" to shut down the connection. Using my usual (Charter) ISP, the timeout is 30m and not 3m like I sometimes see when connecting via the cell route.

If you could check this at your location with wireshark, it might tell us something. Probably what you need to do is turn off "Check for new messages every X minutes" for ES account, select group alt.atheism and then shutdown TB and start wireshark listening on your network interface and set the filter to tcp.port==563. Then start TB and watch the wireshark display. You should see connections to ES and initial activities and (since very active) possibly new messages come in. Then just let it sit idle without touching TB and see if ES eventually sends anything like a TLS alert that the connection is going down or TCP/IP FIN or RESET. Depending on how long their idle timer is set, the TLS alert or FIN/RESET could take from 3 minutes to maybe 30m. But if you don't see anything at all from ES after more than 30m, then you are probably seeing the same thing I see.

If you are seeing the same thing as I see, it seems to me there is a problem in the ES server configuration.

Also, but probably not important, does wireshark show ipv4 or ipv6 packets when connecting to ES?

gene smith

Assignee

Comment 15

•

9 months ago

•

Edited

Since you had to do a repair and assuming I understand correctly and it fixed the problems, you might want to try the latest beta 128.0b1. Unlike the latest 115, it contains a fix for bug 1857450 which may be the reason you needed to repair the folder.
I think this assumes you are using offline store with the nntp account (referred to as "cache" in the bug title).
Note: It seems that when you set up a new nntp account, offline store is turned off by default.
Here's a link to the beta .tar.bz2: https://archive.mozilla.org/pub/thunderbird/releases/128.0b1/linux-x86_64/en-US/thunderbird-128.0b1.tar.bz2
If you want a different locale than en-US, others are there too in the archive.

Edit: re-read your comment 13. I suppose you could say re-creating the account is the ultimate folder repair. Anyhow, still think the beta may prevent future problems and it will be in the upcoming new release (128?).

Comment 16

•

9 months ago

I will read your last 2 comments shortly, but I'll say some things
briefly. I have packet captures for version 115.11.0 and version
115.9.0. The results are similar. On startup, Thunderbird downloads
new articles from Eternal September, then nothing else happens
until 35 minutes later. My computer doesn't receive anything
from ES at or near 30 minutes.

I have Thunderbird currently getting articles from two news
servers, one using port 563, the other using port 119 without
encryption. At startup, Thunderbird opens two connections
to ES, port 563, and two to paganini.bofh.team, port 119.
After 35 minutes it closes all of these sockets and opens
one new connection to paganini. After this, I can't get
version 115.9.0 to communicate with ES. I haven't tested
115.11.0 at this same point. After another 35 minutes, Thunderbird
closes its connection to paganini and opens two new
connections to paganini and none to ES. After this I can't get
version 115.11.0 to communicate with ES.

I don't know how to attach packet captures to this comment.

gene smith

Assignee

Comment 17

•

9 months ago

•

Edited

(In reply to David Canzi from comment #16)

I will read your last 2 comments shortly, but I'll say some things
briefly. I have packet captures for version 115.11.0 and version
115.9.0. The results are similar. On startup, Thunderbird downloads
new articles from Eternal September, then nothing else happens
until 35 minutes later.

What I see at the first 35m point is TB doesn't know the connection has been closed by ES. It then tries to send an NNTP command to check for new messages and no response is received after several requests are sent. Since the release TB doesn't have a tcptimeout enabled, there is no indication to TB that the connection is down so it gives up and waits for the next 35m "biff" time to occur. By then it somehow knows the connection is gone (not sure how) and at the 70m point it does a reconnect and successfully communicates with ES and brings in any new messages.

My computer doesn't receive anything from ES at or near 30 minutes.

I think this is a problem (bug?) at the ES server. When connecting to ES with port 119 (no TLS) at the 30m point it sends a tcp FIN to close the connection and everything works as expected. If port 563 (TLS) is used and the route is via my charter isp, ipv4 is used and at the 30m point ES sends an "alert" via the TLS protocol that the connection is going down. TB then sends the FIN to close the connection and all is fine. But if I connect via cell phone network (AT&T as isp) it uses ipv6 (may not be significant) but at the 30m point, there is no TLS alert or FIN sent by the ES server; either that or it is blocked by some other server along the route. This seems like what you are seeing too if wireshark shows nothing coming in from ES at the 30m point.
Note: I haven't tried connecting via cell network to ES with port 119.

I have Thunderbird currently getting articles from two news
servers, one using port 563, the other using port 119 without
encryption. At startup, Thunderbird opens two connections
to ES, port 563, and two to paganini.bofh.team, port 119.

If you have more than 1 group subscribed on a server and if you are set up to check for new messages at startup, there will, by default, be two connections per server created at start up. You can adjust this with the TB config editor (advanced) parameter mail.server.serverX.max_cached_connections where X is the server number. You have to look at items "mail.server.server" to determine the server number X. You can set the parameter to 1 if you want only one connection to occur, or you can set it larger if you have a lot of subscribed groups. (I think I saw on ES site they allow only 2 connections.)

After 35 minutes it closes all of these sockets and opens
one new connection to paganini. After this, I can't get
version 115.9.0 to communicate with ES. I haven't tested
115.11.0 at this same point. After another 35 minutes, Thunderbird
closes its connection to paganini and opens two new
connections to paganini and none to ES. After this I can't get
version 115.11.0 to communicate with ES.

Here's what I see with the release TB version. If I set the time between checks for messages to 1 hour (60m biff time) using IPv6 via cell route, TB fails when checking for messages at the 60m point, then at the 120m point it tries again and succeeds. It then fails again at the 180m point and succeeds at the 240m point. So effectively the biff time is 120m (2 hours).
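
The doubling of the effective biff time can be reproduced with a toy model. The sketch below assumes, as a simplification of what is described above, that a connection idle for longer than `idle_limit_min` is silently dead, that a check on a dead connection fails and discards it, and that the next check opens a new connection and succeeds. The function name and parameters are made up for illustration.

```python
def simulate_biff(interval_min, idle_limit_min, total_min):
    """Return [(minute, succeeded)] for each scheduled check.

    Assumes a successful fetch at minute 0 on a fresh connection.
    """
    results = []
    connected = True
    last_activity = 0
    for t in range(interval_min, total_min + 1, interval_min):
        if connected and t - last_activity > idle_limit_min:
            # The path dropped the idle connection without telling the
            # client: this check fails and the connection is discarded.
            connected = False
            results.append((t, False))
        else:
            # Either the connection is still fresh or a new one is opened.
            connected = True
            last_activity = t
            results.append((t, True))
    return results


print(simulate_biff(60, 30, 240))
# [(60, False), (120, True), (180, False), (240, True)]
```

With any interval longer than the idle limit, successes land only on every second check, which matches the observed 2-hour effective period for a 60m setting and the 70m reconnect for a 35m setting.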

I have a fix for this: before the send at the scheduled time, I configure the tcptimeout for 25 seconds. Now TB will see the timeout signal (from the mozilla framework) and know the connection is gone, create a new connection and retry the command. The only problem I see with this is it takes 25 seconds for TB to know/assume that the connection is down. So there will be a 25 second delay if you manually request a new message after TB has sat idle long enough for the server to silently disconnect (which is something I don't think the server should do).
~~The fix: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MQRsTmncTh2OJ5j9ESd38w/runs/0/artifacts/public/build/target.tar.bz2~~ Edit: Don't use, has a problem. See next comment.
Again, this is a patched version of 115.11.1.
Here's the full "try" build link: https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=c325ab932f620a8ab4b89c563436f240e92a39a7

I don't know how to attach packet captures to this comment.

If you saved the capture file, you can attach it using the "Attach new file" button above. Or if you prefer, you can just email it to me. I think I have received files OK that are more than 16M in length. But maybe just try the patched version ~~above~~ below and see if that has any effect first.

gene smith

Assignee

Comment 18

•

9 months ago

David, Sorry (again!), don't use the link to target.tar.bz2 in previous comment. I just remembered I had to make one other change since it caused a "timed out" screen to appear incorrectly. I'll update the try build right now.

gene smith

Assignee

Comment 19

•

9 months ago

Corrected "try" build based on 115.11.1:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Rn6mNf0pRU6krqKIbRLWRw/runs/0/artifacts/public/build/target.tar.bz2
Full "try" build info:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=24e555c7f2a57cacbf5e29850aea041a6657cfd3

gene smith

Assignee

Comment 20

•

9 months ago

From comment 17:

Note: I haven't tried connecting via cell network to ES with port 119.

I just did that and I think it shows the problem is NOT with ES server. I'm also seeing ipv6 in use with cell network and port 119 (no TLS). When TB tries to send another NNTP request at the 35m point there is no response from the ES server. TB then retries a few times and after 25s (the tcptimeout time I have set on the patch) TB sends a FIN to try to close the connection and then starts a new connection and successfully sends NNTP commands and brings in any new messages. It then waits again, etc ...

So everything works OK with ES using my ISP's network on port 119 or 563. It only has a problem (both port 119 and 563) when going via my cell network. So there must be something in the cell network (a router, firewall or whatever) that is also silently dropping the connection after a certain time, usually 30m. All I could find about AT&T is that it maintains a mobile originated (cell) idle connection for 30m (1800s) before dropping the connection so maybe that is why. Re: https://developer.att.com/technical-library/network-technologies/network-timers. I can speculate that both AT&T and ES drop the idle connection at about 30m. But AT&T drops it first so any alert or FIN sent by ES never arrives after AT&T drops the connection silently: "...almost all firewalls will silently remove idle connections from their state and will not initiate a close (send a TCP FIN or RST) to the client or server. ..." from here: https://aws.amazon.com/blogs/networking-and-content-delivery/introducing-configurable-idle-timeout-for-connection-tracking/

Anyhow, after all this, I should just say that I still think the version linked to in comment 19 will fix the problem, at least it does for me.

David Canzi

Reporter

Comment 21

•

9 months ago

I think you figured out the key facts in comment 20, that the fate
of a connection is affected by which networks it passes through, and
that firewalls are rude.

I have come to the conclusion based on my own observations that
the problem I am having is specific to my location. Until recently
I had Thunderbird connecting to 3 servers, one using port 119 and
two using port 563. All 3 started having this problem at the same
time, but for the server using port 119 the symptoms were much
less debilitating. What do all 3 of these problems have in common?
Me, my computer, my provider's network. Something within that set
of places changed, causing Thunderbird, which had been working
fine, to start failing.

I'll try the try build from comment 18, and perhaps the beta version
from comment 15.

gene smith

Assignee

Comment 22

•

9 months ago

•

Edited

I'll try the try build from comment 18, and perhaps the beta version
from comment 15.

Just to be clear, the try version in comment 19 (comment 18 is broken) and the beta mentioned in comment 15 address completely different problems. The beta only may affect possible folder corruption and doesn't affect the network problems that will hopefully be fixed by comment 19 try build.

So I would recommend running the "try" build first (comment 19) and see if it helps. If it does, I can incorporate the network fix into the current beta and make a try build of the patched beta and you can then run that. Hope this makes sense!

David Canzi

Reporter

Comment 23

•

9 months ago

Each of the first 2 try builds I tried created a new directory
under ~/.thunderbird, and then I could create an account
for Eternal September, enter a user name and password,
subscribe to alt.atheism, and test things.

On startup, the latest try build re-used the profile created
by the second try build, displayed the previously downloaded
articles, but didn't download any new articles. It didn't even
connect to ES. When I unsubscribed from alt.atheism and
deleted the account and recreated the account, I couldn't
subscribe to any newsgroup. When I clicked on Manage
Subscriptions, it showed me a subscribe window, but the
box that should show a list of newsgroups was empty.
When I clicked Refresh, it displayed a message saying
Please wait, and nothing further happened. It didn't
connect to ES. It didn't pop up a window to request
a user name and password.

gene smith

Assignee

Comment 24

•

9 months ago

... It didn't connect to ES.

Sorry again. I totally messed up the code (again). I left in an extra } causing a "javascript syntax error" so NNTP stuff immediately gives up and does nothing. I was over-confident and didn't test the try build myself after I "merged" in the changes. I'll go ahead and fix it and personally test it this time to make sure it works before I post another link.

gene smith

Assignee

Comment 25

•

9 months ago

Fix with syntax error fixed (and tested some):
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MKErGhaITRGWUkKPHKcoIQ/runs/0/artifacts/public/build/target.tar.bz2
Here's the try build info:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=5d07611b38913021bdabcd62b5140c3d9d723c3f

David Canzi

Reporter

Comment 26

•

9 months ago

Using the latest try build, I subscribed to paganini.bofh.team using
port 119, set up to check both servers for new articles every 50
minutes. Exited. Started two tshark commands, one for port 563,
one for port 119. Started Thunderbird and left it alone for a few
hours. After a few hours, when I knew the connections to the
servers were about 40 minutes old, I did Get Messages on
paganini. A pop-up notification said that paganini had reset the
connection. A second Get Messages caused Thunderbird to
connect to paganini and download new articles. The first
Get Messages on ES produced no visible results. The second
Get Messages opened a new connection and downloaded new
articles.

After I start Thunderbird, the first time I do something that
requires a connection to Eternal September, a pop-up
asks me for user name and password. If I'm a little slow
getting these fields entered, when I press Login, I get a
pop-up notification that the connection has timed out.

The packet logs show that, at 50 minutes and 150 minutes,
Thunderbird sends out something and an incoming packet
arrives with the src IP address of the server and a RST flag.
At the first Get Messages, the same thing happens as at
50 and 150 minutes, but with a wrinkle. For ES, several
additional events I don't understand were logged between
the first outbound packet and the returning RST packet.

I think the RST packets are something new. They arrive
less than a second after the outbound packet.

At 100 and 200 minutes and at the second Get Messages,
a new connection is opened and packets are exchanged.

If I tell Thunderbird to check for new articles every N
minutes, where N > 30, it checks for new articles every
2N minutes. Sometimes Get Messages doesn't work.
When that happens, a second Get Messages works.

The try build only opens one connection to the server.
This is a sensible thing to do. IMAP clients need to
open multiple connections. NNTP clients don't.

This comment is more complex than I thought it would be
when I started writing it.

David Canzi

Reporter

Comment 27

•

9 months ago

Attached file Packet captures from try build using two servers, one NNTP and one NNTPS. — Details

gene smith

Assignee

Comment 28

•

9 months ago

Thanks for the capture files. I've only looked at the 119 (NNTP) one so far. What I'm seeing is at the 50m point TB does a GROUP request and gets a RST. So TB never received a FIN that the server had timed out the connection and I'm not sure if the server has actually closed the connection. But when TB sends on that same port at each 50m point, it gets a reset (RST) response so no tcptimeout is generated by mozilla since there was an actual response, the RST. Currently this just looks like a random error to the TB NNTP code so TB just waits for the next "biff" interval to re-connect and retry the GROUP command, so the effective biff time becomes 2*50 = 100m.
I may see a way to fix this by treating the RST from the server the same as a tcptimeout from mozilla framework.
I'll look at the 563/NNTPS capture now...
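
The doubling of the effective biff time can be illustrated with a small simulation. This is only a sketch of the behavior described above, not Thunderbird code: the function name, the 30-minute idle limit, and the `retry_on_reset` switch are assumptions made for illustration.

```python
def successful_checks(biff_minutes, idle_timeout_minutes=30, ticks=4,
                      retry_on_reset=False):
    """Return the minutes at which a check for new articles succeeds.

    Simplified model: a connection is opened at startup (minute 0). If it has
    been idle longer than the idle timeout when the next biff fires, the GROUP
    command gets a RST. Without a retry, that biff is wasted and the next one
    opens a fresh connection.
    """
    successes = []
    connected = True
    last_used = 0
    for tick in range(1, ticks + 1):
        now = tick * biff_minutes
        stale = connected and (now - last_used) > idle_timeout_minutes
        if stale and not retry_on_reset:
            # RST is treated as a generic error: drop the connection, no retry.
            connected = False
            continue
        # Either the connection is still usable or a new one is opened now.
        connected = True
        last_used = now
        successes.append(now)
    return successes


print(successful_checks(50))                       # checks land at 100 and 200
print(successful_checks(50, retry_on_reset=True))  # every biff succeeds
```

In this model any biff interval above the idle limit yields successful checks only every second biff, which matches the 2N pattern reported earlier in the thread.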

gene smith

Assignee

Comment 29

•

9 months ago

Looks like the same problem on the NNTPS/563 server with the RST response on the timed out connection. I don't know why I don't see anything like that from my location connecting to the same servers. As you suggested, maybe it's a router or firewall along your path and not the actual server sending the RST?

Also, I think you will see the same problem on the first "get messages" not working after more than 30m idle and then the 2nd one will work, since TB is not getting a tcptimeout but a RST or FIN response from the server, which resets the tcptimeout timer in the mozilla network code.

Regarding the timeout when entering the password manually, yes, that will happen if you take more than 25s to complete the entry. I should probably not enable the tcptimeout until authentication is complete.

I'll try to fix these issues and make another try build for you.
Thanks for working with me on this!

gene smith

Assignee

Comment 30

•

9 months ago

After I start Thunderbird, the first time I do something that
requires a connection to Eternal September, a pop-up
asks me for user name and password. If I'm a little slow
getting these fields entered, when I press Login, I get a
pop-up notification that the connection has timed out.

Question: When this happens and you see the timed out pop-up, do you have to put in your username/password again? When I do this it saves the password/username (to ram memory) but just doesn't bring in new messages until I click "get messages" and then it uses the newly entered and stored credentials from memory in the session. So, even though there is a timeout, the credentials are saved and are still used (unless, of course, I put in the wrong username and password).

David Canzi

Reporter

Comment 31

•

9 months ago

Question: When this happens and you see the timed out pop-up, do you have to put in your username/password again?

No. Once is all that's needed.

As of yesterday afternoon, the installed version (115.12.0) on my computer
will not show me new articles in alt.atheism even after I exit and start
over. (Other newsgroups are so far unaffected.) Repairing the folder didn't
work. Maybe I'll just use the try build to read news...

David Canzi

Reporter

Comment 32

•

9 months ago

In the paganini.bofh.team account I have unsubscribed and then
resubscribed to alt.atheism. Before resubscribing, I removed
alt.atheism.dat. I started without filters and recreated the two
most important ones. And for the past 2 days, the installed
version of Thunderbird (115.12.0) has been downloading
alt.atheism from bofh.team successfully.

alt.atheism was the first of three newsgroups I subscribed to
from bofh.team. Unsubscribing and resubscribing moved it to
the end of the list, and I can explain why that helps, if you're
interested. I can't explain why removing alt.atheism.dat would
help.

gene smith

Assignee

Comment 33

•

9 months ago

All I see in .dat file for all the groups is this and have no idea yet as to what it means:

version="9"
logging="no"

Is filtering actually a factor in the issues you are having? I think you only mentioned filters once before but I never followed up on it.

Anyhow, based on what I see in your two capture files from tshark, I think I have a workable fix for the unexpected RST responses from the server on the next command after the 50m idle times between biffs. Also, and harder to figure out, I think I've eliminated the timeout pop-up when entering credentials. I also checked to make sure there are no spurious timeouts while doing other things where a connection might go idle long enough for a timeout to be detected, like posting (tested a couple of posts on alt.alt.test -- which I assume is what it's for).

Here's a new "try" build based on patched 115.12.2 (running it right now): https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/H5jGmrVlSTuqISc_w0jbew/runs/0/artifacts/public/build/target.tar.bz2
And here's the complete try info: https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=b8976fced4976f82e55d9c390fcd833f900bfd18&selectedTaskRun=H5jGmrVlSTuqISc_w0jbew.0

Unsubscribing and resubscribing moved it to
the end of the list, and I can explain why that helps, if you're
interested.

Sure, I'm all ears.

I can't explain why removing alt.atheism.dat would
help.

Do I understand right that what you did only helped with the bofh site and not with the ES site?

gene smith

Assignee

Comment 34

•

9 months ago

•

Edited

From David's comment 26:

The try build only opens one connection to the server.
This is a sensible thing to do. IMAP clients need to
open multiple connections. NNTP clients don't.

I didn't do anything in the patch that changed the number of connections to the server. Currently the default number of connections per nntp server is 2. (It can be changed in the prefs / Config Editor.) What I observe is that if I have two groups subscribed on an nntp server and I have "check for new messages at startup" selected, two connections will be created and both groups will be accessed in parallel, each using its own connection, after the first startup. If "check for new messages at startup" is not selected and you are not selected on a group for that server at the last shutdown, on next startup no connection will occur until a group is selected, and the one connection will be used to obtain new messages for that selected group. Then at the first biff interval (assuming biff is enabled) after startup (e.g., the 50m point) you will have 2 connections. The original connection will probably be timed out after 50m, so there will usually be two new connections created, one to replace the original timed-out connection and one for the other group.

gene smith

Assignee

Comment 35

•

9 months ago

I'm re-opening this since I'm seeing this as its own bug and not a duplicate of bug 1876261.

Assignee: nobody → gds

Status: RESOLVED → REOPENED

No longer duplicate of bug: 1876261

Ever confirmed: true

Resolution: DUPLICATE → ---

David Canzi

Reporter

Comment 36

•

9 months ago

I started this bug report because of a networking problem, and
while I was experimenting an unrelated problem arose, where
Thunderbird would not download new articles even after exiting
and restarting. The latter problem should probably be the topic
for some other bug report.

Filtering and alt.atheism.dat are related to this second problem,
which is not network-related. If you define filters for a newsgroup,
they are written into the newsgroup's .dat file.

Sure, I'm all ears.

I saw the following sequence of events in a packet capture:

Thunderbird sends GROUP alt.atheism
My computer receives an RST packet
Thunderbird opens a new connection
Thunderbird sends GROUP alt.free.newsservers

Thunderbird's recovery from the RST did not include
retrying GROUP alt.atheism. For whatever newsgroup
is first in the list, the GROUP command fails and is
not retried. When alt.atheism is no longer first, it no
longer has this problem.

Do I understand right that what you did only helped with the bofh site and not with the ES site?

Removing alt.atheism.dat seemed to help for both ES and
bofh.team, but it helps with the second problem which is
not the topic of this bug report.

I'll download the new try build and see how it does.

gene smith

Assignee

Comment 37

•

9 months ago

•

Edited

If you define filters for a newsgroup, they are written into the newsgroup's .dat file.

Of course, I should know that. But I haven't dealt much with filtering during my TB "career". :)

I saw the following sequence of events in a packet capture:
Thunderbird sends GROUP alt.atheism
My computer receives an RST packet
Thunderbird opens a new connection
Thunderbird sends GROUP alt.free.newsservers

I assume this is right after a 50m (or whatever time) biff interval. I see a similar sequence with unpatched code except I never see a RST from my location. So from here, sending the first GROUP in your sequence never gets a response at all. With the patch in place that you are testing now, a TIMEOUT error will be generated after 25s with no response to a command. Also, a RST response when attempting to "reuse" a connection will be treated like a TIMEOUT. So if a TIMEOUT or RST occurs when a connection is being reused, a new connection will be opened to retry the command, and the old timed-out or reset connection will be closed locally and a tcp FIN will be sent to ensure the remote host sees that it is closed.

If you are still having network problems with the latest patch to 115.12.2, another tshark capture file would be most helpful to see what's going on, since, unfortunately, I don't see the same things here, i.e., you see RST, I see no response and a timeout.
Thanks!
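
The reconnect-and-retry behavior described above can be sketched roughly as follows. This is not the patch itself (the NNTP code in TB is JavaScript); `StaleConnection`, `FakeConnection`, and `send_with_retry` are hypothetical names, and the reply string is a placeholder rather than a real server response.

```python
class StaleConnection(Exception):
    """Hypothetical marker: TIMEOUT or RST seen while reusing an idle connection."""


class FakeConnection:
    """Stand-in for one NNTP connection; `stale` simulates a dead idle socket."""

    def __init__(self, stale=False):
        self.stale = stale
        self.closed = False

    def send(self, command):
        if self.stale:
            raise StaleConnection(command)
        return "OK " + command  # placeholder reply, not real NNTP output

    def close(self):
        # In the real client this is where the local close / FIN would happen.
        self.closed = True


def send_with_retry(conn, command, open_connection):
    """Send a command; if the reused connection is stale, replace it and retry once."""
    try:
        return conn, conn.send(command)
    except StaleConnection:
        conn.close()
        fresh = open_connection()
        return fresh, fresh.send(command)


old = FakeConnection(stale=True)
conn, reply = send_with_retry(old, "GROUP alt.atheism", FakeConnection)
print(old.closed, reply)
```

The point of the sketch is that the command that hit the dead connection (here GROUP alt.atheism) is the one that gets retried, instead of being skipped in favor of the next group in the list.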

David Canzi

Reporter

Comment 38

•

9 months ago

The try build I was using as of comment 26 only opened 1 connection
to each server the first time I ran it. It opened 2 connections per
server after that, until I set max_cached_connections=1. (And now
and then something changes max_cached_connections back to 2.)

I resumed using 35-minute biffs.

On one test, the latest try build popped up a request for userid
and password 35 minutes after I started Thunderbird. I quit out
of Thunderbird, and now I think I should have observed longer
and taken notes.

Next, I ran 2 more tests: one in which I entered name and password
quickly and one in which I entered them slowly. Neither of these
popped up a second request for credentials. At 35 minutes, both
of these lost their connection with Eternal September and nothing
I did would cause them to reconnect. When I quit, I got notification
that Eternal September refused my connection.

I see that this build retries the GROUP command it sent just before
it receives RST. Thank you.

Thank you for re-opening this request and deleting its connection
with the other request.

I'm attaching a tar file containing captures for the test that
requested credentials again and captures for one of the tests
that lost contact with Eternal September.

David Canzi

Reporter

Comment 39

•

9 months ago

Attached file 2024-06-25.pcaps.tar Packet captures from Thunderbird tests — Details

gene smith

Assignee

Comment 40

•

9 months ago

Right now I'm looking at the t.1719351607/563.cap and notes. I see at about the 35m point what is probably the encrypted GROUP command being sent, as it should be. Then there's some stuff I don't completely understand ([TCP previous segment not captured], etc.), then maybe a retry of the GROUP, then TB sends a FIN, all at timestamp 22:15:44. Then ES responds with 2 RSTs. But the RSTs don't cause TB to attempt an immediate reconnect like I would expect, given what I put in the latest try build.

Then at 22:23:09 TB attempts a new connection (SYN), ES responds with SYN/ACK as it should and then TB immediately responds with a RST. Don't know why RST is sent when it should send an ACK! Not sure if this new connection attempt is due to biff or if you manually clicked "Get messages".

I think you are also saying that even additional clicks of "get messages" or biff intervals don't cause a new connection after the last time in the capture file?

I haven't asked you to do this, but maybe we need logging for the NNTP transactions inside TB. This is accomplished by going into the Config Editor and setting the parameter mailnews.nntp.loglevel to All. The logging info is written to the console, opened with ctrl-shift-j. At the gear icon you should turn timestamps on so we can correlate the log data with the tshark capture files. You can copy and paste the console screen output to a file and attach it here. (Not sure, it may show your userid and password, so you may need to edit it.)

I'll look at other files now...

David Canzi

Reporter

Comment 41

•

9 months ago

22:23:09 was when I quit TB. The uncompleted connection
handshake happens twice sometimes, maybe when
max_cached_connections=2. I think TB opens a socket,
runs the connect system call, and then promptly closes
the socket and/or exits.

I do three things to try to trigger a reconnection:

Get Messages.
Select the newsgroup.
Scroll down and try to view the last article.

All three of these failed to cause TB to reconnect to ES.

gene smith

Assignee

Comment 42

•

9 months ago

Ok, thanks. That's interesting but also frustrating since I can't duplicate what you see.

All three of these failed to cause TB to reconnect to ES.

So when you do any one of the three, are you saying there is absolutely no wireshark (network) activity observed when you do any of these? I.e., once it's locked up only a restart allows reconnect and fetch of messages?
Also, need to know what NNTP logging says is happening when this occurs.
There is more information on setting up logging to console here: https://wiki.mozilla.org/MailNews:Logging#Setting_a_Preference

gene smith

Assignee

Comment 43

•

9 months ago

For whatever reason (firewalls, routers, NAT tables), when I go through the cell network to ES:563 the connection goes away (unannounced) after about 5m (it doesn't take 30m like I thought it did). Typically it recovers and the next syn,syn-ack,ack creates a new connection, TLS negotiates, and new messages are detected and fetched. But sometimes the new connection is created but the TLS "Client Hello" fails (ES responds with FIN). I have biff time now at 6m and the same thing happens after 6 more minutes. So far, it has failed like this on 4 biff cycles. Do you see anything like this, where the "Client Hello" sent by TB just gets a FIN response from ES? This seems somewhat like what you report, where nothing new comes in after several biff cycles. I haven't tried to manually do "get messages" or any of your "three" items, just seeing that biff fails to complete the TLS negotiation on several consecutive biff cycles.
In the console logging, each time this occurs the error is reported as NetworkInterruptError Network. I could retry on this error too, but it kind of seems like a problem with ES, maybe.
P/S: on the 7th biff cycle it finally didn't fail on "client hello" and brought in new messages, so not seeing a complete lockout.

David Canzi

Reporter

Comment 44

•

9 months ago

Attached file Packet capture and NNTP logging from test of Thunderbird. — Details

David Canzi

Reporter

Comment 45

•

9 months ago

The build from comment 33 loses its connection to ES at the first
biff, and never again reconnects. The build from comment 25 could
reconnect before, but when I started it up earlier today, and let it
run while I ran errands, it failed the same way as the latest build.

gene smith

Assignee

Comment 46

•

9 months ago

Thanks for the log and captures and notes. I sort of see what's happening and maybe have a fix. On reuse of an idle connection, TB detects a FIN even though not explicitly sent (wireshark claims the "segment not captured" and seems to know it's a FIN from the server). The code currently just prints a log message so nothing useful happens on unexpected FIN. Now I'm treating a FIN on reuse the same as a timeout or RST at reuse and initiate a reconnect and retry of the nntp command.
But since I can't duplicate the server response you see from my location, I can't really test this. Also, I haven't even tried to run it yet, will check tomorrow. It looks like it should work but no guarantee.
Here's the patched 115 again with the latest fix: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/IUZBfUYDR26kTzyORtwJ4Q/runs/0/artifacts/public/build/target.tar.bz2
Full try info: https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=410b8d481cdbd9491598b6d557422fb24af65118&selectedTaskRun=IUZBfUYDR26kTzyORtwJ4Q.0

Didn't see your comment 45 until I tried to post this. So far you have shown me RSTs and now FINs that I don't see here. So maybe this latest patch will fix it completely since it handles timeouts, RSTs and now FINs too, all in the same way.

gene smith

Assignee

Comment 47

•

9 months ago

It looks like it should work but no guarantee.

Actually, not so good. Here's another try still building at this time:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=49257d4ad6038325e0a31d113a68f2008c6bbac9
When complete, click on the green B and look under "Artifacts and Debugging" and you can find the target.tar.bz2 in the list.

David Canzi

Reporter

Comment 48

•

9 months ago

I ran the build from comment 46 today, and the results are
encouraging. At the first biff, Thunderbird requested userid
and password, then reconnected to ES and downloaded
articles. At the second, third, and fourth biffs, Thunderbird
reconnected to ES and downloaded articles without
requesting userid and password.

gene smith

Assignee

Comment 49

•

9 months ago

I wouldn't expect the comment 46 build to work right at all. I'm calling the function _onError() with the wrong parameter type and the JS console shows a Javascript error when I tested it today. But the JS error seems to be non-fatal.
Have you tried the build from comment 47 yet? Here's the link to the build output for comment 47: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/ZU1v0osNTtG7kPxWPgiwWA/runs/0/artifacts/public/build/target.tar.bz2

At the first biff, Thunderbird requested userid
and password, then reconnected to ES and downloaded
articles.

I'm not sure if you mean this prompt is unexpected?

Were you selected on an ES group at the previous shutdown before running the build?
I don't know if you've said, but are you configured to check for new messages at startup?

If either of these are true, on startup, and you don't let TB save your uid/password, you should be prompted immediately for credentials.
Otherwise, the prompt shouldn't occur until the first biff time.

David Canzi

Reporter

Comment 50

•

8 months ago

I have TB configured to check for new articles at startup and
every 35 minutes thereafter. When I ran the comment 46 version,
I was prompted for userid and password at startup and again 35
minutes later. After that it behaved normally, at least as far as
I could tell, which is why I considered the comment 46 version's
results encouraging.

I ran the comment 47 version for part of yesterday. I started
it again early this morning, and let it run while I slept. I've been
checking a few last things about its behaviour. One of these
observations involves waiting for the next biff and observing
what happens when I'm not interacting with Thunderbird. That
final test has just completed, and everything I've seen is normal.

How long does it take for your changes to make it into the
released version?

gene smith

Assignee

Comment 51

•

8 months ago

•

Edited

The way to tell if the change for comment 47 patch is having an effect is to keep the nntp loglevel pref at All and check if you sometime see this: gds: Server FIN'd conn before attempted reuse.. If you don't see that, the fix isn't really having an effect because the network response is now different.

How long does it take for your changes to make it into the
released version?

I have to submit a formal patch and have it approved by Magnus. That usually takes some back and forth since he usually requests some changes. Once approved, it goes into the Daily build and, after it has been there a while with no reported regressions, it goes to Beta. After Beta for a while it goes into an ESR release, which will probably be 128. But maybe even the next ESR. So it's not an instant process unless it's an issue that affects a lot of users. But, AFAIK, there are not many TB users still using NNTP.
Then again, if it's an effective fix with a low chance of regression and needed by users (or a user), an "uplift" can be requested to expedite the process.

David Canzi

Reporter

Comment 52

•

8 months ago

I've had to abandon two news servers in the past two years and
didn't want to have to abandon another. Eternal September
will become usable again and in the meantime I can cope.

gene smith

Assignee

Comment 53

•

8 months ago

Re: Comment by max at Bug 1876261 comment 25:

The difference between providers may be due to shorter life time of records in NAT tables. I expect that enabling keep-alive for TCP sockets may help. Unfortunately I am not familiar with the feature enough to provide more details.

We currently enable tcp keepalives in the imap c++ code. But I've never found out how to enable it in JS based code now used by NNTP, POP3 etc. The documentation says you have to "dispatch to the socket thread" but don't know how to do this.

Comment 54

•

8 months ago

Attached file Bug 1901338 - Ensure NNTP client reacts properly to tcp timeout, RST and FIN. r=mkmelin — Details

These tend to occur when biff interval is longer than the NNTP server's idle
connection timeout.

David Canzi

Reporter

Comment 55

•

8 months ago

and check if you sometime see this: gds: Server FIN'd conn before attempted reuse.

I just checked. I'm seeing this once per biff.

max

Comment 56

•

8 months ago

(In reply to gene smith from comment #53)

We currently enable tcp keepalives in the imap c++ code. But I've never found out how to enable it in JS based code now used by NNTP, POP3 etc. The documentation says you have to "dispatch to the socket thread" but don't know how to do this.

Keep-alive is not always an improvement. E.g., an ssh session may survive a temporary ethernet cable disconnect if no keep-alive packets are sent during this period. On the other hand, it is a must-have if there is a NAT with a not-so-high timeout for wiping connection records.

Thunderbird in this sense is excessively smart since it reacts on network down events. At the same time it does not allow to manually terminate a specific connection, e.g. the one taken from the pool to fetch current message.

The connection timeout may be temporarily increased when some message data has been received and shortened back on completion.

gene smith

Assignee

Comment 57

•

8 months ago

Attached patch keepAlive.diff — Details — Splinter Review

I was never able to figure out how to enable the keepalive in JS code and have it run on socket thread. But, to find out if keepalives have any effect, I just went into the mozilla/netwerk code and forced the enable of keepalives directly in the socket code -- shown in attached diff. (Of course, this is just an experiment and not a proposed solution.)

With this patch in place and by setting network.tcp.keepalive.idle_time to 170 (seconds), a keep-alive packet is sent a bit less than every 3 minutes. With biff time set to 32m, at about the 30m point the server (eternal-september) sends the TLS alert that the connection is going down so TB responds by closing the connection. Without the keepalive enabled, the TLS alert is never received and TB doesn't know the connection is gone/going away.

So having the keepalive enabled definitely helps and, for me, causes the problem I see here to be resolved without the patch at comment 54.

But reporter David C. does not see the same network behavior from his location as I see here. So not sure that the keepalives without the comment 54 patch would fix what he observes.
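
For readers unfamiliar with TCP keepalives, here is a minimal sketch of what enabling them looks like at the socket level. It uses Python's standard `socket` module purely as an illustration; it is not how the attached diff does it in the mozilla/netwerk code. The helper name `enable_keepalive` is an assumption, the 170-second idle time mirrors the pref value used in the experiment above, and `TCP_KEEPIDLE` is only available on some platforms (Linux among them), hence the guard.

```python
import socket


def enable_keepalive(sock, idle_seconds=170):
    """Turn on TCP keepalive probes for an existing socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE (idle time before the first probe) is platform-specific,
    # so only set it where the constant exists.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock


s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
s.close()
```

Keepalive probes are handled by the TCP stack, so they carry no NNTP traffic.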

David Canzi

Reporter

Comment 58

•

8 months ago

But reporter David C. does not see the same network behavior from his location as I see here. So not sure that the keepalives without the comment 54 patch would fix what he observes.

If you give me a test build, I can run it and collect the packet logs.

gene smith

Assignee

Comment 59

•

8 months ago

(In reply to David Canzi from comment #58)

If you give me a test build, I can run it and collect the packet logs.

Hi David,
Thanks for the offer to test this. However, maybe there is a way, but I don't know how to make a try build with modification to the "mozilla" level code. AFAIK, the "try" build only uses a fixed version of the "mozilla" (e.g., firefox) code for building the "comm" (thunderbird) code. So I can tweak the comm code but not the mozilla code when making a try build, and I've never found a way to set a tcp level keepalive using just NNTP javascript code in comm.

However, I'm looking into an alternative solution using an application-level keepalive, i.e., just sending an nntp DATE command periodically as a keepalive. I'm new to the NNTP protocol, but it appears the DATE command is supported on the servers I've tested (gmane, ES and paganini) and is supposed to be supported as long as the READER capability exists. But it really doesn't matter if it is supported or not: even if the server returns a 500 response (not supported), it will still keep the routers and NATs happy and hopefully prevent auto-closing the connection.

So if I can implement the app level keepalive I'll make a try build with that that you can test.
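
An application-level keepalive along these lines could look roughly like the sketch below. It is an illustration only, written against a plain Python socket rather than the TB NNTP code; the function name `send_date_keepalive` is hypothetical. The 111 reply code for DATE comes from RFC 3977, and a 500 reply is treated the same way, since any reply shows the path is still open.

```python
import socket


def send_date_keepalive(sock):
    """Send NNTP DATE and return the single-line reply without the CRLF.

    Any reply (111 with a timestamp, or 500 if DATE is unsupported) means the
    connection is still usable; an empty read means the peer closed it.
    """
    sock.sendall(b"DATE\r\n")
    buf = b""
    while not buf.endswith(b"\r\n"):
        chunk = sock.recv(1024)
        if not chunk:
            raise ConnectionError("server closed the connection")
        buf += chunk
    return buf.decode("ascii", "replace").rstrip("\r\n")


# Local demonstration with a socket pair standing in for client and server.
client, server = socket.socketpair()
server.sendall(b"111 20240701120000\r\n")  # canned reply, made up for the demo
print(send_date_keepalive(client))
client.close()
server.close()
```

For an NNTPS connection the same calls would be made on the TLS-wrapped socket. A periodic timer would call this helper only while the connection is otherwise idle.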

David Canzi

Reporter

Comment 60

•

8 months ago

I've been trying to be kind to news servers. My computer runs
Thunderbird 24/7, and most of the time I'm not interacting with
TB. What I wanted to happen was for Thunderbird to check for
new articles infrequently (currently set to 90 minutes), and not
hold connections open indefinitely. Sending the server DATE
commands at some interval under 30 minutes would keep the
connection open indefinitely, which is what I've been trying
to avoid.

I decided long ago to check for new articles infrequently, to reduce
my use of costly resources that are no longer costly today. TB by
default opens two connections and checks for new articles every
10 minutes, so it's keeping two connections open indefinitely. I'm
not aware of any complaints about this from news server operators.

My problem goes away if I just check for new articles more often.

It still seems to me that an innocent configuration change shouldn't
present an innocent user with baffling failure modes.

Magnus Melin [:mkmelin]

Comment 61

•

8 months ago

(In reply to gene smith from comment #59)

Thanks for the offer to test this. However, maybe there is a way, but I don't know how to make a try build with modification to the "mozilla" level

See https://developer.thunderbird.net/thunderbird-development/fixing-a-bug/try-server#testing-mozilla-central-patches

gene smith

Assignee

Comment 62

•

8 months ago

What I wanted to happen was for Thunderbird to check for
new articles infrequently (currently set to 90 minutes), and not
hold connections open indefinitely.

Guess I didn't realize holding conn open as long as possible was an issue.

Sending the server DATE
commands at some interval under 30 minutes would keep the
connection open indefinitely, which is what I've been trying
to avoid.

Well, my testing with the DATE command sent periodically didn't quite work as I hoped. It would sometimes cause a complete nntp response lockup after a check for new messages. Haven't figured out why. Anyhow, when I think about it, sending DATE periodically is probably not a lot different than just checking for new messages, so probably not a great idea.
A true tcp keepalive wouldn't have this issue and would allow the server to disconnect when conn is idle for the typical 30m.

It still seems to me that an innocent configuration change shouldn't
present an innocent user with baffling failure modes.

I agree. The tb nntp implementation was ported from c++ to js a few years ago. From what you say, apparently there was no problem with a 90m biff time with older c++ nntp versions, e.g., maybe <= tb 91.

Anyhow, my proposed patch has been reviewed by Magnus and it looks like he doesn't like it much. I haven't yet looked at the details of his complaints.

gene smith

Assignee

Comment 63

•

8 months ago

David, Based on Magnus' comment 61 I went ahead and created a TB try build based on the current released NNTP code except for enabling a TCP keepalive. It's based on tb daily release from June 30 and contains a change in the mozilla code to allow enabling keepalive functionality from the main thread that NNTP-JS runs on (comments out assertions in moz code that running on socket thread when enabling it).

Anyhow, with this build, to completely enable keepalive to a useful value, you have to set the pref network.tcp.keepalive.idle_time. I've been testing with a value a bit less than 3 minutes so I set it to 170 (seconds). I think the default of 600 (or 10 minutes) is probably too long since I've seen (from here) connection drops after 3 to 5 minutes when using the cell phone route.

You can keep the other network.tcp.keepalive.* parameters at their default settings.

So I'd be curious to know if tcp keepalives resolves any of the problems you've had with the 115 release code.

Note: I haven't tried it but probably I can make the same change on top of the release 115 or the new release 128. If you prefer not to run a daily release to do this test, let me know if you prefer a patched 115 or 128 and I'll try to do that.

Patched daily release 129.0a1: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/YJ6aFyKEQjej1cwD0nNM0g/runs/0/artifacts/public/build/target.tar.bz2
Try build summary: https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=1d9a95c56d4dd3bc1e0be45ff3c72e89d9d04eab

David Canzi

Reporter

Comment 64

•

8 months ago

I left it running overnight and it was still working when I checked it
after breakfast. I ran several experiments changing the interval
between keepalives and changing the biff interval. It worked every
time and I learned something about how keepalives work.

Attaching a tar file containing packet captures.

David Canzi

Reporter

Comment 65

•

8 months ago

Attached file t.1721324781.tar Packet captures from comment 63's try build. — Details

gene smith

Assignee

Comment 66

•

8 months ago

•

Edited

Attached file 563-gds.tar.gz — Details

Thanks for captures and sorry for the delay in responding.
Looking at your files with wireshark, I'm seeing something I don't understand. I see 3 keepalive cycles spaced 9m 50s apart and then about 20s later TB disconnects and then reconnects. I don't know why TB disconnects (sends FIN) at these points since it looks like the connection is still OK. I don't see the server indicating any problem on the 563 or 119 server captures.
Specifically, I'm referring to the FIN sent by TB at frame 65, 135 for the 563 capture and
frame 32, 69 for the 119 capture.

I've attached what I see for 563 (ES server) using the comment 63 build. I receive a notification via TLS that the server is bringing down the connection after 30m idle (after several keepalive cycles). I don't see that in your 563 file. (Note: the .txt file shows the summary lines using the SSLKEYLOGFILE env. var, so the TLS is decrypted and shows the protocol text and not just "Application Data". However, I had to set "decode as" to IMAP since there seems to be no option in WS to select NNTP decoding. But setting it as IMAP works OK and shows the protocol text correctly.)

By any chance is TB reporting a timeout or other error after being idle for a while?

Just to make sure I referenced the right code in comment 63, please check to make sure you see no "gds:" prefix on any of the console log messages produced when running that build.

gene smith

Assignee

Comment 67

•

8 months ago

•

Edited

Attached file gds-119.pcapng — Details

Here's my port 119 (non-TLS, no login, paganini server) results using the comment 63 try build. ~~Unlike what you see from your location, after the server timeout of 30m, the FIN is sent by the server and not by TB.~~
Edit: Actually, I'm seeing the same thing as David C. here -- see comment 69.

David Canzi

Reporter

Comment 68

•

8 months ago

I haven't looked at your packet captures yet.

I had set the time between keepalives to 9m 30s and the biff
time to 31 minutes. I was trying to get these 4 events to happen
in this order:

1. the last keepalive before the 30-minute timeout
2. the 30-minute timeout
3. the first check for new articles
4. the first keepalive after the 30-minute timeout

I assumed that the period between events 1 and 4 would be the
period between keepalives that I had set. It wasn't. I assumed
that a keepalive is for checking if the connection is still active
on the server end. I thought that when I got those 4 events to
happen in that order, that the check for new articles might
fail. It didn't.
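
The intended ordering can be checked with a little arithmetic. This is only a sketch of the plan, assuming keepalives fire at exact multiples of the configured period counted from the start of the idle period (the captures show this assumption did not hold in practice). The function name and constants are illustrative.

```python
# Planned timeline, in seconds from the start of the idle period.
KEEPALIVE_S = 9 * 60 + 30       # configured keepalive period (570 s)
SERVER_TIMEOUT_S = 30 * 60      # server idle timeout (1800 s)
BIFF_S = 31 * 60                # first check for new articles (1860 s)


def planned_events(keepalive_s, timeout_s, biff_s):
    """Return the four planned events as (time, label), sorted by time."""
    # Last keepalive strictly before the server timeout.
    last_before = ((timeout_s - 1) // keepalive_s) * keepalive_s
    # The next keepalive, assuming the period simply continues.
    first_after = last_before + keepalive_s
    events = [
        (last_before, "last keepalive before timeout"),
        (timeout_s, "server idle timeout"),
        (biff_s, "first check for new articles"),
        (first_after, "first keepalive after timeout"),
    ]
    return sorted(events)


for t, label in planned_events(KEEPALIVE_S, SERVER_TIMEOUT_S, BIFF_S):
    print(f"{t:5d} s  {label}")
```

Under that assumption the events fall at 1710, 1800, 1860 and 2280 seconds, which is the order listed above.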

This is what I think I see in my port 563 capture. At 1803 and 1804
seconds, my computer receives TLS app data from ES, sends TLS
app data back, then both server and client send FIN. To me it
looks like server and client have amicably agreed to close the
connection. It isn't until 1876 seconds that Thunderbird reconnects
to ES.

The last TLS app data packet that arrived is something I have
never seen before, because it was being blocked by something
between my computer and ES.

In the port 119 capture, at 1801 seconds, my computer receives
a FIN from paganini, and sends a FIN back. At 1843 seconds,
Thunderbird opens a new connection to paganini.

The FIN from paganini is something I have never seen before,
because it was previously blocked.

Something between my computer and the news server, that
previously blocked the events I had never seen before, takes
note of the keepalives as evidence that the client is still there,
and keepalive acks as evidence that the server is still there,
and resets the time left on their connection to 30 minutes.

gene smith

Assignee

Comment 69

•

8 months ago

•

Edited

OK, I looked at your 119 file again and I saw it wrong yesterday. It looks just like what I see, i.e., the server sends the disconnect (FIN) at about 30m and TB acks it. Then at about 31m biff kicks in and opens a new connection.

For the 563 file, the "Application data" right after the keepalives I assume must actually be the Alert (Level: Warning, Description: Close Notify) being sent by the server at the 30m point and TB responds with the same thing. So I suppose your version of tshark or wireshark, whichever you are using, is decoding the "alert" message as "Application Data" since the sequence and timing pretty much match what I see, just the descriptive text differs.
Edit: Also, the packet length of your "application data" matches the length of what I see as alerts.
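
One possible explanation for the differing labels, not confirmed in this bug: a plaintext TLS alert record carries content type 21, but in TLS 1.3 an encrypted alert travels inside a record whose outer content type is 23 (application_data), so a capture without the session keys can only label it "Application Data". The sketch below parses the 5-byte TLS record header to show the distinction. The sample bytes are made up for illustration and are not taken from either capture.

```python
import struct

# Content type values from the TLS record layer.
CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}


def parse_record_header(data: bytes):
    """Parse a TLS record header: type (1 byte), version (2), length (2)."""
    if len(data) < 5:
        raise ValueError("a TLS record header is 5 bytes long")
    ctype, version, length = struct.unpack("!BHH", data[:5])
    return CONTENT_TYPES.get(ctype, f"unknown({ctype})"), version, length


# Hypothetical headers: a plaintext alert, and an encrypted TLS 1.3 record.
plaintext_alert = bytes([0x15, 0x03, 0x03, 0x00, 0x02])
encrypted_record = bytes([0x17, 0x03, 0x03, 0x00, 0x13])

print(parse_record_header(plaintext_alert))
print(parse_record_header(encrypted_record))
```

With the keys from SSLKEYLOGFILE loaded, the analyzer can decrypt the second kind of record and report the inner alert instead.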

My understanding of the keepalive is that it doesn't affect the 30m timeout of the server. If no application (NNTP) data is sent for 30m, the server disconnects (sends FIN or TLS alert to TB) regardless of the keepalive period. All the keepalive is doing is telling the intermediate routers/NATs that the connection is still needed and asking them not to drop it. Then, if the connection stays up for the 30m server timeout, the FIN or TLS alert sent back to TB is still seen and an orderly TB reconnection can occur without locking up.
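
For readers unfamiliar with TCP keepalive, here is a minimal sketch of how a client enables it on a socket. This is not Thunderbird's code or the keepAlive.diff patch; the function name and the default values are illustrative, and the `TCP_KEEP*` options are platform-specific (available on Linux), hence the `hasattr` guards.

```python
import socket


def enable_keepalive(sock, idle_s=570, interval_s=30, probes=3):
    """Turn on TCP keepalive probes for an otherwise idle connection."""
    # Portable switch that enables keepalive on the socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first probe is sent.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    # Seconds between unanswered probes.
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    # Unanswered probes allowed before the connection is declared dead.
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

The probes are empty TCP segments handled by the kernel. They carry no NNTP data, which is consistent with the server's application-level idle timer not being reset by them.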

However, I don't think there is a guarantee that the routers/NATs will respect the keepalives and they may just ignore them. So my original fix is still needed along with adding the keepalives.
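
To illustrate the three conditions the fix has to handle (timeout, RST and FIN), here is a rough sketch of how a client can tell them apart on a plain socket. It is not the actual patch; `wait_for_server` is a hypothetical helper, and the demo uses a local socket pair rather than a news server.

```python
import socket


def wait_for_server(sock, timeout_s):
    """Wait for data and classify the outcome as data, fin, reset or timeout."""
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(4096)
    except socket.timeout:
        # Nothing arrived in time; the path may have been silently dropped.
        return "timeout", b""
    except ConnectionResetError:
        # The peer (or a middlebox) answered with RST.
        return "reset", b""
    if data == b"":
        # An empty read means the peer closed its side (FIN).
        return "fin", b""
    return "data", data


# Demo on a local socket pair standing in for client and server.
client, server = socket.socketpair()
print(wait_for_server(client, 0.1))   # server is silent
server.sendall(b"200 ok\r\n")
print(wait_for_server(client, 0.1))   # server sent a line
server.close()
print(wait_for_server(client, 0.1))   # server closed the connection
client.close()
```

In every outcome other than "data", a client would discard the socket and open a fresh connection on the next request instead of continuing to wait on the dead one.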

gene smith

Assignee

Updated

•

8 months ago

Updated

•

8 months ago

Blocks: 1909551

Updated

•

8 months ago

Attachment #9410367 - Attachment description: Bug 1901338 - Ensure NNTP client reacts propertly to tcp timeout, RST and FIN. r=mkmelin → Bug 1901338 - Ensure NNTP client reacts properly to tcp timeout, RST and FIN. r=mkmelin

gene smith

Assignee

Updated

•

8 months ago

Blocks: 1909792

Alfred Peters [:infofrommozilla]

Updated

•

4 months ago

Packet captures from try build using two servers, one NNTP and one NNTPS. 9 months ago David Canzi 100.00 KB, application/octet-stream		Details
2024-06-25.pcaps.tar Packet captures from Thunderbird tests 9 months ago David Canzi 180.00 KB, application/octet-stream		Details
Packet capture and NNTP logging from test of Thunderbird. 9 months ago David Canzi 50.00 KB, application/octet-stream		Details
Bug 1901338 - Ensure NNTP client reacts properly to tcp timeout, RST and FIN. r=mkmelin 8 months ago gene smith 48 bytes, text/x-phabricator-request		Details \| Review
keepAlive.diff 8 months ago gene smith 1.75 KB, patch		Details \| Diff \| Splinter Review
t.1721324781.tar Packet captures from comment 63's try build. 8 months ago David Canzi 80.00 KB, application/octet-stream		Details
563-gds.tar.gz 8 months ago gene smith 79.21 KB, application/gzip		Details
gds-119.pcapng 8 months ago gene smith 22.86 KB, application/octet-stream		Details