Open Bug 1876261 Opened 4 months ago Updated 1 month ago

Fetching of a NNTP message can not recover after network failures

Categories

(MailNews Core :: Networking: NNTP, defect)

Thunderbird 115
defect

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: manikulin, Assigned: gds)

References

Details

Attachments

(1 file, 1 obsolete file)

Steps to reproduce:

It is impossible to read a NNTP message if intermittent network errors happens during its fetching. It is annoying to restart thunderbird in such cases.

  • Subscribe to some newsgroup.
  • Ensure that a connection to the server exists by "Get new messages" ([F5]),
  • Simulate firmware crash in intel WiFi adapter by a rejecting firewall rule
    nft add rule inet filter output tcp dport nntp reject
    
  • To simulate resetting of the WiFi adapter either turn off WiFi or just switch thunderbird into offline mode.
  • Click on some message.
  • Remove the firewall rule. Find its handle and use it in the delete command
    nft -a list chain inet filter output
    nft delete rule inet filter output handle 4
    
  • Enable WiFi or switch thunderbird back to online mode.
  • Click on some other message to confirm that it can be loaded.
  • Click on first message.

Actual results:

Blank body of first message. Headers pane may display values for another message or some reduced set of data.

Expected results:

When fetching of a message fails thunderbird should mark it as not loaded inside internal data structures. The message should stay in the "unread" state. Thunderbird should fetch it on next attempt to view this message.

In my case it may be necessary to reset intel wifi adapter several times a day due to firmware crashes. Similar failures may happen due to other issues with network connection.

Can you attach an nntp log? https://wiki.mozilla.org/MailNews:Logging

Component: Untriaged → Networking: NNTP
Product: Thunderbird → MailNews Core

It could, but it's old and some things got fixed after that.
For this bug, I would have thought bug 1859522 should help some.

See Also: → 1758100

(In reply to Magnus Melin [:mkmelin] from comment #1)

Can you attach an nntp log? https://wiki.mozilla.org/MailNews:Logging

Start thunderbird

23:38:24.996 mailnews.nntp: Connecting to news://news.gmane.io:119 NntpClient.jsm:105:20
23:38:25.153 mailnews.nntp: Connected NntpClient.jsm:150:18
23:38:26.139 mailnews.nntp: S: 200 news.gmane.io InterNetNews NNRP server INN 2.6.3 ready (posting ok)

NntpClient.jsm:175:18
23:38:26.140 mailnews.nntp: C: MODE READER NntpClient.jsm:327:20
23:38:26.236 mailnews.nntp: S: 200 news.gmane.io InterNetNews NNRP server INN 2.6.3 ready (posting ok)

NntpClient.jsm:175:18
23:38:26.236 mailnews.nntp: C: GROUP gmane.emacs.orgmode NntpClient.jsm:327:20
23:38:26.328 mailnews.nntp: S: 211 159313 1 159416 gmane.emacs.orgmode

NntpClient.jsm:175:18
23:38:26.331 mailnews.nntp: Done with status=0 NntpClient.jsm:972:18

[F5] Get messages

23:38:34.841 mailnews.nntp: C: GROUP gmane.emacs.orgmode NntpClient.jsm:327:20
23:38:34.932 mailnews.nntp: S: 211 159313 1 159416 gmane.emacs.orgmode

NntpClient.jsm:175:18
23:38:34.933 mailnews.nntp: Done with status=0 NntpClient.jsm:972:18

Simulate network failure

nft add rule inet filter output tcp dport nntp reject

Click on a message

23:40:00.943 mailnews.nntp: loadMessage news-message://news.gmane.io/gmane.emacs.orgmode#159416 NntpMessageService.jsm:42:18
23:40:00.949 mailnews.nntp: asyncOpen news://news.gmane.io:119/87jzns3xva.fsf%40localhost?group=gmane.emacs.orgmode&key=159416 NntpChannel.jsm:175:18
23:40:00.951 mailnews.nntp: Read from server NntpChannel.jsm:314:18
23:40:00.977 mailnews.nntp: C: ARTICLE 159416 NntpClient.jsm:327:20
23:40:20.968 mailnews.nntp: C: QUIT NntpClient.jsm:327:20
23:40:20.969 mailnews.nntp: Done with status=0 NntpClient.jsm:972:18

Switch to offline

23:40:21.031 mailnews.nntp: Connection closed. NntpClient.jsm:283:18
  • Simulate connections restore
    nft delete rule inet filter output handle 4
    
  • Switch to online
  • Click on another message
23:40:48.384 mailnews.nntp: loadMessage news-message://news.gmane.io/gmane.emacs.orgmode#159415 NntpMessageService.jsm:42:18
23:40:48.387 mailnews.nntp: asyncOpen news://news.gmane.io:119/aabc3205-7d0f-4156-9de2-11aba66242de%40gmail.com?group=gmane.emacs.orgmode&key=159415 NntpChannel.jsm:175:18
23:40:48.387 mailnews.nntp: Read from server NntpChannel.jsm:314:18
23:40:48.398 mailnews.nntp: Connecting to news://news.gmane.io:119 NntpClient.jsm:105:20
23:40:49.544 mailnews.nntp: Connected NntpClient.jsm:150:18
23:40:50.215 mailnews.nntp: S: 200 news.gmane.io InterNetNews NNRP server INN 2.6.3 ready (posting ok)

NntpClient.jsm:175:18
23:40:50.216 mailnews.nntp: C: MODE READER NntpClient.jsm:327:20
23:40:50.315 mailnews.nntp: S: 200 news.gmane.io InterNetNews NNRP server INN 2.6.3 ready (posting ok)

NntpClient.jsm:175:18
23:40:50.315 mailnews.nntp: C: GROUP gmane.emacs.orgmode NntpClient.jsm:327:20
23:40:50.430 mailnews.nntp: S: 211 159313 1 159416 gmane.emacs.orgmode

NntpClient.jsm:175:18
23:40:50.430 mailnews.nntp: C: ARTICLE 159415 NntpClient.jsm:327:20
23:40:50.627 mailnews.nntp: S: 220 159415 <aabc3205-7d0f-4156-9de2-11aba66242de@gmail.com> article

Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail

...

23:40:50.732 mailnews.nntp: S:  esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)

 (Exim 4.90_1) (envelope-from <ypuntot@gmail.com>) id 1rUUJr-0006HH-0N

...

23:40:50.854 mailnews.nntp: S: ler$Worker.runWorker(=

CoroutineScheduler.kt:677) at kotlinx.coroutines.scheduling.CoroutineSchedu=

... full message received

23:40:50.855 mailnews.nntp: Done with status=0 NntpClient.jsm:972:18
23:40:50.864 mailnews.nntp: asyncOpen news://news.gmane.io:119/aabc3205-7d0f-4156-9de2-11aba66242de%40gmail.com?group=gmane.emacs.orgmode&key=159415 NntpChannel.jsm:175:18

Click on first message again

23:40:58.136 mailnews.nntp: loadMessage news-message://news.gmane.io/gmane.emacs.orgmode#159416 NntpMessageService.jsm:42:18
23:40:58.139 mailnews.nntp: asyncOpen news://news.gmane.io:119/87jzns3xva.fsf%40localhost?group=gmane.emacs.orgmode&key=159416 NntpChannel.jsm:175:18

Blank body pane and headers from second message.

To read first message it is necessary to restart thunderbird.

(In reply to Alfred Peters [:infofrommozilla] from comment #2)

Isn't that Bug 1758100?

If I remember it correctly, it was reported for C++ NNTP implementation while JS one is used currently at least by default. I have no idea if it is the same bug from the point of view of developers since I am totally confused by reaction to other bugs:

  • When it was possible to switch between alternative implementations a bug was closed as WORKSFORME (not even WONTFIX) just because the case is handled by the newer implementation despite it was easily reproducible with the older one.
  • A bug was closed because some change had been committed and despite originally reported issue still reproducible.

I just see that rewriting NNTP code has not made it reliable. While I had a recipe how to gently kick the old realtek wifi adapter (scan for other networks by click on the NetworkManager applet), the new intel wifi require more sever on/off measure and thunderbird can not handle such scenario. Actually I suspect more issues with DNS queries left without response.

(In reply to Magnus Melin [:mkmelin] from comment #3)

It could, but it's old and some things got fixed after that.
For this bug, I would have thought bug 1859522 should help some.

It is fixed in 115.4.2 while I have 115.7.0-1~deb12u1 installed.

(In reply to Magnus Melin [:mkmelin] from comment #3)

It could, but it's old and some things got fixed after that.
For this bug, I would have thought bug 1859522 should help some.

This bug is for the case when no error messages is displayed at all (blank body and headers usually left from a previous message). However sometimes thunderbird displays a error message. It may be really confusing, e.g. "Connection refused" (no server is listening on the specified port) instead of "No route to host" (network or remote host is down) if I recall it correctly.

It definitely does what reporter max says using his STR as shown in comment 4.
I used iptables instead to block/unblock the port:

block:   sudo iptables -A INPUT -p tcp --sport nntp -j DROP
unblock: sudo iptables -D INPUT -p tcp --sport nntp -j DROP
Status: UNCONFIRMED → NEW
Ever confirmed: true
Attached patch nntp-fix.diff (obsolete) — Splinter Review

As it is, the cache entry for the message that fails to display is never invalidated so it always comes back as an empty message.
Also, for some reason, there is no tcptimeout as we enable for other protocols. With this present, the cache is also invalidated after a 90 second wait with no response from server.

Here's a preliminary fix.

Assignee: nobody → gds

On Saturday I tried to wait longer when packets are not delivered due to crash of WiFi adapter firmware. A error page appears appears after some period of time and it is possible to fetch message this case when connectivity is restored. So there is some timeout or even handler and it works. However:

  • The button on the error page does not work. To retry it is necessary to click on another message and then back on the initially failed message.
  • If another message is selected while fetching the message is "in progress" (actually stuck) then the error page replaces displayed message that was fetched earlier.

It seems that when off-line mode is selected in Thunderbird or when Thunderbird detects that network is disabled, some actions are not taken to set correct state that the message is not fetched.

Error page should be cancelled when user selects another message.

Currently I have single NNTP connection allowed and offline storage enabled for this account.

Sorry for a typo:
So there is some timeout or event handler and it works.

A error page appears appears after some period of time

Yes, you are correct. I had to wait about 17 minutes to see the error page with a "retry" button.
Here's what I saw without my proposed patch applied:

-- I connect to the server but not yet clicked on a message
-- I set iptables to drop incoming data on port 119
-- Now click on a message. See empty screen.
-- NNTP log shows only "C: ARTICLE 28783" (only this is sent per wireshark)
-- Wireshark shows server responding but responses and retries not seen by TB due to block of port 119
-- After about 12 minutes, server sends first RST and sends more RST about every 2 minutes
-- After about 17 minutes, timeout error reported by mozilla to TB/NNTP code. Message now displays "The connection has timed out" [Several suggestion to fix the problem] [Retry].
-- The error should invalidate the cache entry (TBD).
-- Unblocked port 119.
-- Click [Retry] button.
-- Retry has no effect on display, nothing new in log, nothing occurs in wireshark.
-- Click on another message. Comes in OK and log/wireshark show the activity
-- Click back to original message. It comes in too. So cache entry was invalidated.

FWIW, I also tried this with older version 91 before NNTP was moved to JS. If anything, it's worse and the "Retry" button also has no effect and even clicking other messages doesn't cause a re-connect.

I think I can fix some of this except don't know how to cause that retry button to re-do the URL.

(In reply to gene smith from comment #11)

I think I can fix some of this except don't know how to cause that retry button to re-do the URL.

Great. Certainly the retry button should not be a prioritized task. First of all, thunderbird should not believe that a message is fetched when actually network packets were lost. It would be great if an error page would not replace another message. My guess is that either this._msgWindow should be reset when user clicks on another message or it is necessary to check that the window is displaying the same message before calling alertUser.

Please, check that reaction on switching to off-line mode and to disabling WiFi in NetworkManager is sane.

After skimming through NntpClient.jsm I have a couple of guesses why the retry button may be broken.

  • _onData or _onError does not store article URI in some property of an error page
  • _actionDone clears runningUri when calling _reset, so the error page has no chance to retry anything.

I see the code first time, so my ideas may have no sense.

In the case of WiFi adapter firmware crash, it takes only a few minutes before an error appears.

Hi Max,
Here's a "try" build based on today's "daily" that you can test: https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=8c670ca086b32d93aaa80a093c9d6946c6dbb753

I tested this using a very "static" gmane mailing list (no new message arrived) and I configured with no offline store so every message had to be completely fetched to see it (except for the 2nd and later times when it is in memory cache).

The tcp timeout (time TB waits for a response) is now about 25 seconds. It is actually mailnews.tcptimeout / 4 so it can be increased by increasing mailnew.tcptimeout default 100 sec. (We did a similar method to handle imap timeouts based on mailnew.tcptimeout /5.)

This seems to fix the problems where messages appear blank, due to network or local hardware problem, until TB restart. However, the error notification screen still occurs and "retry" is still useless. So you have to click on another message or a different folder to clear the screen when it occurs (now after about 25 seconds when TB gets no response from the server for whatever reason).

My original fix attempt (attached at comment 8) for this worked OK as long as I stayed on the non-displaying message until the timeout occurs. But if I am impatient and try to open other messages I would again end up with multiple message that would display blank no matter what after network restored. This problem seems to be OK now.

Here's the windows installer: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/PZpj1RORT8yThyv--nIv_w/runs/0/artifacts/public/build/install/sea/target.installer.exe
or for linux, just unpack this and run "thunderbird" executable inside:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/SRE97JZIQVWRCLFsY9Zunw/runs/0/artifacts/public/build/target.tar.bz2

(In reply to gene smith from comment #13)

linux, just unpack this and run "thunderbird" executable inside:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/SRE97JZIQVWRCLFsY9Zunw/runs/0/artifacts/public/build/target.tar.bz2

Isn't the attached patch for ESR? I was a bit surprised that the build is for daily.

It works more reliably. I have not seen a case when a message can not be fetched at all, however after connection is restored, sometimes first click on a message leads to blank message view, so it is necessary to click on another message and then on the failed one. Sometimes data in the headers pane is incomplete, sometimes the header pane disappears and only message body is shown. I am not completely sure, but perhaps there was a case when a message was marked as read despite it was not fetched.

It is unfortunate that an error page may replace a fully available message.

After inserting a drop firewall rule I tried to click on several messages. Sometimes last one was an available one, sometimes a message have not obtained earlier.

What I like is that disconnecting in NetworkManager or enabling off-line mode in Thunderbird does not lead to fetch failure persisting til restart.

The tcp timeout (time TB waits for a response) is now about 25 seconds. It is actually mailnews.tcptimeout / 4 so it can be increased by increasing mailnew.tcptimeout default 100 sec. (We did a similar method to handle imap timeouts based on mailnew.tcptimeout /5.)

I hope, decreased timeout would not affect large attachments requiring enough time on slow connection.

I tried to set mailnews.tcptimeout to 20. Sometimes error page appears quickly, sometimes it requires enough time. So my impression is that it is not always 5 seconds timeout.

Perhaps instead of shorter timeout I would prefer to have an explicit button to cancel connection (e.g. a popup opened on click on the progress bar). Off-line mode button serves for this purpose to some extent.

Than you for working on this issue.

(In reply to max from comment #14)

Isn't the attached patch for ESR? I was a bit surprised that the build is for daily.

From comment 13:

Hi Max,
Here's a "try" build based on today's "daily" that you can test: ...

I can re-do the try based on latest ESR (or even latest beta) if you need it.

It works more reliably. I have not seen a case when a message can not be fetched at all, however after connection is restored, sometimes first click on a message leads to blank message view, so it is necessary to click on another message and then on the failed one. Sometimes data in the headers pane is incomplete, sometimes the header pane disappears and only message body is shown. I am not completely sure, but perhaps there was a case when a message was marked as read despite it was not fetched.

AFAICT, the POP3 NNTP protocol (server or client) has no say in if a message is marked as read. That's something that occurs once the message is stored locally which is handled by "mailbox" local file "protocol". The only changes I made were at the POP3 protocol level.

It is unfortunate that an error page may replace a fully available message.

Yes, I had never seen that page before. I mostly work on imap and it never displays a full page displacing the message when there is a network error. I think it would be better just to pop up an error message like other protocols do that doesn't take over the whole message area.

After inserting a drop firewall rule I tried to click on several messages. Sometimes last one was an available one, sometimes a message have not obtained earlier.

What I like is that disconnecting in NetworkManager or enabling off-line mode in Thunderbird does not lead to fetch failure persisting til restart.

Good, since that's the primary thing I was trying to fix.

The tcp timeout (time TB waits for a response) is now about 25 seconds. It is actually mailnews.tcptimeout / 4 so it can be increased by increasing mailnew.tcptimeout default 100 sec. (We did a similar method to handle imap timeouts based on mailnew.tcptimeout /5.)

I hope, decreased timeout would not affect large attachments requiring enough time on slow connection.

As long as the server response starts within 25 seconds of the request is all that would matter. It shouldn't matter how long it takes to download the huge message/attachment itself.

I tried to set mailnews.tcptimeout to 20. Sometimes error page appears quickly, sometimes it requires enough time. So my impression is that it is not always 5 seconds timeout.

It's not a perfect timer so there will be some variation, but I don't think setting to 25s means it might timeout anywhere from 0 to 25s.

Perhaps instead of shorter timeout I would prefer to have an explicit button to cancel connection (e.g. a popup opened on click on the progress bar). Off-line mode button serves for this purpose to some extent.

Adding a new UI element like a "Stop" button in the progress bar for NNTP (and POP3 and imap?) would require discussion and approval with the UI movers/shakers.
Anyhow my divide by 4 was just a proposal. Maybe divide by 2 (50s timeout) would be better, or maybe just stay with default 100s (no division).

Edit: Above I wrote POP3 regarding messages marked as read. I forgot we are dealing with NNTP here.

(In reply to gene smith from comment #15)

I think it would be better just to pop up an error message like other protocols do that doesn't take over the whole message area.

I think we do want to show the error page instead of a dialog, where possible, as that's much nicer and less disruptive UI. Of course when moving to next msg, any actions for the previous message should get disregarded.

I have not tried the build for my primary profile. My impression is that the change improves behavior significantly. I just had a hope that better management of connection state will eliminate more glitches.

AFAICT, the POP3 NNTP protocol (server or client) has no say in if a message is marked as read. That's something that occurs once the message is stored locally which is handled by "mailbox" local file "protocol". The only changes I made were at the POP3 protocol level.

It is not a regression. Current ESR may mark messages as read despite network failure as well. I am aware that the state is kept locally for NNTP, but it is a must have feature, otherwise it would be impossible to follow active discussions.

As long as the server response starts within 25 seconds of the request is all that would matter. It shouldn't matter how long it takes to download the huge message/attachment itself.

I am unsure if this timeout is applied to connection attempt or to interval of time between request and response for earlier established connection. I am still in doubts if the divider is helpful.

I think we do want to show the error page instead of a dialog, where possible, as that's much nicer and less disruptive UI.

I agree. Perhaps further improvements may be implemented as a fix of another bug (Bug #1758100 may be a candidate). My suggestion is to clean headers of previous message if the new one is not received withing ~2 seconds.

Status: NEW → ASSIGNED

Been over 3 weeks since I looked at this and tried to duplicate the bug using comment 0 steps. It didn't fail. Then I recalled that the steps in comment 0 has an error. Here are the correct steps:

  1. Setup a news (NNTP) group and configure so that no offline storage is used so server is accessed on first read of every article in steps below.
  2. Connect successfully to nntp news group.
  3. Simulate a network failure by dropping incoming data from port 119/nntp: sudo iptables -A INPUT -p tcp --sport nntp -j DROP
  4. Click on a message. Of course, message does not appear.
    Maybe select a few more messages which won't appear
  5. Switch TB to offline (simulates resetting network hardware)
  6. Remove the drop rule (fixes the simulated error): sudo iptables -D INPUT -p tcp --sport nntp -j DROP
  7. Switch TB to online (this could be done after step 4 with same results)
  8. Click on another message not the same as step 3 message(s). Message header and body comes in.
  9. Click again on step 3 message(s). Body(s) displays as blank, header from step 7 remains displayed.

(In reply to Magnus Melin [:mkmelin] from comment #16)

(In reply to gene smith from comment #15)

I think it would be better just to pop up an error message like other protocols do that doesn't take over the whole message area.

I think we do want to show the error page instead of a dialog, where possible, as that's much nicer and less disruptive UI. Of course when moving to next msg, any actions for the previous message should get disregarded.
1859522

I just found that the about:neterror pages were something you fairly recently added to JS nntp implementation at bug 1859522. I tried 91 again and it unreliably/sometimes also puts up the about:neterror page, and the "Try again" button is a noop at 91 too (running c++ implementation). I have no clue how to make the "Try again" button work so I'll file a new bug for that.

Note: With the "Try again" button NOT working, the only way to try again is move to another message and then back to the original original failed message. But if there is only one message (would be pretty unusual I think) the only way to try it again is to open it in a new window or tab.

Attachment #9392990 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: