Closed Bug 1857450 Opened 8 months ago Closed 26 days ago

nntp offline cache corruption

Categories

(MailNews Core :: Networking: NNTP, defect, P1)

Thunderbird 115

Tracking

(thunderbird125 affected)

RESOLVED FIXED
127 Branch
Tracking Status
thunderbird125 --- affected

People

(Reporter: manikulin, Assigned: benc)

References

Details

(Keywords: regression, regressionwindow-wanted, reproducible)

Attachments

(3 files)

Steps to reproduce:

Yesterday I upgraded thunderbird from 102ESR to 115.3.1-1~deb12u1 from Debian bookworm (stable) repository.

Actual results:

Today I noticed mess in a NNTP newsgroup. It looks like concurrent writes to the offline cache file:

News/news.gmane.org/gmane.emacs.orgmode (actually currently the server is news.gmane.io)

From - Thu Oct  5 23:49:55 2023
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
From - Thu Oct  5 23:49:55 2023
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.orgmode,gmane.emacs.devel
Subject: Re: Completion of links to man pages
Date: Thu, 05 Oct 2023 18:52:36 +0300
Approved: news@gmane.org
Message-ID: <838r8h3w3f.fsf@gnu.org>
References: <ufjj01$p58$1@ciao.gmane.io> <87sf6pi9fn.fsf@localhost>

You see that it is a cross-post and before I notice these jammed messages in gmane.emacs.orgmode having offline cache enabled, I noticed no issues with same messages in gmane.emacs.devel where I have not enabled offline cache.

Expected results:

Messages are written to cache strictly sequentially.

In this particular newsgroup I enabled offline storage due to other thunderbird bugs related to opening messages read earlier and to mitigate lags due to intermittent network issues.

Component: Untriaged → Networking: NNTP
Product: Thunderbird → MailNews Core

Concurrent fetching of messages may be a result of other changes in thunderbird: it does not remember last viewed message in the folder, it selects whole thread instead, it downloads messages to display a snippet from each message in the thread summary.

Is there a way to suppress generation of thread summary?

It may confirm that bug #1807063 regression makes the issue more probable.

I don't think it's a concurrent-download issue. There's some rudimentary per-folder locking which seems to work well enough (you can right-click on an offline-enabled folder, choose "Properties", then on the "Synchronization" tab spam the "Download Now" button multiple times to kick off multiple downloads at once, and it all seems to cope ok).

Does it happen on a lot of messages? or did you just notice that one example?

It could be related to the way "X-Mozilla-*" headers are added in (they are slipped into locally-stored messages added to keep track of message flags/tags/etc). The parsing involved is pretty finicky, and easy to confuse, and I'm sure I've seen those symptoms before during my time digging about in mbox code... but I can't remember the details.
Maybe folder compaction? Did you delete any message from that folder?

(In reply to Ben Campbell from comment #3)

Does it happen on a lot of messages? or did you just notice that one example?

I have got more corrupted messages. I still suspect that it is related to fetching of unread (and so not yet downloaded) messages to generate a summary. For obvious reasons I am trying to avoid it. Most of the time I have threads expanded. It is really inconvenient for mailing lists where threads may contain dozens and hundreds of messages. I have not tried to reproduce it with a clean profile. Thunderbird-115 is a pain, but there are other issues as well.

Maybe folder compaction? Did you delete any message from that folder?

It is an NNTP account and I do not delete messages. Offline storage mbox contains ~16k messages now.

The following might make the issue more prominent:

  • Sometimes it takes a few second to resolve server name
  • Firmware of the Intel WiFi adapter crashes, so some network requests get no response, sometimes packets arrive in a bunch. Thunerbird has long-standing issues with intermittent network failures. It was so for old C++ NNTP code, but new JS implementation is affected as well.

There is no problem to reproduce the issue with with a clean thunderbird profile.

  • subscribe to news://news.gmane.io/gmane.emacs.orgmode that has enough long threads,
  • enable offline cache,
  • quickly click on several collapsed threads to initiate fetching of messages for summaries,
  • iterate over messages from these threads.

Some messages contain apparent jam from body and headers of other messages. Alternatively open the file with offline cache and search for the ^From - regexp. Some lines are withing headers of other messages.

grep -C 1 '^From -' /tmp/tb-bug-1857450/News/news.gmane.io/gmane.emacs.orgmode
From - Tue Jan 23 16:51:41 2024
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
From - Tue Jan 23 16:51:41 2024
X-Mozilla-Status: 0001

From - Tue Jan 23 16:52:51 2024
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
From - Tue Jan 23 16:52:51 2024
X-Mozilla-Status: 0001
See Also: → 1872849

Cool, nice and easy to reproduce with the comment 5 steps!

My guess is that it's bad error handling - not rolling back a partially-downloaded message when the download is aborted (by the user clicking on another message).

This bug predates the big mbox overhaul (phew :-). The new mbox code writes out bare "From " separators, without dates or other metadata, while the corruption shown in comment 5 uses the old-style separators (eg "From - Tue Jan 23 16:51:41 2024").

Assignee: nobody → benc
Keywords: reproducible

(In reply to Ben Campbell from comment #6)

My guess is that it's bad error handling - not rolling back a partially-downloaded message when the download is aborted (by the user clicking on another message).

My guess it that it is a concurrent fetching of messages to generate thread summary. I have no idea if it is possible to disable summaries for NNTP as it was in Thunderbird-102, so I can not test it.

Anyway summaries are broken for NNTP. If offline storage is disabled then it means unnecessary server load in the case of mailing lists with long threads. If offline storage is enabled then it is easy to get corrupted messages.

It is painful to not use collapsed threads or to not use threads at all, but it allows at least to read messages.

(In reply to Ben Campbell from comment #6)

My guess is that it's bad error handling - not rolling back a partially-downloaded message when the download is aborted (by the user clicking on another message).

Oh. There's nowhere even attempting to abort message downloading when a new download is kicked off.
When a message is downloaded, the nntp code calls the folder NotifyMessageBegin()/NotifyMessageDownloadLine()/NotifyMessageEnd() functions. But those functions assume there's only ever one message downloaded at once.
When you click multiple messages rapidly, or on a collapsed thread, multiple downloads will be kicked off and calls to those functions will overlap.
This might just be due to the fact that it'll use up to 2 connections by default (I haven't tried restricting it to a single NNTP connection).

I've done a patch to make the message store fail if attempts are made to simultaneous writes to an mbox, which has the effect of making NotifyMessageBegin() et al fail.
But I think there's somewhere where NotifyMessageEnd() is not being called with a failing code, because it can leave an outstanding write upon the messagestore, which leaves the folder "locked".

Magnus: got any suggestions for how the nntp code should work with the folder functions to properly serialise this stuff?

Flags: needinfo?(mkmelin+mozilla)
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Flags: needinfo?(mkmelin+mozilla)

I guess you meant notifyDownloadBegin/notifyDownloadedLine/notifyDownloadEnd.
How nntp does it all seems very risky. Maybe it was ok/needed in the past, idk, nntp used to do things its very own way...

With imap-js it looks like it's better. A this._pendingMessage is kept, and only once the message is complete, the sink will be involved to write it all out at once (https://searchfox.org/comm-central/rev/99dbb845084169619f1c286762b049f14a9e6473/mailnews/imap/src/ImapClient.jsm#1705 it seems). I'd think that makes it pretty robust. We could use the same approach for nntp and drop those line methods from nsIMsgNewsFolder which doesn't seem like it should have any line centric functions.

Severity: -- → S2
Priority: -- → P1

(In reply to Ben Campbell from comment #9)

But I think there's somewhere where NotifyMessageEnd() is not being called with a failing code, because it can leave an outstanding write upon the messagestore, which leaves the folder "locked".

NNTP code fails in various ways in the case of intermittent network issues: Bug #1876261, Bug #1753195. (The latter one was closed being reproducible, maybe now it happens not so frequently, but the real workaround is disabling animation.)

(In reply to Magnus Melin [:mkmelin] from comment #10)

With imap-js it looks like it's better. A this._pendingMessage is kept, and only once the message is complete, the sink will be involved to write it all out at once (https://searchfox.org/comm-central/rev/99dbb845084169619f1c286762b049f14a9e6473/mailnews/imap/src/ImapClient.jsm#1705 it seems). I'd think that makes it pretty robust. We could use the same approach for nntp and drop those line methods from nsIMsgNewsFolder which doesn't seem like it should have any line centric functions.

I think you're right. I'd condense them down to a single "Here's a new message" function.
Strawman arguments against:

  1. There's a very obscure issue of a malicious server sending a never-ending stream of crap until you're out of memory (but I think some arbitrary "bad server! feck off!" cut-off point would be fine).
  2. There could be some performance benefit in streaming to local disk while data is coming in from the network, rather than just dumping it all in one burst at the end? (but on modern hardware.. I doubt it. I think the overwhelming simplicity of delivering a message all at once would totally justify leaving some potential performance on the table).

For now, I think I'll alter my msgStore patch to abort existing writes if a new one is requested. (It's a bit tricky because the rollback info needed is currently stored in the message DB. sigh... but I've been meaning to change that, so maybe now's a good time)

(In reply to Ben Campbell from comment #12)

  1. There's a very obscure issue of a malicious server sending a never-ending stream of crap

Anyway some kind of protection is required otherwise even without concurrent writes single message may fill whole disk.

It's a bit tricky because the rollback info needed is currently stored in the message DB.

Again it is unrelated to concurrent writes. It may be necessary to rollback in the case of network failure before whole message is received.

Some kind of broker is necessary that receives concurrent requests to write data, but serializes them. If no data is received for the current message for some interval of time while other data streams are active then the current message can be put into memory, next rewind offset in the file and start writing another largest message actively receiving data. It is not unique to NNTP.

Attachment #9378440 - Attachment description: WIP: Bug 1857450 - Make nsMsgBrkMBoxStore fail if asked to perform multiple simultaneous writes to same folder. → Bug 1857450 - Ensure nsMsgBrkMBoxStore prevents multiple simultaneous writes to same mbox file. r=#thunderbird-reviewers

Try run here. https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=855e913c4904020b72e068e3720710eb6fea060c

Setting leave-open, as this isn't the whole story. It might help more with Bug 1872849 than with this one.
This just protects the mbox writing (which is still important!). But it still allows interleaved nntp => nsMsgNewsFolder message writes to cause data to end up in the wrong message. So I still need to sort that out.

Keywords: leave-open

Pushed by thunderbird@calypsoblue.org:
https://hg.mozilla.org/comm-central/rev/bbea0bbe317d
Ensure nsMsgBrkMBoxStore prevents multiple simultaneous writes to same mbox file. r=mkmelin

See Also: → 1881121

Magnus: that patch adds some code on the folder side to:

  1. abort the current write if .notifyDownloadBegin() is called while another article is already being written.
  2. make .notifyDownloadedLine() fail if it's a write to a different article.

This does seem to clear up the mbox corruption nicely (disclaimer, just did some quick testing - nothing really intensive).

However, the NntpClient has a tendency to stall if any of the .notifyDownload* calls throw an error. So it's easy to lock it up completely to the point where nothing happens when you click on messages to view them.

I think the NttpClient ._actionArticle() and ._actionArticleResponse() handlers need to handle the failure of .notifyDownloadBegin()/.notifyDownloadedLine() and maybe go into some kind of error state which just soaks up the rest of the ARTICLE response, then goes back into normal operation ready to go. I don't trust my knowledge of the Nttp code enough to implement that without screwing something up.
Do you feel like having a go at this?

Either that or (as you suggested previously) change the api to a single call which collects the whole message and delivers it to the folder in one go. I don't really like this, as I feel we should always be dealing with "streaming messages" - eg so we can parse the message headers into the database as soon as enough is read, same (eventually) with parsing out the mime structure and attachment data. That's not really an issue here, as the message is already in the DB by the time we're downloading the whole thing. But it's the principle of the thing :-) Also, glacially-slow servers and internet connections might be an issue too? (again probably doesn't make things any worse than they already are in this particular case).
Upshot: I'd prefer a proper error-handling state, but I'd live with a one-shot-delivery API for now :-)

In any case one day I'd like to sweep all this cruft away and implement a proper robust protocol->folder syncing mechanism to replace all the ad-hoc shambles we've currently got everywhere (see Bug 1881618). One day.... one day...

Flags: needinfo?(mkmelin+mozilla)
See Also: → 1854670

max,

The patch for this bug is in 125.0b1 (available 2024-03-20/21) and newer. Can you give it a test and post your results?

DL 125.0b4 from https://archive.mozilla.org/pub/thunderbird/releases/125.0b4/linux-x86_64/ or some other source, install it in a NEW program directory, starting it should then create a new Thunderbird data profile where you can create just a news account.

Flags: needinfo?(manikulin)

(In reply to Wayne Mery (:wsmwk) from comment #18)

The patch for this bug is in 125.0b1 (available 2024-03-20/21) and newer. Can you give it a test and post your results?

Linux 125.0b4 (64-bit), a NNTP account.

My impression, it is even worse than in 115 ESR. Perhaps, I am just trying to avoid thread summaries in my primary ESR profile.

  • shredded messages as in 115 ESR
  • sometimes the headers pane is missed for a body (perhaps it is a component of a multipart message that is not multipart-alternative)
  • sometimes the headers pane is displayed without the body

I just enabled off-line storage and clicked on several collapsed threads.

It seems setting mail.operate_on_msgs_in_collapsed_threads to false to avoid the tread summaries anti-feature helps.

Setting mail.server.server2.max_cached_connections to 1 instead of 2 helps after thunderbird restarting, but not immediately.

Flags: needinfo?(manikulin)
See Also: → 1890230

Especially as messages are usually small, just buffer the download in memory and write out on completion.

Flags: needinfo?(mkmelin+mozilla)
Target Milestone: --- → 127 Branch

Pushed by alessandro@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/808b03c44b2b
Write out the whole retrieved nntp article in one go. r=BenC

Status: ASSIGNED → RESOLVED
Closed: 26 days ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: