Closed Bug 1777454 Opened 2 years ago Closed 2 years ago

After compacting, messages are no longer added to the messages file

Categories

(Thunderbird :: Mail Window Front End, defect, P1)

Thunderbird 102

Tracking

(thunderbird_esr102+ fixed, thunderbird102 wontfix, thunderbird103 fixed)

RESOLVED FIXED
104 Branch
Tracking Status
thunderbird_esr102 + fixed
thunderbird102 --- wontfix
thunderbird103 --- fixed

People

(Reporter: bugzilla-account, Assigned: benc)

References

(Blocks 1 open bug)

Details

(Keywords: dataloss, Whiteboard: [fixed for ESR 102.0.3])

User Story

(Find a better formatted version of this user story as originally compiled by :thomas8 in Bug 1777454 Comment 55)

Impact of this bug

- This bug is believed to be *the* or *a* cause of frequently reported message display corruption (involving inbox corruption, potentially dataloss, especially for certain POP constellations):
Bug 1777738 - Inbox corruption in version 102 (IMAP, POP) - wrong message shown (or html of other) - repair folder fixes the issue (expected to be fixed by compact bug 1777454)

STR (tentative)

1. Thunderbird 102.* up to 102.0.2 ("a real issue" fixed for 102.0.3; "there could still be other factors involved..." per BenC's Bug 1777454 Comment 37)
2. POP or IMAP account (per Gene's Bug 1777454 Comment 24)
3. Enable `Allow antivirus clients to quarantine incoming messages` under `Settings > Privacy & Security > Antivirus`.
  Not sure if this is a required condition - probably not, per Benc's  Bug 1777454 Comment 25, but, per BenC's Bug 1777454 Comment 49, "I think quarantining was especially affected, because since 91 I've changed quarantining to be handled inside the mbox...".
4. Restart TB to ensure quarantining takes effect
5. Delete some messages, then right-click on Inbox folder > Compact
6. Receive the first incoming message... (should show OK)
7. Delete some more messages, then right-click on Inbox folder > Compact
8. Receive another incoming message and click to display

Actual result (WIP)

- The first message that comes in shows OK even if messages were previously deleted and the folder compacted.
- After deleting a few more messages and compact and then receive another message into POP3 inbox, I only see it in the list and and the message doesn't display.
- This is suspected to cause the (reversible?) inbox corruption symptoms of bug 1777738 (We should investigate if it's also reversible for all kinds of POP constellations, esp. when messages are already deleted on server).
- Additional messages received also don't display and the new messages are not in the Inbox file.
- If I restart tb, the not displayed messages are still on the list but still don't display in the reading pane. But any new messages received now display OK.
- After restarting TB, repair fixes the problem (probably only if there's no compaction getting in the way before).

Expected result
- Display every message normally
- No inbox corruption symptoms per bug 1777738

Current patch:
> Thunderbird 102.0.3:
> https://hg.mozilla.org/releases/comm-esr102/rev/b063ac02d9c1


Risks of current patch per Andrei's Bug 1777454 Comment 48 :
> Could potentially cause new issues with mbox storage, but the risk that they are worse than the current problem seems relatively low.

Attachments

(2 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0

Steps to reproduce:

Fetch some messages (POP3).
Delete a couple of them.
Compact the folder.

Actual results:

Although fetching more messages for the same account resulted in them showing up in the list, they were not accessible. Looking at the actual messages file, I noticed this had not been updated after the compact and the newly fetched messages had indeed not been added to it.
After restarting Thunderbird, everything worked normally again, only to fail again after the next compact.

Expected results:

The new messages should have been present.

I can't seem to duplicate this with my POP3 account. You mention "couple of them" and "fetching more messages". Do the actual quantities matter?
Tested on self-built and up to date daily and on 102.

Ran some more tests and you actually don't have to delete anything. Just compacting is enough. Compact a folder and the messages file no longer accepts new messages. Quit and start Thunderbird anew and it works again.

By "message file" I assume you are referring to something like <ProfilePath>/Mail/<AccountName>/Inbox ?
For me, new messages get added to this "Inbox" file when I click "Get Messages" for the account. I can also read the new messages regardless of whether anything has been deleted or if "compact" has been done.
How big is your Inbox file? Maybe that has something to do with it.

Yes, that's the one I meant. It's not particularly big, about 400kiB. I usually either move mail to other folders or delete what I don't need to keep.
It happens with every messages file though, not just the Inbox (I have filters that move mail to other folders), but I ran my tests with Inbox.

Blocks: tb102found

I was testing on linux. Tried 102 on win10 and still can't duplicate the problem.

Although fetching more messages for the same account resulted in them showing up in the list, they were not accessible.

Maybe I'm not understanding "not accessible". When you click on the newly arrived message in the list what do you exactly see. Does the wrong message appear? Does an empty space appear in the reading pane?

Another thing that is different with 102 is that the messages now default to threaded mode. I don't suppose that could be what you are seeing and making the message appear "not accessible"? (I doubt it but I'll still ask.)

Edit: I also set up a filter for the pop3 account and still don't see the reported bug.

The "not accessible" bit was my initial observation. Clicking in the list produced an empty reading pane, unless another message was already there - then nothing changed and that other message stayed visible. Double clicking produced an empty window.
Then I started investigating and noticed the messages file had an older timestamp than its corresponding msf file. I put a message in my local dovecot setup, fetched it in Thunderbird and noticed Inbox.msf had been updated but Inbox hadn't. Size and timestamp were unchanged.
After quitting and restarting Thunderbird, fetching a message caused both files to be updated again and restored normal behaviour. Compacting brought the issue back. And so on...

So maybe this report belongs to another component, as the real problem is with adding newly fetched messages to the file.

Small correction: 'that other message stayed visible' should be 'the headers of that other message stayed visible and the pane became blank'.

Been doing some more testing and what appears in the reading pane is random (well, probably dependent on some pointer that points to where the actual (not) stored message is supposed to be). It can be blank or another message or 'somewhere' in the messages file (i.e. in the headers or half way in one message and another).

Also, when a message is fetched and not stored, you get this in the log:

2022-07-01 09:19:44.421459 UTC - [Parent 328031: Main Thread]: I/MsgDB error opening db 80550006

No idea what that 80550006 represents.

I don't know if this has any relevance to the issue, but when I started thunderbird in the terminal after setting up logging, these four lines appeared:

ATTENTION: default value of option mesa_glthread overridden by environment.
ATTENTION: default value of option mesa_glthread overridden by environment.
ATTENTION: default value of option mesa_glthread overridden by environment.
ATTENTION: default value of option mesa_glthread overridden by environment.

Checking my .xsession-errors file, I'd say this has been happening since I upgraded to 102.

Depends on: 1776727

(In reply to FrankZ from comment #8)

2022-07-01 09:19:44.421459 UTC - [Parent 328031: Main Thread]: I/MsgDB error opening db 80550006

This is probably just a general error that the .msf file can't be opened.
You've only mentioned a single pop account having a problem. Is it possible you have lots of other accounts and many folders in the pop account or possibly other accounts (imap of pop)? There was an issue (supposedly fixed in 102, bug 1726319) where the number of open MsgDB files exceeded the OS limit. However, this only seemed to be a issue with Windows.

(In reply to FrankZ from comment #9)

I don't know if this has any relevance to the issue, but when I started thunderbird in the terminal after setting up logging, these four lines appeared:

ATTENTION: default value of option mesa_glthread overridden by environment.

I see these too on linux and still can't duplicate the problem.
Maybe something is corrupted in the MsgDB (.msf file). Maybe if you repair the folder the problem will stop:

Right click folder
Choose properties
Click "repair folder" button

I would have suggested this sooner but better to check other issues first since this pretty much destroys the evidence (if there is any). This still may be worth a try and if too many files open it will still come back.

(In reply to gene smith from comment #10)

(In reply to FrankZ from comment #8)

2022-07-01 09:19:44.421459 UTC - [Parent 328031: Main Thread]: I/MsgDB error opening db 80550006

This is probably just a general error that the .msf file can't be opened.

But the .msf file is the one that gets updated when this happens. It's the file with the messages that doesn't. It's definitely related to the issue, because it never shows up when everything works as it should.

You've only mentioned a single pop account having a problem.

To keep things simple. It happens to every folder that is compacted. I actually noticed it first when I looked at the RSS feeds and no text showed up when I clicked the list.

Is it possible you have lots of other accounts and many folders in the pop account or possibly other accounts (imap of pop)? There was an issue (supposedly fixed in 102, bug 1726319) where the number of open MsgDB files exceeded the OS limit. However, this only seemed to be a issue with Windows.

I do have lots of accounts, but the log indicates only a few files are opened when a message is fetched from a single account.

Maybe something is corrupted in the MsgDB (.msf file). Maybe if you repair the folder the problem will stop:

Right click folder
Choose properties
Click "repair folder" button

I could do that, but since it's happening to every folder that gets compacted I doubt it would help.

LOL! No. It does the exact opposite, i.e. it removes the message that wasn't entered into the messages file from the .msf and the list. And afterwards fetching new messages still doesn't result in proper processing. Only stopping and restarting thunderbird does that.

Forgot to add: too many files open doesn't make sense anyway, because nothing like this ever happens after a restart, before a folder is compacted.

I do have lots of accounts, but the log indicates only a few files are opened when a message is fetched from a single account.

The MsgDB logging is not that great.
How many total folders do you have (approximately)?

If thousands of folders, there still may be a problem in there. Folder DBs start out with few open and more may get opened over time.
Maybe checking with command "lsof" would show something.

Also, make sure you have a folder at the root of your profile called FolderCache.json. It replaces a folder called panacea.dat that was last used with release 91. There is supposed to be an automatic migration from panacea.dat to FolderCache.dat on 91 -> 102 upgrade.
Were you previously using 91 release or any of the interim betas between 91 and 102?

Maybe grasping at straws...

Again, if open files was an issue, why would this only happen after compacting?

The number of folders with message files is about 60. Not all of those receive fetched messages directly though. Some are for archiving.

The file folderCache.json (note the lower case f) exists. I upgraded from 91 to 102, no betas.

(In reply to FrankZ from comment #14)

Again, if open files was an issue, why would this only happen after compacting?

Good point. Like I said, I'm grasping at straws here.

The number of folders with message files is about 60. Not all of those receive fetched messages directly though. Some are for archiving.

That's not a lot. Shouldn't be a problem even if all opened.

The file folderCache.json (note the lower case f) exists. I upgraded from 91 to 102, no betas.

Yes, lower case f is correct. Good to know at least it's there.

Do you have any imap accounts? If so, do you see a similar problem with any of them?

Do you see this on folders under the "Local Folders" account? They are handled very similarly to POP3 folders (both are local). I.e., if you compact a Local Folder and then copy a message from POP3 inbox to the same Local Folder, is the message not accessible?

I've duplicated the problem! (I think.)
I only see the problem if I enable "anti-virus quarantining" under Settings->Privacy & Security->Antivirus.
After I enable this and restart tb just be sure it takes effect, the first message that comes in shows OK even if messages were previously deleted and the folder compacted.
If I delete a few more message and compact and then receive another message into POP3 inbox, I only see it in the list and and the message doesn't display. Additional messages received also don't display and the new messages are not in the Inbox file.
If I restart tb, the not displayed messages are still on the list but still don't display in the reading pane. But any new messages received now display OK.

So if reporter FrankZ has the antivirus quarantining set, then that's most likely the problem.

A similar issue for virus quarantine for imap was fixed in bug 1742975 so I guess that patch didn't cover the POP3 case.

Status: UNCONFIRMED → NEW
Ever confirmed: true
See Also: → 1742975

(In reply to gene smith from comment #16)

I've duplicated the problem! (I think.)
I only see the problem if I enable "anti-virus quarantining" under Settings->Privacy & Security->Antivirus.
After I enable this and restart tb just be sure it takes effect, the first message that comes in shows OK even if messages were previously deleted and the folder compacted.
If I delete a few more message and compact and then receive another message into POP3 inbox, I only see it in the list and and the message doesn't display. Additional messages received also don't display and the new messages are not in the Inbox file.
If I restart tb, the not displayed messages are still on the list but still don't display in the reading pane. But any new messages received now display OK.

So if reporter FrankZ has the antivirus quarantining set, then that's most likely the problem.

A similar issue for virus quarantine for imap was fixed in bug 1742975 so I guess that patch didn't cover the POP3 case.

Is there a test that checks for this situation in xpcshell-test or in mochitest?
That is, a test that enables "anti-virus quarantining" under Settings->Privacy & Security->Antivirus." and do compact and receive a few messages. and see if they are in the message header list.
(Well, I would be able to do this manually, I suppose, but having a test in one of the test suites would be useful.)

Also, it sounds as if some uninitialized variable may play a role here. Maybe I should run valgrind to see if this is the case.

I doubt there are tests since according to Ben C. the quarantine to temp file, if enabled, was originally only for POP3. Ben changed it to apply to all mbox writers from external sources (imap, news and pop3). Maybe the change somehow messed up the implementation for POP3? I looked at his changes but I'm sure Ben can see where it going wrong a lot quicker.
Also, I think the new messages are showing up in the headers stored in .msf file. They are just not getting put into the mbox file for POP3.

Flags: needinfo?(benc)

Yes, that's exactly what's happening and what I've been saying all along...

And I do have "anti-virus quarantining" enabled of course.

And I can confirm that switching it off gets rid of the issue.

(In reply to ISHIKAWA, Chiaki from comment #17)
...

Also, it sounds as if some uninitialized variable may play a role here. Maybe I should run valgrind to see if this is the case.

I tried to run C-C TB (locally compiled full debug version) under valgrind.
Two months ago, I could run it to check the operation of TB after I applied my local patches for
Something has changed.

  • Base OS version, and flags to compile the OS (I use Debian GNU/Linux for development. I tried 5.18.0 and 5.17.x ).
  • Libraries linked with TB.
  • valgrind itself. (I am using git release. valgrind-3.20.0.GIT)
    CC TB + valgrind combination got stack somewhere and won't proceed after showing the folder and message header panes, etc.
    I cannot select a folder to compact. At that time, TB+valgrind does not respond and gets hung. :-(
    Maybe someone with working address-sanitizer version may want to check. But it cannot detect uninitialized variable, only clear memory violation.

(In reply to gene smith from comment #18)

I doubt there are tests since according to Ben C. the quarantine to temp file, if enabled, was originally only for POP3. Ben changed it to apply to all mbox writers from external sources (imap, news and pop3).

There are a few tests for the quarantining, but they're not great.
test_pop3DownloadTempFileHandling.js
test_localToImapFilterQuarantine.js
test_quarantineFilterMove.js

Currently working on expanding this unit testing to exhibit the issue to make it easier to work on (and to act as a canary to warn of future breakage)...

Assignee: nobody → benc
Status: NEW → ASSIGNED
Flags: needinfo?(benc)

If POP3 initial store to temp file (quarantine) in effect and some message deletion and then folder compact has been performed more than once, it appears that the temp file is receiving the new message OK (per added printfs and commenting out delete of temp file after the copy). The problem seems to be the next step where the temp file is closed and then re-opened and copied back into the real mbox file. It appears to go through the motions where all bytes are being written to the real mbox from the temp file with no errors, but nothing new gets appended to the real mbox file. It's almost like the write back to real mbox is going to the wrong place, maybe to old mbox file descriptor that was valid before the compact, or to /dev/null.

Attached patch quar.diffSplinter Review

Not sure why but this change seems to fix it. Added a flush and close on the stream to the real mbox file after the copy from temp is finished.
Also, only tested on POP3 but looks like this bug would affect IMAP too.

(In reply to gene smith from comment #24)

Not sure why but this change seems to fix it. Added a flush and close on the stream to the real mbox file after the copy from temp is finished.
Also, only tested on POP3 but looks like this bug would affect IMAP too.

Ahh, I think I understand what might be happening. The mbox storage has always cached mbox outputstreams, the thinking being that opening/closing them for each message is slow (I don't think this is actually the case - I think Chiaki did some tests and concluded that the OS-level caching means that our caching has no real speed benefit).

So when a mbox compaction takes place, the compactor writes out a new mbox file then replaces the old one.
This means that any cached outputstreams will be pointing at the old file. This would explain why the compaction screws it up.

The effect is masked by a few things. For a start, the outputstream cache entries are automagically discarded if the outputstream was closed.
In normal operation, the outputstream should be closed when the message is finishes writing - when nsIMsgPlugabbleStore.finishNewMessage() is called. But FinishNewMessage() seems to be missing the Close(). Your patch will sort it out for the quarantined case, but doesn't help for non-quarantined.

I'm just working up a patch to ensure the outputstreams are closed properly. Of course it completely removes the point of caching outputstreams, but we were headed that way anyway, so I'll probably snip the caching out in another patch.

Don't know about the Flush() call - I think Close() should be enough. I'm concerned that Flush() ends up performing a whole-system fsync, which would suck. Close() should flush out pending data to the filesystem layer at least.

OK, there's my patch - does it help for you, Gene? (I haven't had a chance to test it yet).
I'll continue on working up a proper unit test for this.

try run here just to see it doesn't break anything else:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=e80bedd5387607ec49fcbab103d005a602f6d5b6

I have not had the time to read the patch fully, but in my proposed patch for buffered output, the FinishNewMessage() and DiscardNewMessage() ALWAYS close the file stream. Just FYI.

I will follow the patches created here and update my local patches for buffered output.

(In reply to Ben Campbell from comment #25)

(In reply to gene smith from comment #24)

Not sure why but this change seems to fix it. Added a flush and close on the stream to the real mbox file after the copy from temp is finished.
Also, only tested on POP3 but looks like this bug would affect IMAP too.

Ahh, I think I understand what might be happening. The mbox storage has always cached mbox outputstreams, the thinking being that opening/closing them for each message is slow (I don't think this is actually the case - I think Chiaki did some tests and concluded that the OS-level caching means that our caching has no real speed benefit).

I think this is also true, but the real improvement would be once the buffered output really begins to work, the reduced number of write system calls far outweighs the few saved open/close calls here and there.

In order to save a photo attachment (say, 1MB or so ) and call write() 125,000 times (due to MIME-encoded 75+ octet lines) is a performance pig.

Aha. You may have noticed that in my proposed patch for buffered output, I had to disable the use of safefilestream, etc.
Because it simply did not work for whatever the reason.
This may be related to the forgotten close(s) in the code.

I will see how the modified code fares better.

So when a mbox compaction takes place, the compactor writes out a new mbox file then replaces the old one.
This means that any cached outputstreams will be pointing at the old file. This would explain why the compaction screws it up.

The effect is masked by a few things. For a start, the outputstream cache entries are automagically discarded if the outputstream was closed.
In normal operation, the outputstream should be closed when the message is finishes writing - when nsIMsgPlugabbleStore.finishNewMessage() is called. But FinishNewMessage() seems to be missing the Close(). Your patch will sort it out for the quarantined case, but doesn't help for non-quarantined.

I'm just working up a patch to ensure the outputstreams are closed properly. Of course it completely removes the point of caching outputstreams, but we were headed that way anyway, so I'll probably snip the caching out in another patch.

Don't know about the Flush() call - I think Close() should be enough. I'm concerned that Flush() ends up performing a whole-system fsync, which would suck. Close() should flush out pending data to the filesystem layer at least.

I am not sure about it. Flush() only writes the in-memory pending buffered data to the output stream.
( I don't think it calls fsync, but I am not sure.)

At least under linux/posix, there is
fsync, and
fdatasync

The latter seems to be a bit more efficient than former. Also, for important data such as user e-mail message, I think the use
of at least fdatasync is warranted. On a busy system, I read sometimes, the in-memory OS buffer may reside unwritten to disk close to 30 seconds or so. I hate to think the system crashes during that time.

Actually, in my proposed patch to enable buffered write, I explicitly invoke Flush() in a few places in order to obtain
fine-granule error information. Instead of obtaining errors at the Close() time altogether, my patch allows us to
check for errors caused by overlowing filesystem, unavailable remote filesystem, etc. at an earlier date/time, but I digress.

FYI, before applying the patch here,
I have a local C-C tree updated a few days ago.
If I apply my local patches to

  • remove reusable flag
  • enabled buffered output properly for pop3 (and local folder copy, I think)
    and create a debug version of C-C TB, then that version does not suffer from this bug.
    So proper flushing and closing in key places seems to solve the issue.

Maybe I should upload the update removable reusable flag now. I will do so in the next few minutes.
(Actually, I wondered whether I should remove the caching for the open stream for POP3, but since the code is NOT that complex, I left it as is.
Maybe I should remove it also.)

(In reply to Ben Campbell from comment #27)

OK, there's my patch - does it help for you, Gene? (I haven't had a chance to test it yet).
I'll continue on working up a proper unit test for this.

try run here just to see it doesn't break anything else:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=e80bedd5387607ec49fcbab103d005a602f6d5b6

I gave your patch a good test and it still works OK with quarantine enabled. Also tested with quarantine off and that works fine too (of course, even without my diff or your patch it worked OK with quarantine off).

I wasn't sure it needed the flush() call either. However, with quarantine on and after initial save to /tmp/newmsg there is a flush and close of that stream so thought maybe flush was needed after write to mbox file finished. But I suspect close is enough.

@Ben Campell

https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=e80bedd5387607ec49fcbab103d005a602f6d5b6

Can you make a try run with Windows, mac and Linux opt builds, and post the relevant artifact URLs which users would install, so that users can test it?

Flags: needinfo?(benc)

Can you make a try run with Windows, mac and Linux opt builds, and post the relevant artifact URLs which users would install, so that users can test it?

I kicked off another try build using Ben's patch. It just started so will be a while before it finishes:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=bdca5b36f835557853c5a655cdf21bc71beaa101
This will produce a daily build for all archs, optimized and in debug mode if anyone want to test it.

Try build is finished. See here for how to download an installer: bug 1777776 comment 38

FrankZ, just to be clear, Wayne's links in comment 34 are direct links to my optimized 64 bit "try" builds of comment 32. Only the target.tar.bz2 is for linux.

(In reply to gene smith from comment #35)

FrankZ, just to be clear, Wayne's links in comment 34 are direct links to my optimized 64 bit "try" builds of comment 32. Only the target.tar.bz2 is for linux.

Yes, I get that.

With "anti-virus quarantining" switched on I see no more issues after compacting.

Flags: needinfo?(bugzilla-account)
Flags: needinfo?(benc)

I think the patch fixes a real issue, so I'm setting it for landing.
There could still be other factors involved, but if we get any more positive feedback I'd say uplift it to 102.x

(In reply to Ben Campbell from comment #37)

I think the patch fixes a real issue, so I'm setting it for landing.
There could still be other factors involved, but if we get any more positive feedback I'd say uplift it to 102.x

Great progress in just a couple days. Thanks!

We won't be doing another build of beta or esr for a couple days. So please request uplift tomorrow with risk assesment, etc

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/1846b7604187
Ensure mbox outputstreams are closed after writing messages. r=gds,darktrojan

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 104 Branch

In bug 1777776 comment 41 and following comments, one user reports that the fix here reliably fixed the corruption for him. Congrats!

(In reply to Rachel Martin from comment #40)

Why is the Close() call here https://hg.mozilla.org/comm-central/rev/1846b7604187#l3.12 necessary when it wasn't necessary in prior versions like 91?
https://searchfox.org/comm-esr91/rev/83b2f2b16fcbed289bfd0d5b09f386245fa5e749/mailnews/local/src/nsMsgBrkMBoxStore.cpp#638-646

I think this is to also close the stream to the real mbox when quarantine is NOT in effect.

From Ben's comment 25:

The effect is masked by a few things. For a start, the outputstream cache entries are automagically discarded if the outputstream was closed.
In normal operation, the outputstream should be closed when the message is finishes writing - when nsIMsgPlugabbleStore.finishNewMessage() is called. But FinishNewMessage() seems to be missing the Close(). Your patch will sort it out for the quarantined case, but doesn't help for non-quarantined.

(In reply to gene smith from comment #42)

(In reply to Rachel Martin from comment #40)

Why is the Close() call here https://hg.mozilla.org/comm-central/rev/1846b7604187#l3.12 necessary when it wasn't necessary in prior versions like 91?
https://searchfox.org/comm-esr91/rev/83b2f2b16fcbed289bfd0d5b09f386245fa5e749/mailnews/local/src/nsMsgBrkMBoxStore.cpp#638-646

I think this is to also close the stream to the real mbox when quarantine is NOT in effect.

From Ben's comment 25:

The effect is masked by a few things. For a start, the outputstream cache entries are automagically discarded if the outputstream was closed.
In normal operation, the outputstream should be closed when the message is finishes writing - when nsIMsgPlugabbleStore.finishNewMessage() is called. But FinishNewMessage() seems to be missing the Close(). Your patch will sort it out for the quarantined case, but doesn't help for non-quarantined.

(In reply to ISHIKAWA, Chiaki from comment #29)

FYI, before applying the patch here,
I have a local C-C tree updated a few days ago.
If I apply my local patches to

  • remove reusable flag
  • enabled buffered output properly for pop3 (and local folder copy, I think)
    and create a debug version of C-C TB, then that version does not suffer from this bug.
    So proper flushing and closing in key places seems to solve the issue.

Maybe I should upload the update removable reusable flag now. I will do so in the next few minutes.
(Actually, I wondered whether I should remove the caching for the open stream for POP3, but since the code is NOT that complex, I left it as is.
Maybe I should remove it also.)

My local patch that did not see the problem reported here contained Close() as in the latest patch Ben posted.
That is why it did not see the problem mentioned here. Hmm...
However, I had to disable the use of safeoutputstream due to misbehaving TB in the face of simulated I/O error.
I will see if this addition of Close() is good to re-enable safeoutputstream.
But (!), I think I will opt out to disable safeoutputstream because TB does not recover from an output error gracefully. (It repeatedly displayed modal warning and did not let me finish TB cleanly at all) But I will discuss this in my buffered output patch.
Just FYI as of today.

See Also: → 1777776

Comment on attachment 9284075 [details]
Bug 1777454 - Ensure mbox outputstreams are closed after writing messages. r=gds

[Approval Request Comment]

User impact if declined:
This patch likely fixes data loss for at least this bug, but potentially some others. It's important we get it out ASAP.

Testing completed (on c-c, etc.):
Risk to taking this patch (and alternatives if risky):
There is some small risk that this patch makes things worse, but it's been tested by a number of people with the issue and confirmed to fix it with no other data loss issues reported.

Attachment #9284075 - Flags: approval-comm-beta?

Comment on attachment 9284075 [details]
Bug 1777454 - Ensure mbox outputstreams are closed after writing messages. r=gds

[Triage Comment]
Approved for beta.

Attachment #9284075 - Flags: approval-comm-beta? → approval-comm-beta+

Comment on attachment 9284075 [details]
Bug 1777454 - Ensure mbox outputstreams are closed after writing messages. r=gds

[Approval Request Comment]

User impact if declined:
Message corruption after compacting continues.

Testing completed (on c-c, etc.):
On c-c and on 103.0b4 since 2022-07-08

Risk to taking this patch (and alternatives if risky):
Could potentially cause new issues with mbox storage, but the risk that they are worse than the current problem seems relatively low.

Attachment #9284075 - Flags: approval-comm-esr102?

(In reply to Rachel Martin from comment #40)

Why is the Close() call here https://hg.mozilla.org/comm-central/rev/1846b7604187#l3.12 necessary when it wasn't necessary in prior versions like 91?
https://searchfox.org/comm-esr91/rev/83b2f2b16fcbed289bfd0d5b09f386245fa5e749/mailnews/local/src/nsMsgBrkMBoxStore.cpp#638-646

Specifically for this bug:
The new Close() is required to invalidate the outputstream cached by the mbox store.
We need to do this, otherwise folder compaction blats over the mbox file, leaving the mbox outputstream cache pointing at the old mbox (which is what caused this bug).
Ideally, folder compaction should be a service provided by the mbox store (it's an mbox-specific operation), so that any such cache can be invalidated, but right it's a separate thing and there's no compaction<->mboxstore coordination.

Previously, this issue was masked because most calls to FinishOutputStream() also called Close() directly on the outputstream.
I think quarantining was especially affected, because since 91 I've changed quarantining to be handled inside the mbox, returning an outputstream pointing to a temp file. So even if code calls Close() as well as FinishedOutputStream(), the underlying mbox stream won't be closed, and the cache will persist beyond the compaction, pointing at the old mbox.

Going forward:

  • For now we should just ditch the unused outputstream caching inside mbox store (it could be reinstated one day in a folder-compation-aware if shown to be a benefit). I didn't do that for this bug just because I wanted to keep the changes as minimal as possible.

In general:
you call .getNewMsgOutputStream() on a store to get a writable stream, write out your message, then call either .DiscardNewMessage() to abort (and roll back) or .FinishOutputStream() to commit your message to the store.
Anything more fancy than that (stream caching/reuse, quarantining, whatever) should be entirely internal to the store.

Blocks: 1778241
See Also: → 1778353
See Also: → 1777738

(In reply to Ben Campbell from comment #49)

Previously, this issue was masked because most calls to FinishOutputStream() also called Close() directly on the outputstream.

Thanks for the reply. You mean FinishNewMessage():
https://searchfox.org/comm-esr102/search?q=FinishNewMessage&path=&case=false&regexp=false

Some callers close the stream, some leave it open and some close it depending on whether its reusable:
https://searchfox.org/comm-esr102/rev/352c49b382c7768a76af75f82d7f756da02c855f/mailnews/local/src/nsParseMailbox.cpp#2243
https://searchfox.org/comm-esr102/rev/352c49b382c7768a76af75f82d7f756da02c855f/mailnews/local/src/nsLocalMailFolder.cpp#2098
https://searchfox.org/comm-esr102/rev/352c49b382c7768a76af75f82d7f756da02c855f/mailnews/import/src/nsBeckyMail.cpp#470
https://searchfox.org/comm-esr102/rev/352c49b382c7768a76af75f82d7f756da02c855f/mailnews/import/src/nsAppleMailImport.cpp#562
Closing the stream "wholesale" now certainly changes the behaviour. We're also not convinced that this is a necessary measure given that the idea of reusable streams was to continue writing to the stream as more messages are added to the mbox. If folder compaction is a special case now, it should close the stream itself. BTW, we can't see folder compaction call FinishNewMessage().

We also noticed that the Close() in nsParseMailbox.cpp was removed now: https://hg.mozilla.org/comm-central/rev/e0eb4bb286d8#l3.94, but that is not in 102. We didn't see any other call sites where a Close() call was removed between 91 and 102.

Comment on attachment 9284075 [details]
Bug 1777454 - Ensure mbox outputstreams are closed after writing messages. r=gds

[Triage Comment]
Approved for esr102

Attachment #9284075 - Flags: approval-comm-esr102? → approval-comm-esr102+
Blocks: 1777738
See Also: 1777738
Whiteboard: [fixed for ESR 102.0.3]

In preparation for testing 102.0.3, I've been trying hard to nail down steps for reproducing this using information distributed across various comments of this bug... here they are!
I succeeded to reproduce once and have one message's display corrupted, showing the date stamp of the previous message, and a cut-off inline-preview of the image attachment of another message. Restart TB and repair folder fixed the problem.

Given that we believe this might fix most of the Inbox corruption issues which are currently reported in bug 1777738, having reproducable steps looks crucial, also with a view on producing an automated test for this bug.


STR (tentative)

  1. Thunderbird 102.* up to 102.0.2 ("a real issue" fixed for 102.0.3; "there could still be other factors involved..." per BenC's Bug 1777454 Comment 37)
  2. POP or IMAP account (per Gene's Bug 1777454 Comment 24)
  3. Enable Allow antivirus clients to quarantine incoming messages under Settings > Privacy & Security > Antivirus.
    Not sure if this is a required condition - probably not, per Benc's Bug 1777454 Comment 25, but, per BenC's Bug 1777454 Comment 49, "I think quarantining was especially affected, because since 91 I've changed quarantining to be handled inside the mbox...".
  4. Restart TB to ensure quarantining takes effect
  5. Delete some messages, then right-click on Inbox folder > Compact
  6. Receive the first incoming message... (should show OK)
  7. Delete some more messages, then right-click on Inbox folder > Compact
  8. Receive another incoming message and click to display

Actual result (WIP)

  • The first message that comes in shows OK even if messages were previously deleted and the folder compacted.
  • After deleting a few more messages and compact and then receive another message into POP3 inbox, I only see it in the list and and the message doesn't display.
  • This is suspected to cause the (reversible?) inbox corruption symptoms of bug 1777738 (We should investigate if it's also reversible for all kinds of POP constellations, esp. when messages are already deleted on server).
  • Additional messages received also don't display and the new messages are not in the Inbox file.
  • If I restart tb, the not displayed messages are still on the list but still don't display in the reading pane. But any new messages received now display OK.
  • After restarting TB, repair fixes the problem (probably only if there's no compaction getting in the way before).

Expected result

  • Display every message normally
  • No inbox corruption symptoms per bug 1777738

Current patch:

Thunderbird 102.0.3:
https://hg.mozilla.org/releases/comm-esr102/rev/b063ac02d9c1

Risks of current patch per Andrei's Bug 1777454 Comment 48 :

Could potentially cause new issues with mbox storage, but the risk that they are worse than the current problem seems relatively low.

User Story: (updated)
User Story: (updated)
See Also: → 1777953
User Story: (updated)

(In reply to b1 from comment #52)

(In reply to Ben Campbell from comment #49)

Previously, this issue was masked because most calls to FinishOutputStream() also called Close() directly on the outputstream.

Thanks for the reply. You mean FinishNewMessage():
https://searchfox.org/comm-esr102/search?q=FinishNewMessage&path=&case=false&regexp=false

Some callers close the stream, some leave it open and some close it depending on whether its reusable:
https://searchfox.org/comm-esr102/rev/352c49b382c7768a76af75f82d7f756da02c855f/mailnews/local/src/nsParseMailbox.cpp#2243
https://searchfox.org/comm-esr102/rev/352c49b382c7768a76af75f82d7f756da02c855f/mailnews/local/src/nsLocalMailFolder.cpp#2098
https://searchfox.org/comm-esr102/rev/352c49b382c7768a76af75f82d7f756da02c855f/mailnews/import/src/nsBeckyMail.cpp#470
https://searchfox.org/comm-esr102/rev/352c49b382c7768a76af75f82d7f756da02c855f/mailnews/import/src/nsAppleMailImport.cpp#562
Closing the stream "wholesale" now certainly changes the behaviour. We're also not convinced that this is a necessary measure given that the idea of reusable streams was to continue writing to the stream as more messages are added to the mbox. If folder compaction is a special case now, it should close the stream itself. BTW, we can't see folder compaction call FinishNewMessage().

We also noticed that the Close() in nsParseMailbox.cpp was removed now: https://hg.mozilla.org/comm-central/rev/e0eb4bb286d8#l3.94, but that is not in 102. We didn't see any other call sites where a Close() call was removed between 91 and 102.

FYI, in my buffered output stream patch which I have been working on (at least for pop3 and local message copy), FinishNewMessage() now closes the passed stream all the time. It makes processing quite uniform. Bug 1242030
Also, the removing reusable flag is also in progress. bug 1773787

The latter should bring uniformity with regard to Close() call because the stream is no longer reusable. (I have found the saving of a few close/open calls here and there is far outweighed by the merit of using buffered output, and the buffered output patch is made way too complex if reusable flag and its scheme exists in the code. Thus it was deemed well remove reusable flag and its mechanism.)

Our concern is that by always closing the stream in FinishNewMessage(), the overall behaviour is changed since other parts of the code expect that function not to close reusable streams. It also leads to multiple closing of the same stream (which is not detected since there is no error handling on the close calls). Once the concept of a reusable stream is removed, it becomes a different story.

I have smoketested TB 102.0.3 candidate build today (not yet released), and using my steps from Comment 55, I have not been able to reproduce the inbox display corruption.

It would be great if users affected by this bug 1777454 or bug 1777738 respectively could try the following:

  1. On TB 102.0.2 (affected), execute STR of comment 55, which should cause message display corruption on the last received message (at least).
  2. On TB 102.0.3 candidate build, execute STR of comment 55 again (try to do it in exactly the same way) - which should no longer cause message display corruption.
  3. Let us know SUCCESS if you're seeing the problem in 102.0.2, but no longer seeing it in 102.0.3.
    Otherwise report FAIL if you cannot reproduce with comment 55 in 102.0.2 OR if you can still reproduce in 102.0.3, and say where/in which way it failed. Keep your comment as short as possible.

(In reply to b1 from comment #52)

(In reply to Ben Campbell from comment #49)

Previously, this issue was masked because most calls to FinishOutputStream() also called Close() directly on the outputstream.
Thanks for the reply. You mean FinishNewMessage():

Oops - yes, I meant FinishNewMessage()!

Closing the stream "wholesale" now certainly changes the behaviour. We're also not convinced that this is a necessary measure given that the idea of reusable streams was to continue writing to the stream as more messages are added to the mbox.

The ultimate aim is to remove the mbox-isms which are currently scattered all about the code and confine them all inside nsMsgBrkMBoxStore. To do this properly, getNewMsgOutputStream() (and getMsgInputStream()) will need to return mbox-aware stream objects which handle "From "-separators and "From "-quoting in the message body.

The mbox code could still potentially cache an open mbox filehandle internally, but that's an implementation issue rather than something that should leak out through the API.

Even with a raw mbox stream, the reusable flag in the API shouldn't really be required if the stream is handed back via finishNewMessage()/discardNewMessage() - They can call Close() internally if they desire, or keep it open ready for the next getNewMsgOutputStream().

So, yes - it is a bit of a roundabout way of disabling the caching, where "proper" solution would be to have the folder compaction coordinate better with the mailstore. But we needed a quick fix, and the "hacky" solution helps move forward the general de-mbox-ification by making the mailstore more responsible for creating/closing/cleaning up message streams, rather that placing responsiblity upon the calling code.

If folder compaction is a special case now, it should close the stream itself. BTW, we can't see folder compaction call FinishNewMessage().

Folder compaction bypasses the mail store completely and just opens/closes the stream itself. That's not too unreasonable, but really it's an mbox-specific operation, so should eventually be hidden behind (and coordinate with) the nsMsgBrkMBoxStore implementation.
Folder compaction could do with a good cleanup - there are currently two implementations: one for local folders and one for imap. These should really be unified.

See Also: → 1770373

Thanks Thomas for the analysis. And thanks Benc and others for finishing this off.

(In reply to Thomas D. (:thomas8) from comment #55)

In preparation for testing 102.0.3, I've been trying hard to nail down steps for reproducing this using information distributed across various comments of this bug... here they are!
...
Actual result (WIP)

  • The first message that comes in shows OK even if messages were previously deleted and the folder compacted.
  • After deleting a few more messages and compact and then receive another message into POP3 inbox, I only see it in the list and and the message doesn't display.
  • This is suspected to cause the (reversible?) inbox corruption symptoms of bug 1777738 (We should investigate if it's also reversible for all kinds of POP constellations, esp. when messages are already deleted on server).
  • Additional messages received also don't display and the new messages are not in the Inbox file.
  • If I restart tb, the not displayed messages are still on the list but still don't display in the reading pane. But any new messages received now display OK.
  • After restarting TB, repair fixes the problem (probably only if there's no compaction getting in the way before).

Yeah, compact on startup - Bug 1719072 - Compact should never run on (or near) startup - but not for those who have disabled it. And perhaps not those for whom it has been failing - Bug 1260698 - Compact folders - shows wrong estimated and compacted size. Compacts too often. totalExpungedBytes number is wrong. (which has a number of duplicates)

We can be partly guided user feedback in this bug and bug 1777738 as to how effective the patch is and whether their data is "restored" - three users have reported success in bug 1777738 - but ...

What is our wisdom about what a user might need to do when faced with the symptoms?
Do we expect a repair will be required for most or all users?
Do we expect repair to be reliable?
Do we know of, or expect, situations where the damage is not reversible?
Are there situations where compact is required instead of repair?

Flags: needinfo?(geoff)
Flags: needinfo?(gds)
Flags: needinfo?(bugzilla2007)

I know nothing.

Flags: needinfo?(geoff)
User Story: (updated)
Flags: needinfo?(bugzilla2007)

(In reply to Wayne Mery (:wsmwk) from comment #60)

TL;DR - I do not think we have secured wisdom on any of this, nor if bug 1777738 is really fixed by this bug 1777454 or not.

Thanks Thomas for the analysis. And thanks Benc and others for finishing this off.
**What is our wisdom about what a user might need to do when faced with the symptoms?

Looks like in many cases, repair does the trick. Sometimes compact may be needed, too.

Do we expect a repair will be required for most or all users?

Well, if we believe that things (this bug 1777454 and bug 1777738 corruption) are fixed for 102.0.3 / 102.1.0 (which we aren't really sure of), then hopefully there would be nothing to repair for users updating directly from 91 to those versions of 102.

Do we expect repair to be reliable?

I think we have only one or two reports of repair fails, so usually yes - with a caveat below (for POP with server deletion).

Do we know of, or expect, situations where the damage is not reversible?

If we believe that the corruption can also affect POP bugs (which is just a theory so far) AND IF the problems affect the actual folder mbox files (as opposed to just the MSF files), then YES (not reversible may occur), IF messages on server are already deleted.

However, after spending much time on testing and following STR of my comment 55 religiously, I have failed to reproduce the problem on a POP account on 102.0 release (with messages on server deleted on download).
Which could mean that my previous successful reproduction was just a one-time luck (or something else), or that it doesn't reproduce on POP.
It's quite hard to tell.

Are there situations where compact is required instead of repair?**

Can't tell - I'd say any combination may be needed between compact and repair.
I repeat - am not seeing any secured wisdom on this.

(In reply to b1 from comment #57)

Our concern is that by always closing the stream in FinishNewMessage(), the overall behaviour is changed since other parts of the code expect that function not to close reusable streams. It also leads to multiple closing of the same stream (which is not detected since there is no error handling on the close calls). Once the concept of a reusable stream is removed, it becomes a different story.

FYI, I would like to add that my patch set for buffered write adds checks to |Close()| calls because once the buffered write is enabled, the write error may be delayed until the final |Close()| call that may flush the pending data in the write buffer to external media/network.
(That is Bug 1242030 )

Flags: needinfo?(gds)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: