compacting drafts folder immediately after saving message causes corruption - STR in comment 21, comment 24
Categories
(MailNews Core :: General, defect, P1)
Tracking
(thunderbird_esr128 affected, thunderbird131 wontfix, thunderbird132 affected, thunderbird133 affected)
People
(Reporter: 52qtuqm9, Assigned: benc)
References
(Blocks 2 open bugs, Regression)
Details
(Keywords: dataloss, leave-open, regression)
Attachments
(1 file)
My add-on, Send Later, calls await messenger.compose.saveMessage
to save the message in the current compose window. Then immediately after that an experiment function compacts the drafts folder:
async compactFolder({ accountId, path }) {
let uri = SendLaterFunctions.folderPathToURI(accountId, path);
let folder = MailServices.folderLookup.getFolderForURL(uri);
let msgWindow = Cc["@mozilla.org/messenger/msgwindow;1"]
.createInstance()
.QueryInterface(Ci.nsIMsgWindow);
folder.compact(null, msgWindow);
},
I have a user on macOS for whom the result of this is that the locally stored copy of the message that was just saved is corrupted and ends up empty, i.e., when you click on it the message pane shows nothing, and when you View Source on it the window that opens up has nothing in it.
Repairing the folder brings back the correct content for the saved draft.
More bizarre: if the user schedules a message with Send Later and causes the problem described above, and then schedules a second message with send later, the second message has the correct contents and the first message suddenly looks like a duplicate of the second message.
If I put a breakpoint in the debugger right before the folder is compacted, the problem goes away, i.e., there appears to be a race condition here which is only triggered when the compact happens immediately after the draft is saved.
I am unable to duplicate this myself, but I did a screen-share with the user in question and confirmed that everything I wrote above is indeed happening on his computer.
He says he does not see the problem in the current TB129 beta.
This is a data-loss issue because when Send Later subsequently goes to send the messages that get corrupted locally this way, they don't send properly, but Send Later thinks they did, so it deletes them, i.e., they're entirely lost.
I'm straining my brain and can't figure out a way to reliably work around this issue in Send Later.
It would seem that there is a fix in TB129 that needs to be backported to TB128.
Reporter | ||
Comment 1•4 months ago
|
||
Note: Send Later has nearly 125,000 users. Even if this is a mac-only issue (which is not clear), the add-on has at least 10,000 mac users, so the risk of other people besides the user who noticed the issue getting bitten by this is high.
Updated•4 months ago
|
Comment 2•4 months ago
|
||
vlad, ramona can you reproduce?
Reporter | ||
Comment 3•4 months ago
|
||
It's going to be really hard to reproduce this.
I can't reproduce it on my MacBook, or on any of my Linux computers.
Whereas the user who is running into this issue is running into it on two different MacBooks. sigh
Assignee | ||
Comment 4•4 months ago
|
||
As a workaround, could you just not force a folder compaction?
Other thoughts:
Your compactFolder()
function will likely return while a compact is on-going (actually, I think it may even return before the compact actually starts).
In:
folder.compact(null, msgWindow);
If you pass a nsIURLListener
instead of null
as the first parameter, you can wait until the onStopRunningUrl()
callback has been called.
PromiseUrlListener
in the test code gives an example of using await
on the callback. I don't know if there's an equivalent anywhere in non-test code.
I don't know it'll help, but waiting for the compact to finish probably seems a little more "correct".
Also, I don't know too much about the GUI side of things (so forgive this stupid question!), but is your function creating a new msgWindow
or just getting hold of the existing one?
As to what's changed since 128, I'm really not too sure... there was some locking code added to the mbox to prevent interleaved writes, along with some more checking in the IMAP code... but I thought those had already been applied to 128. And I don't know how relevant it'd be to this issue anyway...
Oh - might be worth checking your user is in fact using mbox and not maildir.
Last thing: compaction logging can be turned on with MOZ_LOG="compact:3" (3 for an overview, 5 to see the compaction decisions for every message in the mbox).
Reporter | ||
Comment 5•4 months ago
|
||
(In reply to Ben Campbell from comment #4)
As a workaround, could you just not force a folder compaction?
No. The compact itself is a workaround for the fact that Gmail doesn't fully delete drafts when they're replaced, which means that if Send Later doesn't compact the folder after saving the draft then Send Later ends up seeing phantom drafts that cause errors and/or duplicate scheduled message transmissions.
Other thoughts:
YourcompactFolder()
function will likely return while a compact is on-going (actually, I think it may even return before the compact actually starts).
This is not a problem and not what is causing the issue.
I observed quite clearly that if I put a breakpoint immediately before the compact and paused briefly before continuing execution then the corruption did not occur, but if I put the breakpoint immediately after the compact and paused briefly the corruption did occur. It is the compact immediately after the save that is causing the corruption, so waiting for the compact to complete won't help.
More evidence of this: if I throw the compact into a setTimeout call with a 1-second delay, the problem goes away.
I don't know it'll help, but waiting for the compact to finish probably seems a little more "correct".
It shouldn't matter. If it does than there's a bug in Thunderbird. But, again, doing this will not solve the problem.
Also, I don't know too much about the GUI side of things (so forgive this stupid question!), but is your function creating a new
msgWindow
or just getting hold of the existing one?
Not creating a new window.
Oh - might be worth checking your user is in fact using mbox and not maildir.
They are.
Comment 6•4 months ago
•
|
||
Edit: Found the github issue and asked the user directly
Comment 7•4 months ago
|
||
In the tests, I do this to force a sync with the server:
https://searchfox.org/comm-central/rev/2560a83e426451a16357264032a95790624d1b10/mail/components/extensions/test/xpcshell/test_ext_messages_move_copy_delete.js#657-668
Maybe that could help, instead of compacting?
Comment 8•4 months ago
|
||
Hello,
I have tried to reproduce this issue using macOS 10.15, 11 , 12, 13 and 14 , Windows 11 and Ubuntu 24 , using the STR from the Description but did not manage to reproduce it on 128.0.1esr(20240717233102), 115.13.0(20240715172621), nor 130.0a1(20240724103108).
(In reply to Vlad Lucaci, Thunderbird QA (:vlucaci) from comment #8)
Hello,
I have tried to reproduce this issue using macOS 10.15, 11 , 12, 13 and 14 , Windows 11 and Ubuntu 24 , using the STR from the Description but did not manage to reproduce it on 128.0.1esr(20240717233102), 115.13.0(20240715172621), nor 130.0a1(20240724103108).
I can reliably reproduce it on all of my Macs with 128.0 and 128.0.1, but not with any v115 nor any v129beta (haven't tried v130 yet) with either a Fastmail IMAP connection or a Gmail (standard, OAuth) connection. My Macs are all running macOS 14.5. Perhaps I am running something else that also impacts this, though there were no other Thunderbird Plugins in the blank/default test profiles we created.
In thinking about this, I can't begin to imagine what else I could be running that would cause this kind of thing. When I get home I'll test it with a blank/fresh user account and see what happens.
Comment 10•4 months ago
|
||
Ok. New, fresh data.
This bug is reproducible (by me, anyway) on both 128.0.1esr as well as 129.0b5. I couldn't find an installer for 115 with which to test, though.
The trick to making it happen is to install Send Later plugin (10.5.1), go into Send Later preferences, and uncheck the "Mark scheduled drafts as read" box. Then schedule a message to be sent later and, boom, it's corrupted. Leaving it as the default with that box checked seems to keep the message uncorrupted.
I tested this in a brand new user account on my Mac, and with brand new profiles for Thunderbird.
Updated•3 months ago
|
Comment 11•3 months ago
|
||
This sounds like the MSF is corrupt, not the actual message in the MBOX file. Anyone had a look at the actual data in the mbox for corruption?
Is there any correlation between antivirus use/brand and being able to reproduce?
Comment 12•3 months ago
|
||
(In reply to Matt from comment #11)
Is there any correlation between antivirus use/brand and being able to reproduce?
I'm on a Mac and not running any antivirus software (nor can I think of anything else that I'm running that would be watching/intercepting access to files).
Reporter | ||
Comment 13•3 months ago
|
||
sigh I now have a report of this happening on Windows 10, not just on macOS. The workaround implemented in v10.5.2 of Send Later isn't working for them, and unchecking the preference to mark scheduled drafts as read doesn't fix the problem either.
I'm at a loss here, all I can think to do at this point is to make the delay before compact longer than 1 second and see if that helps.
Reporter | ||
Updated•3 months ago
|
Comment 14•3 months ago
|
||
Vlad, can you re-test with the additional information from comment 10?
Comment 15•2 months ago
•
|
||
Hello,
Managed to reproduce this issue only once on 129.0b5(20240725052412) using macOS 14.6.1 using the additional information from comment 10.
Also did not manage to reproduce it on 130.0b3(20240826151843) and 128.1.1esr(20240820161839).
However once I have tried to search for a regression range for this issue(went as far back as 2024-01-15, I could no longer reproduce it. Further attempts to reproduce it again, even with the affected (129.0b5) build and clean profiles were not successful. It is worth mentioning that the version used for Send Later plugin was 10.5.2.
Comment 16•2 months ago
|
||
(In reply to Jonathan Kamens from comment #0)
I am unable to duplicate this myself, but I did a screen-share with the user in question and confirmed that everything I wrote above is indeed happening on his computer.
He says he does not see the problem in the current TB129 beta.
Has the user seen this in the last 2-3 months with a newer beta?
Reporter | ||
Comment 17•2 months ago
|
||
N.B. I believe I just saw this folder corruption, or something very similar, happen in a way that had nothing to do with Send Later.
Immediately after I was upgraded automatically to 131.0b3, the message at the top of my inbox was empty. I.e., when I clicked on the message it displayed a completely blank preview pane, and when I hit Ctrl-U the window that popped up was empty.
Restarting Thunderbird didn't make the problem go away.
I had to repair the folder to be able to view the contents of the message.
Comment 18•2 months ago
|
||
(In reply to Jonathan Kamens from comment #17)
N.B. I believe I just saw this folder corruption, or something very similar, happen in a way that had nothing to do with Send Later.
Immediately after I was upgraded automatically to 131.0b3, the message at the top of my inbox was empty. I.e., when I clicked on the message it displayed a completely blank preview pane, and when I hit Ctrl-U the window that popped up was empty.
Restarting Thunderbird didn't make the problem go away.
I had to repair the folder to be able to view the contents of the message.
I also see this kind of message corruption not-irregularly, especially in my Archive folder. Twice per month I have to repair the folder to be able to properly view the contents of a message.
Reporter | ||
Comment 19•2 months ago
|
||
I am now able to reproduce the Send Later drafts folder corruption pretty reliably with 131.0b4.
As Dave mentioned earlier in the thread, it is much easier to reproduce when you uncheck the "Mark scheduled drafts as read" Send Later preference. I don't know why.
Interesting data: I emptied and repaired my drafts folder so that the Drafts file underneath ImapMail was completely empty. I then scheduled four messages in a row with Send Later. The first one was corrupt, and the drafts file underneath ImapMail remained empty, After scheduling the second message, that one did show up in ImapMail/Drafts. The third one did not show up there, and the fourth one did show up there. So there ended up being two messages in ImapMail/Drafts when there should have been four.
I have a suspicion that this has something to do with the code that was added recently to stash a copy of the message into Drafts with a fake IMAP ID, then download the actual message from the server and replace the local copy and fake IMAP ID with the copy and ID downloaded from the server. This is just a guess, though, I can't point to anything specific which proves it. It's just that this seems like the most recent significant change to this part of the code.
Version 10.5.6 of Send Later has a new attempted workaround for this issue which appears to work (at least for me; I can't speak to whether it will work for everyone experiencing this issue): it waits until the Drafts and Drafts.msf files have been idle (i.e., not modified) for five seconds before compacting the Drafts folder.
Comment 20•2 months ago
|
||
(In reply to Jonathan Kamens from comment #19)
I am now able to reproduce the Send Later drafts folder corruption pretty reliably with 131.0b4.
Can you produce a stripped-down version of the add-on where you press a button or do some other simple action and get to the corruption?
Reporter | ||
Comment 21•2 months ago
•
|
||
str |
Can you produce a stripped-down version of the add-on where you press a button or do some other simple action and get to the corruption?
I do not have time to do this. I and others have spent a great deal of time figuring out how to work around this TB bug and collecting detailed information about it to help the TB developers find and fix the bug. IMO this additional request on top of all of that is unreasonable.
Detailed reproduction steps have already been provided. Here are the most basic repro steps I can come up with:
- Install Send Later version 10.5.1.
- Open the Send Later preferences and disable "Mark scheduled drafts as read".
- Open a compose window for an IMAP account with the Drafts folder stored on the server.
- Enter a recipient.
- Enter a subject.
- Click the Send Later button (or tab into the message body and type Ctrl-Shift-Enter).
- Hit enter to accept the default scheduled send time (5 minutes from now).
- Click on the scheduled message in the Drafts folder and it should say "This is a draft message." If it doesn't, the local copy of the folder is corrupt.
If it doesn't repro the first time, try it several times in a row with different subject lines so you can distinguish between the message in your drafts folder.
If you are unable to repro it after several times, try attaching a large attachment to the message between steps 5 and 6. I have a theory that the problem may happen more often with larger messages.
Comment 22•2 months ago
|
||
(In reply to Jonathan Kamens from comment #17)
N.B. I believe I just saw this folder corruption, or something very similar, happen in a way that had nothing to do with Send Later.
Immediately after I was upgraded automatically to 131.0b3, the message at the top of my inbox was empty. I.e., when I clicked on the message it displayed a completely blank preview pane, and when I hit Ctrl-U the window that popped up was empty.
I'm sorry, but what is "Ctrl-U"? I thought that was underline. I tried looking it up, and that was all that came back.
Comment 23•2 months ago
|
||
(In reply to Worcester12345 from comment #22)
(In reply to Jonathan Kamens from comment #17)
N.B. I believe I just saw this folder corruption, or something very similar, happen in a way that had nothing to do with Send Later.
Immediately after I was upgraded automatically to 131.0b3, the message at the top of my inbox was empty. I.e., when I clicked on the message it displayed a completely blank preview pane, and when I hit Ctrl-U the window that popped up was empty.
I'm sorry, but what is "Ctrl-U"? I thought that was underline. I tried looking it up, and that was all that came back.
I believe that's the shortcut for View > Message Source (just to see the raw source of things to confirm that it's blank and not some display error).
Comment 24•2 months ago
|
||
str |
I reproduced the issue on Windows. I wasn't able to reproduce on a O365-based account, neither on a Fastmail account, but a more "generic" IMAP account on a local hoster did the trick. This issue happened on the fourth message, regardless whether the message contained a single word or a large attachment. In another test cycle it happened on the 8th message. Since this appears to be some soft of timing issue, it likely depends on the speed of the IMAP server and connection.
Reproduced with 128 and 131 beta. To speed up testing, once you have a non-corrupt draft saved, you can do another iteration in four clicks: Click the draft, Ctrl+E (edit as new message), click "Send Later", click on the "Send Later" panel to send in 5 minutes.
Comment 25•2 months ago
|
||
It also reproduces in TB 126 beta, so not caused by bug 1890448. Looks like the IMAP folder corruption from bug 1890230. I also had a case where the message showed the MIME boundary.
Using the add-on (semi-)reliably creates the IMAP folder corruption. The add-on puts a new message into the drafts folder, which is async, and then calls compact (old or new) on the folder, so those two operations likely run interleaved and cause the issue. Just a hypothesis, of course.
Comment 26•2 months ago
|
||
From comment 19:
I have a suspicion that this has something to do with the code that was added recently to stash a copy of the message into Drafts with a fake IMAP ID, then download the actual message from the server and replace the local copy and fake IMAP ID with the copy and ID downloaded from the server. This is just a guess, though, I can't point to anything specific which proves it. It's just that this seems like the most recent significant change to this part of the code.
This doesn't sound like something I've worked on regarding IMAP. Don't know what it means by imap ID (maybe UID) or how it can be "fake".
From comment 25:
Using the add-on (semi-)reliably creates the IMAP folder corruption. The add-on puts a new message into the drafts folder, which is async, and then calls compact (old or new) on the folder, so those two operations likely run interleaved and cause the issue. Just a hypothesis, of course.
May be better to just force an imap expunge of drafts so it doesn't directly cause a compact of mbox. Not sure if that's possible from addon code.
Reporter | ||
Comment 27•2 months ago
•
|
||
(In reply to gene smith from comment #26)
From comment 19:
I have a suspicion that this has something to do with the code that was added recently to stash a copy of the message into Drafts with a fake IMAP ID, then download the actual message from the server and replace the local copy and fake IMAP ID with the copy and ID downloaded from the server. This is just a guess, though, I can't point to anything specific which proves it. It's just that this seems like the most recent significant change to this part of the code.
This doesn't sound like something I've worked on regarding IMAP. Don't know what it means by imap ID (maybe UID) or how it can be "fake".
I was referring to this:
https://hg.mozilla.org/comm-central/rev/b7bb63ac09ccfc8635dad5a46a986caa679ae11c
Though I could certainly be off-base.
From comment 25:
Using the add-on (semi-)reliably creates the IMAP folder corruption. The add-on puts a new message into the drafts folder, which is async, and then calls compact (old or new) on the folder, so those two operations likely run interleaved and cause the issue. Just a hypothesis, of course.
May be better to just force an imap expunge of drafts so it doesn't directly cause a compact of mbox. Not sure if that's possible from addon code.
This is a good idea, thanks. I've implemented this in version 10.5.8 of Send Later which I've just put out as a beta release.
However, it's important to note that this is just working around the corruption issue. As others have noted, Send Later isn't the only thing that causes the corruption, and so we really should get to the root cause of the corruption and fix it.
Having said that IMPORTANT NOTE if you want to reproduce this issue using Send Later as described above, you need to use version 10.5.1, because newer versions have workarounds for the corruption issue in them.
Comment 28•2 months ago
|
||
(In reply to Jonathan Kamens from comment #27)
https://hg.mozilla.org/comm-central/rev/b7bb63ac09ccfc8635dad5a46a986caa679ae11c
That's from bug 1873282 in TB 124. Why don't you try TB 122, 123, 124 and 125 beta. Unless I made a mistake, the bug reproduced in 126 beta.
Having said that IMPORTANT NOTE if you want to reproduce this issue using Send Later as described above, you need to use version 10.5.2, because newer versions have workarounds for the corruption issue in them.
I was using 10.5.7 which was current yesterday and the bug was reproduced. What am I missing?
Reporter | ||
Comment 29•2 months ago
|
||
Several things:
- The reason why you can reproduce it with 10.5.7 and I can't is because, as you mentioned earlier, it's a timing-related issue and apparently the timing works differently for you than for me.
- I misspoke before. The newest release without any workarounds in it is 10.5.1, so that's the best release to try to reproduce the issue with. I have updated my comment above to reflect this.
- In one of my tests I observed a slightly different failure mode for one of the messages I scheduled with Send Later: it wasn't completely blank after I scheduled it, but Thunderbird didn't think it was a draft message, and when I double-clicked it to edit the message it didn't get loaded properly into the compose window.
- Now that I am able to reproduce the issue myself I was able to regress it with mozregression. Here is the result:
9:53.86 INFO: No more integration revisions, bisection finished.
9:53.86 INFO: Last good revision: c25819f500eb8146b59403d22a6db8c57360a586
9:53.86 INFO: First bad revision: f46382a0d7a9145e49fc8c096df4bc8c5351e14e
9:53.86 INFO: Pushlog:
https://hg.mozilla.org/comm-central/pushloghtml?fromchange=c25819f500eb8146b59403d22a6db8c57360a586&tochange=f46382a0d7a9145e49fc8c096df4bc8c5351e14e
It seems like Ben Campbell want to might want to take a look at this because it's apparently his commit that triggered the issue. I've tagged him here.
Comment 30•2 months ago
|
||
Yes, bug 1719121 was the likely candidate based on the symptoms, see bug 1890230.
Reporter | ||
Comment 31•2 months ago
|
||
Wait, we knew about this TB128 corruption issue five months ago, and yet we still opted to make TB128 the default download on thunderbird.net, and also start upgrading people to it automatically?
That seems... imprudent?
sigh
Comment 32•2 months ago
|
||
I must report, this is still happening to me, and I use local drafts folder, contrary to the recent SL release notes saying using local drafts fixes it. It does not, they still show empty. Even when not using Send Later. Thunderbird wipes the contents of drafts after saving them and reopening after some time.
Windows 11 Enterprise 23H2 22631.4169
128.2.3esr (64-bit) using maildir
SendLater 10.6.0
Comment 33•2 months ago
|
||
I reproduced the error again and here are my findings:
When the issue happens, the mbox file is not corrupt. However, the latest stored message is not in the mbox, but it is in the .msf. This can be verified by using a distinct subject for each message "sent later" and searching for that subject in both raw data and .msf. So the corruption lies in the mismatch between mbox and database.
It can also be seen running with
set MOZ_LOG_FILE=c:\temp\sendlater
set MOZ_LOG=mbox:5
Looking for lines containing the word "finish", I got this log which I annotated:
Start with one draft of size 2715 and let "Send Later" save another one:
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=2715
[Parent 12852: Main Thread]: I/mbox FINISH MSG stream=0x2318f775f20 folder=imap:// ... /INBOX/Drafts filesize=3445
Draft count 2.
Let "Send Later" save another draft:
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=3445
[Parent 12852: Main Thread]: I/mbox FINISH MSG stream=0x2318b60ccf0 folder=imap:// ... /INBOX/Drafts filesize=4177
Draft count 3.
Now compact from the first "Send Later" kicks in, finding three Drafts:
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=0
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=2715
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=3445
Let "Send Later" save another draft:
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=4177
[Parent 12852: Main Thread]: I/mbox FINISH MSG stream=0x23192b6a4a0 folder=imap:// ... /INBOX/Drafts filesize=4911
Draft count 4.
Let "Send Later" save another draft:
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=4911
[Parent 12852: Main Thread]: I/mbox FINISH MSG stream=0x23196071350 folder=imap:// ... /INBOX/Drafts filesize=5647
Draft count 5.
Now compact kicks in again and only finds four drafts because the fifth draft written from 4911 to 5647 was lost somehow:
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=0
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=2715
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=3445
[Parent 12852: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=4177
So the draft was written as 5th message at position 4911 bringing the Drafts mbox to length 5647. However, despite being written, a subsequent compact only finds four messages, the last one, which was written at position 4911 is gone. Being in the database leads to an empty display when clicking the 5th message in the message list.
For the record: I am using Send Later 10.5.7 which appears to have the longest delay between saving the draft and kicking off the compact. That allows saving another draft before the compact of the previous one kicks in, leading to some interleaving action that ultimately appears to cause the corruption.
Reporter | ||
Comment 34•2 months ago
|
||
(In reply to venomtek from comment #32)
I must report, this is still happening to me, and I use local drafts folder, contrary to the recent SL release notes saying using local drafts fixes it. It does not, they still show empty. Even when not using Send Later. Thunderbird wipes the contents of drafts after saving them and reopening after some time.
I am sorry to hear that. I'm curious: do you typically schedule multiple messages to be sent with Send Later at the same time? Also, do you use recurring messages in Send Later? Given that this appears to be some sort of race condition / missing file locking issue which is causing multiple things to be written into the Drafts folder at the same time, it seems like having multiple Send Later messages go out at the same time might trigger it.
Comment hidden (offtopic) |
Comment hidden (offtopic) |
Comment hidden (offtopic) |
Comment 38•2 months ago
|
||
Please stay on the topic here. @Jonathan: Can you confirm comment 33?
Reporter | ||
Comment 39•2 months ago
|
||
(In reply to Francesco from comment #38)
Please stay on the topic here. @Jonathan: Can you confirm comment 33?
Not sure what you're looking for confirmation of. What you wrote sounds correct to me.
Comment 40•2 months ago
|
||
When you see the issue (message saved as draft with a unique subject, when clicking on the draft, it shows empty and no "this is a draft message" notification) open the raw message file Drafts and Drafts.msf with a text editor and search for the unique subject. If my testing is correct, it will only be found in the .msf.
Reporter | ||
Comment 41•2 months ago
|
||
I don't understand why you need me to confirm this. It's a waste of my time.
I feel like this is not the first time in this bug you have asked me to do work that either was unnecessary or it was unfair to ask me to do.
Especially since, as we've established, (a) this bug was known long before I created this bug ticket, (b) we've identified the specific commit that caused it, and (c) we've identified a reproducible test case. At this point the Thunderbird developers responsible for the bug could be using that test case to reproduce, identify, and fix it, rather than continuing to ask end users (me or anyone else) to collect additional information.
Please stop asking me to do additional work to collect additional information about this bug. The developers already have all the information they need to be able to find and fix it. I am frustrated by how much of my time this has taken up, frustrated by how much time it is going to take up for me to support all of the Send Later users who are going to run into this issue as they are upgraded automatically to Thunderbird 128, and frustrated by the negative impact this is going to cause on the user experience I try to provide to users of my extension.
Comment 42•2 months ago
|
||
You've thankfully already invested the time to find the regression, so with another five minutes effort you could confirm that the corruption is of the form "message in database but not raw mbox file". That would give the developers more clues on where to start looking. Note that at least two other corruption types would be thinkable. 1) mbox file corrupt as such 2) database has incorrect offsets into the raw data. As far as I know, so far no other skilled user has done a "post mortem" on a real corruption case.
Comment 43•2 months ago
|
||
Another update: Using version 10.5.1 of the add-on leads to immediate corruption, it's not even necessary to run it twice, no interleaving, it's broken straight away. Here is the log:
Start with one draft of size 737 and let "Send Later" save another one:
[Parent 2188: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=737
[Parent 2188: Main Thread]: I/mbox FINISH MSG stream=0x14042b5b890 folder=imap:// ... /INBOX/Drafts filesize=1423
Draft count 2.
Now compact from "Send Later" kicks in, finding only the original draft:
[Parent 2188: Main Thread]: D/mbox MboxMsgOutputStream::Finish() startPos=0
If you open the raw mbox file, it's 737 bytes long with the original draft in it, not the one saved by "Send Later". That got apparently got compacted away.
Comment 44•2 months ago
|
||
Removed - with "Mark scheduled drafts as read"
from the summary since it appears to be confusing. According to comment 21 the option "Mark scheduled drafts as read" should be disabled.
Updated•2 months ago
|
Assignee | ||
Comment 45•1 months ago
|
||
(In reply to Jonathan Kamens from comment #27)
From comment 25:
Using the add-on (semi-)reliably creates the IMAP folder corruption. The add-on puts a new message into the drafts folder, which is async, and then calls compact (old or new) on the folder, so those two operations likely run interleaved and cause the issue. Just a hypothesis, of course.
Out of interest, I did just stumbled upon an example of waiting for the send to complete:
https://searchfox.org/comm-central/source/mailnews/imap/test/unit/test_saveImapDraft.js#59
However, it's important to note that this is just working around the corruption issue. As others have noted, Send Later isn't the only thing that causes the corruption, and so we really should get to the root cause of the corruption and fix it.
Definitely. I'm digging into it. I figured test_saveImapDraft.js would be a good place to put a boiled down, easy-trigger version of the bug to aid the process, but I've run out of time on it today. Will pick it up tomorrow.
If anyone fancies getting that test to properly fail for me, that'd be great :-) it needs to do a compact after the save - not too hard, but complicated by the fact you need to add and delete a message or two to get the compaction to actually run, rather than doing an early-out...
Comment 46•1 month ago
|
||
FWIW, I didn't have the impression that anything was deleted, "Send Later" just adds a new draft.
Assignee | ||
Comment 47•1 month ago
|
||
(In reply to Francesco from comment #46)
FWIW, I didn't have the impression that anything was deleted, "Send Later" just adds a new draft.
No, but compaction early-outs if there's nothing to be compacted... it's all a little convoluted for IMAP because it's all annoyingly tied in with the expunge, but having some deleted messages in the folder will be required to trigger the bug I think. The deletions can occur before the send - I think the compaction just needs something to chew on or it'll just exit without doing anything.
Comment 48•1 month ago
|
||
I don't want to insist, but the bug is triggered when you start with an empty Drafts folder which was previously compacted and repaired. Or you start with a drafts folder with one draft, compact and repair, Ctrl+E on the draft send later with the add-on, and you get the logging from comment 43. I think it depends on the timing of the IMAP server, if you can't get it to fail, I can supply a test account where it fails for me every time.
Assignee | ||
Comment 49•1 month ago
•
|
||
Thanks - I managed to get it to occur first time with a tame local imap server, so it looks promising, and I shall keep digging.
(this was with the 10.5.1 addon, with a hacked manifest to allow it to load into my local debug build)
Assignee | ||
Comment 50•1 month ago
|
||
Spotted some weird behaviour with the IMAP drafts folder and am looking at it over in Bug 1921694. Not sure if it's causing the issue here, or if it's just an extra obfuscating factor...
(but when I was trying to replicate the bug, I did notice that when I manually killed the Drafts-1
and Drafts-1.msf
files and did a folder repair, I wasn't able to get this bug to trigger, so my suspicions are that it's related).
Comment 51•1 month ago
•
|
||
What's happening is:
- the message is recorded in the database
- the message is sent to the server
- compacting begins, this blocks retrieving messages from the server
- compacting ends
- the message is retrieved from the server, but not stored in the mbox file
You can see this happening by running comment 21's STR with MOZ_LOG=IMAP:4,compact:5
environment variables.
I think 1 is done as an optimisation so that moving messages appears to happen quickly.
I think 5 doesn't store the retrieved message because the code to do so has a the old (pre-compact) file handle. We should confirm and fix this.
Additionally, I don't think that add-ons should be calling for folders to be compacted, unless that is what the add-on's purpose is. Surely there is a better way to solve the problem.
Comment 52•1 month ago
|
||
the code to do so has a the old (pre-compact) file handle.
Would a stale file handle also explain bug 1890230?
Reporter | ||
Comment 53•1 month ago
|
||
(In reply to Geoff Lankow (:darktrojan) from comment #51)
Additionally, I don't think that add-ons should be calling for folders to be compacted, unless that is what the add-on's purpose is. Surely there is a better way to solve the problem.
- I've switched from doing a compact to doing an IMAP expunge, as suggested above, which is better.
- I agree, I shouldn't have to do the expunge either, but I don't know of any other way to tell Thunderbird to clear the deleted drafts from Gmail's Drafts folder. If you know of a better way, I'm all ears.
Comment 54•1 month ago
|
||
(In reply to Geoff Lankow (:darktrojan) from comment #51)
What's happening is:
- the message is recorded in the database
- the message is sent to the server
- compacting begins, this blocks retrieving messages from the server
- compacting ends
- the message is retrieved from the server, but not stored in the mbox file
I'm not quite right here. The message is written to the mbox file before the compact.
It doesn't show up for some reason when we enumerate the original database, even though it is there. And it is in the compacted database afterwards, maybe that is from the network operations finishing.
Comment 55•1 month ago
|
||
I haven't yet tried to reproduce using the add-on.
I found a different way to create corruption and dataloss.
Use an IMAP account, ensure it is configured to store drafts on the server (not locally).
I used an empty draft folder.
Compose an email.
Attach several image files, I used a total size of around 100 MB.
Hit CTRL-S to save a draft, which will take a couple of seconds.
Very quickly switch to the main thunderbird window, right click the folder, and select compact.
After the save operation is done, the folder will show one entry.
Clicking the entry will not show any message.
The file on disk is zero bytes.
After restarting TB is still zero bytes.
I think this shows that in current versions 128.x there is no (or insufficient or nonworking) protection against multiple operations modifying the same file in parallel.
Using 115, I see active protection. Using the same steps, the save operation continues, and afterwards I get an error message, saying that it's not possible to run compact currently, because another operation is ongoing.
Afterwards, the folder contains the expected saved draft.
Assignee | ||
Comment 56•1 month ago
|
||
Thanks, Kai - that's a nice alternate way to reproduce this!
I'm reasonably confident that the mbox-writing and compaction operations are coordinating correctly.
The logging shows them in relation to each other:
MOZ_LOG="IMAP:4,compact:5,mbox:4"
The problem is that the compaction doesn't see the message in the database, so it removes it from the mbox.
So I think it's down to the way the database is used by compose and draft saving, rather than mbox and compact conflicting.
The vast bulk of the time in your 100MB message case is uploading the message to the server, and that shouldn't be blocking local mbox writing or compaction (although it won't be able to issue and IMAP expunge).
(I've definitely seen compaction attempts pop up a message saying that it's not possible to run compact currently, because another operation is ongoing. I hacked up a script yesterday to continuously pump messages into a local IMAP server, one every 100 milliseconds, and it was easy to get that to happen).
I don't think this is the cause of bug 1890230. The symptoms there are corrupted messages rather than completely missing ones, as in this case.
Assignee | ||
Comment 57•1 month ago
|
||
Comment 58•1 month ago
|
||
(In reply to Geoff Lankow (:darktrojan) from comment #51)
What's happening is:
- the message is recorded in the database
- the message is sent to the server
- compacting begins, this blocks retrieving messages from the server
Step 2b: the message is removed from the database. Investigating why but it's probably a temporary entry until the message is uploaded to the server.
Assignee | ||
Comment 59•1 month ago
|
||
That patch adds some logging to the compaction when it's enumerating the database. It now also shows the messages in the DB which it rejects, either because they've got pendingRemoval set
, or because they just don't have an offline copy in the first place (i.e. no storeToken
).
Doesn't really help in this case - it's that Step 2b causing the problem by removing the message from the DB.
Comment 60•1 month ago
|
||
Pushed by alessandro@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/25a40c2b958c
Tweak logging on mbox and folder compaction. r=tobyp
Comment 61•25 days ago
|
||
This can still be reproduced in Daily, even after bug 1923541, bug 1720047, bug 1923747 after bug 1923765 were fixed.
Comment 62•25 days ago
|
||
(In reply to Francesco from comment #61)
This can still be reproduced in Daily, even after bug 1923541, bug 1720047, bug 1923747 after bug 1923765 were fixed.
Can you provide a log to Ben?
Comment 63•25 days ago
|
||
No need, according to comment #49 he can reproduce it himself.
Assignee | ||
Comment 64•25 days ago
|
||
Yep, I think I know what's happening, but just haven't had time to try a fix.
It's more a quirk of how IMAP works.
What's happening is that a message entry (just the metadata - title, from, to etc etc) gets added to the database and the whole raw message is written to the local message store. Then the IMAP starts a copy to the the server, and moves the message entry out of the database into a holding place. When the server responds saying the server-side operation has finished, it then moves the message entry back into the database again. It's not entirely clear why it does this.
This causes a problem here, because the folder compaction can easily be kicked off before the server responds. It sees that the message isn't in the database and throws it away the full message in the local message store (that's exactly what folder compaction is supposed to do - remove messages which are no longer in the database!).
I'd like to change it so it doesn't do the temporary move of message entry out of the DB at all... but there could be some obscure reason for doing that way, so the ramifications need to be tracked down.
Comment 65•25 days ago
|
||
Thanks for the explanation, but why is this regression caused but the mbox rewrite? Not even by the new compaction.
Assignee | ||
Comment 66•25 days ago
|
||
(In reply to Francesco from comment #65)
Thanks for the explanation, but why is this regression caused but the mbox rewrite? Not even by the new compaction.
Hard to tell. The reason might become obvious as I work on a fix, or it might not :-)
The old mbox code was spread out all over the place, and there were hacks and workarounds everywhere. A lot of code ended up depending on very specific side effects or timings, and systems were very tightly intertwined with each others implementation. So when you tease it apart, the undocumented assumptions bite you in subtle ways. The whole drive behind this stuff is to tackle that tech debt and get the codebase into a tractable state.
Comment 67•23 days ago
|
||
While working on the best fix for this issue, we're trying an experiment in parallel.
If you're willing to test the experimental fix, please have a look at bug 1925085.
Thank you!
Comment 68•17 days ago
|
||
Dear Johnathan or others, we have updated another version with potential fixes, 128.3.3
If it's easy for you, would you be please able to help us test whether the issue has been fixed?
Thank you in advance
Comment 69•17 days ago
|
||
Not fixed in Daily (see comment 61) and current beta (132.0b6).
Comment 70•17 days ago
|
||
128.3.3 is available via Help > About
Updated•17 days ago
|
Reporter | ||
Comment 71•17 days ago
•
|
||
Presumably if, as Francesco indicates in comment 69, it's not fixed in daily, then it's also not fixed in 128.3.3, since fixes can't go into production releases without passing through main first, right?
Updated•17 days ago
|
Comment 72•17 days ago
|
||
Thanks, you're probably right.
Comment 73•17 days ago
|
||
I confirm the steps from comment 21 still reproduce the bug with version 128.3.3
Comment 74•17 days ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #67)
While working on the best fix for this issue, we're trying an experiment in parallel.
If you're willing to test the experimental fix, please have a look at bug 1925085.
FTR, version 10.5.1 works fine with the experimental version from bug 1925085.
Comment 75•17 days ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #74)
FTR, version 10.5.1 works fine with the experimental version from bug 1925085.
Not surprising, since that backs out the mbox rewrite which is the regressor here. But is it a valid way forward to back out about 10 months worth of related patches? And it doesn't fix the beta channel nor future release or ESR versions. Shouldn't all available resources be focused on fixing the problem?
Comment 76•17 days ago
|
||
(In reply to Francesco from comment #75)
Not surprising, since that backs out the mbox rewrite which is the regressor here. But is it a valid way forward to back out about 10 months worth of related patches? And it doesn't fix the beta channel nor future release or ESR versions. Shouldn't all available resources be focused on fixing the problem?
I usually refrain from engaging in the philosophical discussions that invariably arise in a long-toothed bug report, however this question seems worth of treating non-rhetorically.
Yes, I agree that all available resources should be focused on fixing the problem.
I also believe that this bug is terrible enough (corruption of user data!) that it should immediately be taken out of production until such time as a provable fix can be rolled back into production. This has been in production for over THREE months, and it sounds like the experiment of rolling back this (faulty) mbox rewrite eliminates it. Hooray, we have a fix!
If I were the product manager — and, let's be clear, I am not — I would immediately roll out releases for all production versions that remove the faulty mbox rewrite code (reverting to the previous, stable code) and then work (behind the scenes) on fixing the new mbox code for a potential future release once it can be proven stable.
User data corruption is nothing to sneeze at, and the fact that we have a known fix for it that hasn't yet been rolled into production feels irresponsible to me.
But again, I'm just a user. Or perhaps that makes me the most qualified participant in this exchange? ¯_(ツ)_/¯
Comment 77•14 days ago
|
||
We are all users. But backout is not a subject for this bug report, resolving the issue through code improvement is the sole purpose. (And FWIW it is recently been determined that backout isn't going to be possible).
(In reply to Jonathan Kamens from comment #71)
Presumably if, as Francesco indicates in comment 69, it's not fixed in daily, then it's also not fixed in 128.3.3, since fixes can't go into production releases without passing through main first, right?
True. But a possibility exists that the issue, "it", is slightly different for some users on daily?
Comment 78•13 days ago
|
||
(In reply to Jonathan Kamens from comment #21)
Detailed reproduction steps have already been provided. Here are the most basic repro steps I can come up with:
I just reproduced this behavior on Linux (Fedora 40), Thunderbird 128.3.2esr (64 bits), Send Later 10.5.1, with the steps you describe.
Message in the drafts empty, but if I repair the folder, it appears complete again.
Comment 79•13 days ago
|
||
(In reply to nicod from comment #78)
Oh, sorry, I see this information isn't relevant any more...
Comment 80•4 days ago
|
||
I've tried to investigate what's happening.
When we reach BuildKeepMap, it doesn't see the new message and decides there's nothing to keep.
I see that the draft message is saved to a file, /tmp/nsemail.eml
Next, the message is copied to the local IMAP mbox file.
While doing so, we arrive at
https://searchfox.org/comm-esr128/rev/6ac92c44b4f3c3edcc968bf1916f75f7f5793c23/mailnews/imap/src/nsImapMailFolder.cpp#7660-7673
#0 nsMailDatabase::GetOfflineOpForKey (this=0x7fffa9ed70c0, msgKey=46, create=true, offlineOp=0x7fffffffae40) at /home/user/moz/comm-esr128/mozilla/comm/mailnews/db/msgdb/src/nsMailDatabase.cpp:155
#1 0x00007fffed02d9d7 in nsImapMailFolder::CopyFileToOfflineStore (this=0x7fffa0cc1c00, srcFile=0x7fffaf98bd00, msgKey=46)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/imap/src/nsImapMailFolder.cpp:7683
#2 0x00007fffed00ac4a in nsImapMailFolder::OnCopyCompleted (this=0x7fffa0cc1c00, srcSupport=0x7fffaf98bd00, rv=nsresult::NS_OK)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/imap/src/nsImapMailFolder.cpp:7775
#3 0x00007fffed01a798 in nsImapMailFolder::OnStopRunningUrl (this=0x7fffa0cc1c00, aUrl=0x7fffa7a1bd08, aExitCode=nsresult::NS_OK)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/imap/src/nsImapMailFolder.cpp:5162
#4 0x00007fffece4b096 in nsMsgMailNewsUrl::SetUrlState (this=0x7fffa7a1bd08, aRunningUrl=false, aExitCode=nsresult::NS_OK)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/base/src/nsMsgMailNewsUrl.cpp:236
#5 0x00007fffed023efa in nsImapMailFolder::SetUrlState (this=0x7fffa0cc1c00, aProtocol=0x7fff94d5c300, aUrl=0x7fffa7a1bd08, isRunning=false, aSuspend=false, statusCode=nsresult::NS_OK)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/imap/src/nsImapMailFolder.cpp:6392
#6 0x00007fffed0c8f9b in (anonymous namespace)::SyncRunnable5<nsIImapMailFolderSink, nsIImapProtocol*, nsIMsgMailNewsUrl*, bool, bool, nsresult>::Run (this=0x7fffa62f7c00)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/imap/src/nsSyncRunnableHelpers.cpp:223
The comment above explains that it queues an offline move operation.
Shortly after this code, I see that the mbox file is successfully written, and the .msf index file is updated, too.
Later we go through nsImapOfflineSync::ProcessNextOperation
#0 nsImapOfflineSync::ProcessNextOperation (this=0x7fffa9f95980) at /home/user/moz/comm-esr128/mozilla/comm/mailnews/imap/src/nsImapOfflineSync.cpp:750
#1 0x00007fffecff1ad7 in nsImapMailFolder::UpdateFolderWithListener (this=0x7fffa0cc1c00, aMsgWindow=0x7fff97d50ca0, aUrlListener=0x0)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/imap/src/nsImapMailFolder.cpp:772
#2 0x00007fffed01a7f2 in nsImapMailFolder::OnStopRunningUrl (this=0x7fffa0cc1c00, aUrl=0x7fff969aa408, aExitCode=nsresult::NS_OK)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/imap/src/nsImapMailFolder.cpp:5165
#3 0x00007fffece4b096 in nsMsgMailNewsUrl::SetUrlState (this=0x7fff969aa408, aRunningUrl=false, aExitCode=nsresult::NS_OK)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/base/src/nsMsgMailNewsUrl.cpp:236
#4 0x00007fffed023efa in nsImapMailFolder::SetUrlState (this=0x7fffa0cc1c00, aProtocol=0x7fff935cf300, aUrl=0x7fff969aa408, isRunning=false, aSuspend=false, statusCode=nsresult::NS_OK)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/imap/src/nsImapMailFolder.cpp:6392
#5 0x00007fffed0c8f9b in (anonymous namespace)::SyncRunnable5<nsIImapMailFolderSink, nsIImapProtocol*, nsIMsgMailNewsUrl*, bool, bool, nsresult>::Run (this=0x7fffa6c26800)
at /home/user/moz/comm-esr128/mozilla/comm/mailnews/imap/src/nsSyncRunnableHelpers.cpp:223
There is a call to DeleteMessage above, which might be related.
So apparently there is a race between this queued offline operation and the compaction happening in parallel.
I don't fully understand this yet, however,
if I completely remove the above code block, the bug doesn't occur.
I think we should find a way to check whether there are any queued actions, pending messages or offline actions.
If there are any, we shouldn't compact.
Comment 81•4 days ago
|
||
But regardless whether we start compact, or postpone it in this special situation.
The effect currently is that we get an entry in the index .msf file, which points to a message position in the mbox file that doesn't exist (pointing to the position at the end of the file).
And worse, if I continue with that incorrect state of files, and manually save another message to the drafts folder, then I will get an additional entry, which points to the new message, which is the same position that was previously the end of the file, and thus the same position as the previous message.
As a result, we have two messages showing in the message list, both pointing to the same message.
Comment 82•4 days ago
|
||
Ok, this happens:
nsImapMailFolder::CopyFileToOfflineStore
- adds a header/entry to the msf database.
- queues nsIMsgOfflineImapOperation::kMoveResult
- writes to the mbox file
- very soon after, the queued offline operation is processed,
and the mesage is deleted from the database (that's causing the issue) - soon after, we run through compact
- it doesn't see the message in the msf database, so it doesn't add it for keeping
- compact proceeds and rewrites the mbox file, the message is gone
- later, apparently a second part of that offline operation continues,
and it adds the header to the database (which will point to nowhere)
We have insufficient protection against parallel activity on the message storage.
I think this must to be solved more generally.
We need to track that an operation is currently ongoing, and any parallel activity needs to be able to see that and wait/postpone.
The queued operation must not be allowed to run while our initial storage operation hasn't been completed yet.
Assignee | ||
Comment 83•4 days ago
|
||
(just pasting this for reference, from another discussion)
It's down to the wacky way IMAP handles "prospective" message entries in the database. There are a few different mechanisms involved that seem to have evolved separately, but the one in play here actually removes messages from the main table and puts them into a pseudohdr list, waiting for the server to tell us that the message has appeared on the server. The message body is written to the mbox file, but when the compaction runs before the server has got back to us, the db doesn't have the message (it's waiting in pseudohdrs), so it gets removed. Later on, when the server responds, it copies the pseudohdr back out to the db proper, but the mbox data has already been compacted away.
I think the problem is that the pseudohdr is retaining the mbox data (storeToken, offlineMessageSize), which have been invalidated by the compaction.
There are possibly some workarounds for this (allowing the compaction to also peek into the pseudoheaders data to update the storeToken there)... but it's pretty icky (the pseudohdrs aren't currently visible outside imap code).
That's what I'm looking at at the moment.
The proper fix is to fix copying on server-backed folders. There should be no separate pseudohdr list (or any of the other mechanisms IMAP uses!). The immediate local copy should be marked 'speculative', pending eventual reconciliation (or rejection) from the server.
The sequence here is:
- There's a draft message in a file on disk.
- IMAP is told to upload that file to the mailbox on the server.
- We wait until that's done (we have no concept of that message locally yet, it's just in the file)
- nsImapMailFolder::CopyFileToOfflineStore() is invoked to install the message into the local database
- The messageheader is shunted out to pseudohdrs (ie removed from the db)
- compaction is invoked. byebye message in mbox
- the server says "Hey! I've got a new message!"
- It's matched to the pseudohdr and shifted back out into the database proper
so we've got the message in the db, but the local mbox data has been zapped.
And the entry in the db is probably still pointing into the mbox, at the now-missing message.
I'm got a few ideas on how to fix this, just running through the alternatives now.
Description
•