Corrupted Imap folder leads to 100% CPU and unbounded memory growth
Categories
(Thunderbird :: General, defect)
Tracking
(thunderbird_esr128 verified, thunderbird132 fixed, thunderbird133 fixed)
People
(Reporter: KaiE, Assigned: KaiE)
References
(Blocks 1 open bug)
Details
(Keywords: regression)
Attachments
(2 files)
48 bytes,
text/x-phabricator-request
|
corey
:
approval-comm-release+
corey
:
approval-comm-esr128+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
corey
:
approval-comm-beta+
corey
:
approval-comm-release+
|
Details | Review |
I have the folder corruption that is described in bug 1890230 (a cached IMAP folder mbox file).
Prior to the corruption, I had several old messages open in tabs.
When starting Thunderbird 128.x with the corruption, Thunderbird becomes unusable.
I have identified that it is related to corrupted messages being open in tabs.
If I remove the session.json file prior to starting Thunderbird, which means no tabs will be open on startup, the resource problem goes away, and Thunderbird is usable (with the corrupted messages present).
I have identified one of the corrupted messages, which is a mixed text and html message.
With no tabs open, if I manually select that specific message, I get the same symtpom, stuck process, CPU and memory exhaustion.
In a previous session, while hunting for the cause of the resource issue, I used the debugger, and saw it was 1400 levels, recursively, deep inside the moz platform layout code, trying to render the dom/html.
(I haven't yet finally confirmed with this particular message that it's the same issue. I could do that later, if it's deemed necessary.)
When in this situation, the OS repeatedly shows "thunderbird doesn't respond, wait or force quit?".
My only chance to get out of the situation is to force quit.
Assignee | ||
Comment 1•1 month ago
|
||
For everyone who is affected by this bug, you can delete the session.json file from the profile folder, but this will close all of your open tabs.
(If you had used this as a way to remind yourself about some old messages, which want to follow up to, you'll lose that reminder.)
Assignee | ||
Comment 2•1 month ago
|
||
We should investigate what message contents is given to the mail display.
I took a shortcut, and set MOZ_LOG="MIME:5".
I see that will print out the full message.
When clicking my test message, a large block of data is used. 28 MB.
I found out, this is the full contents of my mailbox file, until the end of the file.
I'm guessing, the code is searching for the trailing MIME boundary, cannot find it, and proceeds to take all remaining input data.
In addition, there is an transformation applied to that data, which surprises me:
In the headers that follow in the rest of the data, several character sequences and replaced with special ASCII codes.
For example, this header:
ARC-Seal: i=3; a=rsa-sha256; t=1725542399; cv=pass;
is replaced by
ARC-Seal: i=3; a=rsa-sha256; t<NON-PRINTABLE BYTE VALUE 17>25542399; cv=pass;
It seems our display-conversion code, or our rendering code, isn't able to handle that input in a reasonably way, and simply tries to process, which requires a lot of memory and CPU.
Comment 3•1 month ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #2)
I found out, this is the full contents of my mailbox file, until the end of the file.
If you look at the mbox file, is the message intact? The mbox reader should stop as soon as it sees the separator indicating the end of the message (i.e. blank line + "From "...).
Is the .storeToken for the message in the database correct? Does it point to the "From " line indicating the beginning of the message?
(before 128.3.1 there was also .messageOffset, so that's worth checking too)
I'm guessing, the code is searching for the trailing MIME boundary, cannot find it, and proceeds to take all remaining input data.
Sounds like the mime code could do with a little sanity checking, but I think that's a secondary symptom here. The core issue is that it's being fed crap data for one reason or another.
Assignee | ||
Comment 4•1 month ago
|
||
(In reply to Ben Campbell from comment #3)
(In reply to Kai Engert (:KaiE:) from comment #2)
I found out, this is the full contents of my mailbox file, until the end of the file.
If you look at the mbox file, is the message intact? The mbox reader should stop as soon as it sees the separator indicating the end of the message (i.e. blank line + "From "...).
--------------9y6NuBmfkwTOfs9vxFxgKFrom
...
Content-Type: multipart/alternative; boundary="0000000000002675fb06215e4fb1"
...
--0000000000002675fb06215e4fb1
Content-Type: text/plain; charset="UTF-8"
...
Hello...
--0000000000002675fb06215e4fb1
Content-Type: text/html; charset="UTF-8"
...
<div ...
--0000000000002675fb06215e4From
Assignee | ||
Comment 5•1 month ago
|
||
(In reply to Ben Campbell from comment #3)
Is the .storeToken for the message in the database correct? Does it point to the "From " line indicating the beginning of the message?
(before 128.3.1 there was also .messageOffset, so that's worth checking too)
Can you please explain how I find out?
Assignee | ||
Comment 6•1 month ago
|
||
(In reply to Ben Campbell from comment #3)
Sounds like the mime code could do with a little sanity checking, but I think that's a secondary symptom here. The core issue is that it's being fed crap data for one reason or another.
Can you explain why the old code (pre bug 1719121) can handle the same mbox file fine, and doesn't continue using the following messages?
Comment 7•1 month ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #4)
--------------9y6NuBmfkwTOfs9vxFxgKFrom ... Content-Type: multipart/alternative; boundary="0000000000002675fb06215e4fb1" ... --0000000000002675fb06215e4fb1 Content-Type: text/plain; charset="UTF-8" ... Hello... --0000000000002675fb06215e4fb1 Content-Type: text/html; charset="UTF-8" ... <div ... --0000000000002675fb06215e4From
Ahh, yep. The corrupted mbox has "From " separators starting in the middle of messages, without the preceding blank line.
So the mbox parser never sees them, and it all just looks like one huge big message, all the way to EOF.
Comment 8•1 month ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #5)
(In reply to Ben Campbell from comment #3)
Is the .storeToken for the message in the database correct? Does it point to the "From " line indicating the beginning of the message?
(before 128.3.1 there was also .messageOffset, so that's worth checking too)Can you please explain how I find out?
Unfortunately there's no simple way to peek about in mork databases directly - it's easiest to do it from within TB. You need to get the nsIMsgDBHdr object for the message. You ca
(In reply to Kai Engert (:KaiE:) from comment #6)
(In reply to Ben Campbell from comment #3)
Sounds like the mime code could do with a little sanity checking, but I think that's a secondary symptom here. The core issue is that it's being fed crap data for one reason or another.
Can you explain why the old code (pre bug 1719121) can handle the same mbox file fine, and doesn't continue using the following messages?
The old code everywhere assumed that everything was an mbox. Code wanting to consume a message was use an inputstream on the raw mbox file, an offset (where the message begins) and the message size. And it would seek to the offset and have to be careful about not reading past the size. And it would have to unescape "From " lines in the message body. This was pretty inconsistent - there were little quirks all over the place and it made maildir very tricky.
The new code returns an inputstream which contains just that single message with all the From-unescaping already applied. Internally it works by parsing the mbox stream, looking for separator lines. And if the separator lines are borked (as in this case), it won't see the end of a message.
The messagesize data is still in the msg hdr, so it could potentially be used as a sanity check to detect reads on corrupted messages... pass it into the mbox parser as a "here's roughly how much we're expecting, bail out early with an error if you overshoot too much" kind of thing...
Comment 9•1 month ago
|
||
So there's two separate problems here:
- The mbox file is corrupted. Why? It looks to me like something has either miscalculated the number of bytes to write (failed to write
fb1--\r\n\r\n
), or gone to the wrong point in the file before writing the next message (overwrote). Kai, what does the end of the file look like? - The file to display pipeline no longer detects the start of the next message, so it carries on.
Let's not conflate them. Pick one to fix here and file another bug about the other.
Comment 10•1 month ago
|
||
(In reply to Geoff Lankow (:darktrojan) from comment #9)
So there's two separate problems here:
- The mbox file is corrupted. Why? It looks to me like something has either miscalculated the number of bytes to write (failed to write
fb1--\r\n\r\n
), or gone to the wrong point in the file before writing the next message (overwrote). Kai, what does the end of the file look like?- The file to display pipeline no longer detects the start of the next message, so it carries on.
Regarding 2, the new mbox msgStore code provides a stream limited to a single message, so the display/mime side of things can't be responsible for overrunning the end of the message as it should never even see it. However in this case the mbox separators are not at the start of a line, so the mbox code thinks the message just runs and runs...
Let's not conflate them. Pick one to fix here and file another bug about the other.
I 'd say Bug 1890230 covers the issue of the mbox getting borked, so this bug should be concentrating on the display/mime code.
In the description here, Kai said he opened a debugger and "saw it was 1400 levels, recursively, deep inside the moz platform layout code, trying to render the dom/html."
So I think the first step should be trying to craft a test which can put it into that death spiral.
Assignee | ||
Comment 11•1 month ago
|
||
(In reply to Geoff Lankow (:darktrojan) from comment #9)
Kai, what does the end of the file look like?
The trailing MIME boundary is complete, followed by 0d 0a.
There is just one pair of 0d 0a, not two. But that seems normal? I'm looking at IMAP mbox files in other profiles, which I haven't touched recently, and they (often) have no blank line in between the trailing MIME boundary and the following "From - $date" line, and no extra blank line at the end.
Assignee | ||
Comment 12•1 month ago
|
||
No extra/wrong data at the end of the file.
Assignee | ||
Comment 13•1 month ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #2)
In addition, there is an transformation applied to that data, which surprises me:
In the headers that follow in the rest of the data, several character sequences and replaced with special ASCII codes.
For example, this header:
ARC-Seal: i=3; a=rsa-sha256; t=1725542399; cv=pass;
is replaced by
ARC-Seal: i=3; a=rsa-sha256; t<NON-PRINTABLE BYTE VALUE 17>25542399; cv=pass;
Do you have a theory what could be causing that?
Comment 14•1 month ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #13)
(In reply to Kai Engert (:KaiE:) from comment #2)
For example, this header:
ARC-Seal: i=3; a=rsa-sha256; t=1725542399; cv=pass;
is replaced by
ARC-Seal: i=3; a=rsa-sha256; t<NON-PRINTABLE BYTE VALUE 17>25542399; cv=pass;Do you have a theory what could be causing that?
I was thinking that the transformation you mentioned was being performed by the mime/display code, but reading back it sounds more like that's what you're seeing in the mbox file?
17 is 0x11 hex
'=' is 0x61 hex
That's a single bitflip, and I have seen it before. I experienced it a while back and it turned out I had a bad RAM stick in my PC - there was a single bad bit in 24GB of memory, so it was pretty intermittent and frustrating!
Not saying it's bad RAM (or disk) for sure, but I'd certainly consider the possibility and run some tests just in case.
I can't really think of anywhere that does a lot of bit-oriented operations, other than some flag twiddling in the DB code... everything dealing with message data tends to be whole byte copying. Theoretically a rogue pointer which thinks it's pointing at a flag variable, but is really pointing at a buffer somewhere? I'm kind of stumped to come up with other plausible explanations really.
Assignee | ||
Comment 15•22 days ago
|
||
The seriousness of this bug is influenced by the setting "view message body as".
I can reproduce my original report using setting "original html".
Using setting "simple html", the effect is less serious.
There is a freezing period of a couple of seconds on my fast machine, but it eventually stops.
The message display is wrong. After the message, a lot of additional text is shown.
But at least, Thunderbird can become usable interactively.
With setting "plain text" there is even less blocking/freezing, probably because we don't feed distorted html to the rendering engine.
Also, in multipart/alternative messages, the text/plain part might be placed earlier than the text/html part, and maybe our code will stop processing when it finds the end marker of that part (which usually isn't truncated, if another part is following).
Assignee | ||
Comment 16•22 days ago
|
||
I have a fix in hand.
Assignee | ||
Updated•22 days ago
|
Assignee | ||
Comment 17•22 days ago
|
||
Assignee | ||
Comment 18•22 days ago
|
||
Tests do not yet pass, more work on the patch is needed.
Updated•21 days ago
|
Updated•19 days ago
|
Assignee | ||
Comment 19•19 days ago
|
||
Updated•19 days ago
|
Updated•19 days ago
|
Assignee | ||
Comment 20•19 days ago
|
||
Assignee | ||
Comment 21•18 days ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #20)
comm-esr128 try:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=2070b4df8d24b05059e06c643c7c1ac52a90ac99
The above looks good.
Started another try build for comm-central:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=16fa30a41ab9d42e2d17ff7ca5b0d897ff848a47
Assignee | ||
Updated•14 days ago
|
Comment 22•14 days ago
|
||
Pushed by arschmitz@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/94511c5cfd8d
Use a limit when when streaming exactly one message from mbox storage. r=BenC
Updated•13 days ago
|
Assignee | ||
Comment 24•7 days ago
|
||
Comment on attachment 9432487 [details]
Bug 1924291 - Use a limit when when streaming exactly one message from mbox storage. comm-esr128 version. r=benc
[Approval Request Comment]
Regression caused by (bug #): yes
User impact if declined: certain folder corruption causes thunderbird to freeze and become unusable
Testing completed (on c-c, etc.): yes
Risk to taking this patch (and alternatives if risky): In theory, this could result in messages being shown shorter than they are. But I haven't seen that scenario, all I've seen is messages being shown with extra "junk" at the end. This patch limits the amount of such "Junk" to a small size.
Comment 25•5 days ago
|
||
Comment on attachment 9433352 [details]
Bug 1924291 - Use a limit when when streaming exactly one message from mbox storage. r=benc
[Triage Comment]
Approved for beta
Comment 26•5 days ago
|
||
bugherder uplift |
Thunderbird 133.0b3:
https://hg.mozilla.org/releases/comm-beta/rev/f5aa072c0b65
Comment 27•5 days ago
|
||
Comment on attachment 9432487 [details]
Bug 1924291 - Use a limit when when streaming exactly one message from mbox storage. comm-esr128 version. r=benc
[Triage Comment]
Approved for esr128
Comment 28•5 days ago
|
||
Comment on attachment 9432487 [details]
Bug 1924291 - Use a limit when when streaming exactly one message from mbox storage. comm-esr128 version. r=benc
[Triage Comment]
Approved for release
Comment 29•5 days ago
|
||
Comment on attachment 9433352 [details]
Bug 1924291 - Use a limit when when streaming exactly one message from mbox storage. r=benc
[Triage Comment]
Approved for release
Comment 30•4 days ago
|
||
bugherder uplift |
Thunderbird 128.4.3esr:
https://hg.mozilla.org/releases/comm-esr128/rev/736137f9b4f0
Comment 31•2 days ago
|
||
bugherder uplift |
Thunderbird 132.0.1:
https://hg.mozilla.org/releases/comm-release/rev/287d23c71145
Comment 32•2 days ago
|
||
Hello!
We tried to corrupt our folders as described in bug 1890230 (a cached IMAP folder mbox file) but we did not succeed.
Kai, is there sa specific set of STR that QA could follow in order to properly reproduce the faulty behaviour and verify the fix?
Thank you in advance!
Assignee | ||
Comment 33•2 days ago
|
||
Vlad, I sent you an email with a sample mailbox.
Comment 34•2 days ago
|
||
I have been able to reproduce the faulty behaviour with the sample files provided by Kai using the affected build 133.0a1(20241011112458) from 2024-10-12.
Confirming this issue as verified fixed with 128.4.3esr(20241108210256) on macOS 14, Windows 11 and Ubuntu 24.
Description
•