Open Bug 1749276 Opened 3 years ago Updated 6 days ago

Crash getting mail [@ nsMsgLocalMailFolder::EndMessage ] with message filter copy/move, because server can be deleted while a copy operation is still in progress.

Categories

(MailNews Core :: Networking: IMAP, defect)

Thunderbird 94
Unspecified
All
defect

Tracking

(Not tracked)

ASSIGNED
101 Branch

People

(Reporter: wsmwk, Assigned: benc)

References

(Depends on 1 open bug, Blocks 1 open bug, )

Details

(Keywords: crash, leave-open)

Crash Data

Attachments

(6 files)

I crashed when Opening macbook, which triggered getting new mail.

Last fixed bug with this signature - Bug 772760 - crash in nsMsgLocalMailFolder::EndMessage, which was a regression of Bug 402392 - Support other message storage formats. (prelude to pluggable mail stores)

No other reports of this

Crash report: https://crash-stats.thunderbird.net/report/bp-91c517b7-a6b7-4e77-b7d5-115b61220110 96.0b4
MOZ_CRASH Reason: invalid-access.
Top 10 frames of crashing thread:

0 XUL nsMsgLocalMailFolder::EndMessage mailnews/local/src/nsLocalMailFolder.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb
1 XUL {virtual override thunk({offset(-8)}, nsCopyMessageStreamListener::EndMessage	?:
2 XUL nsImapMailFolder::EndMessage mailnews/imap/src/nsImapMailFolder.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb
3 XUL (anonymous namespace)::SyncRunnable2<nsIImapMailFolderSink, nsIMsgMailNewsUrl*, unsigned int>::Run mailnews/imap/src/nsSyncRunnableHelpers.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb
4 XUL mozilla::RunnableTask::Run releases/comm-beta:xpcom/threads/TaskController.cpp 

Correction, just one other report of this for 94 beta, which might have been me some months ago...

Crash report: https://crash-stats.mozilla.org/report/index/d9913510-6378-4e8c-a384-94b010211031

Reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS

Top 10 frames of crashing thread:

0 XUL nsMsgLocalMailFolder::EndMessage comm/mailnews/local/src/nsLocalMailFolder.cpp:2381
1 XUL {virtual override thunk} 
2 XUL nsImapMailFolder::EndMessage comm/mailnews/imap/src/nsImapMailFolder.cpp:5248
3 XUL  comm/mailnews/imap/src/nsSyncRunnableHelpers.cpp:121
4 XUL mozilla::RunnableTask::Run xpcom/threads/TaskController.cpp:467
5 XUL mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:770
6 XUL mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:606
7 XUL mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal xpcom/threads/nsThreadUtils.h:531
8 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1148
9 XUL NS_ProcessPendingEvents xpcom/threads/nsThreadUtils.cpp:432

Hmm, why don't we get a line number?

Flags: needinfo?(sancus)

I don't know what's going on there. It doesn't seem to be a missing symbol to me. There is this virtual override thunk in front of the function name which I don't understand, eg "{virtual override thunk({offset(-8)}, nsCopyMessageStreamListener::EndMessage". This only seems to be happening with Mac crash reports specifically.

I'll make a Backtrace support request.

Flags: needinfo?(sancus)

(In reply to Wayne Mery (:wsmwk) from comment #0)

I crashed when Opening macbook, which triggered getting new mail.

I crashed again, same macbook, same release 96.0b4. Also just after opening MacBook. In this case, I clicked on three messages in Inbox while message download and filtering is in progress. None of the message contents displayed because (I assume) header download was still in progress. So we might have steps to reproduce.

And same stack, with the same funky line numbers from backtrace.

Flags: needinfo?(mkmelin+mozilla)

I'm a bit confused here. Were you guys looking in Backtrace? The line numbers are all there except for the "virtual override thunk" line and it refers to a very short function so I don't think that missing one matters? There was a bug in crash-stats.thunderbird.net that was causing the line numbers to be skipped, though.

If you weren't looking in Backtrace, please do that if you see missing data on the crash-stats website...the link to Backtrace is "More" right next to the App Notes at the bottom.

Unless I'm missing something, the one line with a missing line number is just this function? https://searchfox.org/comm-central/source/mailnews/base/src/nsCopyMessageStreamListener.cpp#33

1 nsMsgLocalMailFolder::EndMessage(unsigned int)
hg:hg.mozilla.org/releases/comm-beta:comm/mailnews/local/src/nsLocalMailFolder.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:2404
2 {virtual override thunk({offset(-8)}, nsCopyMessageStreamListener::EndMessage(unsigned int))}
‎/Applications/Thunderbird.app/Contents/MacOS/XUL
3 nsImapMailFolder::EndMessage(nsIMsgMailNewsUrl*, unsigned int)
‎hg:hg.mozilla.org/releases/comm-beta:comm/mailnews/imap/src/nsImapMailFolder.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:5248
4 (anonymous namespace)::SyncRunnable2<nsIImapMailFolderSink, nsIMsgMailNewsUrl*, unsigned int>::Run()
‎hg:hg.mozilla.org/releases/comm-beta:comm/mailnews/imap/src/nsSyncRunnableHelpers.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:121
5 mozilla::RunnableTask::Run()
‎hg:hg.mozilla.org/releases/comm-beta:xpcom/threads/TaskController.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:468
6 mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&)
‎hg:hg.mozilla.org/releases/comm-beta:xpcom/threads/TaskController.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:771
7 mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&)
‎hg:hg.mozilla.org/releases/comm-beta:xpcom/threads/TaskController.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:607
8 mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::$_0>::Run()
‎hg:hg.mozilla.org/releases/comm-beta:xpcom/threads/nsThreadUtils.h:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:531
9 nsThread::ProcessNextEvent(bool, bool*)
‎hg:hg.mozilla.org/releases/comm-beta:xpcom/threads/nsThread.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:1183
10 NS_ProcessPendingEvents(nsIThread*, unsigned int)
‎hg:hg.mozilla.org/releases/comm-beta:xpcom/threads/nsThreadUtils.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:432
11 nsBaseAppShell::NativeEventCallback()
‎hg:hg.mozilla.org/releases/comm-beta:widget/nsBaseAppShell.cpp:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:87
12 nsAppShell::ProcessGeckoEvents(void*)
‎hg:hg.mozilla.org/releases/comm-beta:widget/cocoa/nsAppShell.mm:06340a7526cd2ad540b9c374a8c80b4c7ee0a9bb:505
0x1b6d28c58
‎/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation

Got it - the line numbers are at the very end of each line.

I crashed again - this time I wasn’t in the Thunderbird UI, and the macbook had not just opened.

Flags: needinfo?(mkmelin+mozilla)
Version: Thunderbird 91 → Thunderbird 94

Is it possible that Ben's patch in Bug 1744167 - nsIMsgLocalMailFolder.addMessageBatch() skips last line of message without line endings - made my crash more likely?

It's hard to say. I wouldn't expect so.
Ben? What do you think, see also comment 7.

Flags: needinfo?(mkmelin+mozilla) → needinfo?(benc)

Never say never, but I don't think my patch is a likely culprit. It does two things:

  1. Fixes the Flush() method of nsMsgLineBuffer, which was very broken... but I don't think it was ever called anyway (!).
  2. Fixes nsIMsgLocalMailFolder.addMessageBatch() to actually call Flush() on the linebuffer, so that it copes with messages that don't have a final terminating line ending (without flush, that last line would be discarded).

I think all other uses of nsMsgLineBuffer bodge their around the non-terminated-last-line issue (eg by adding a synthetic CRLF).

addMessageBatch() (and addMessage(), which just calls the batch fn) are used almost exclusively by test code. The only non-test use I could find is in the Feed reader (in FeedItem.jsm). But even there I don't see it causing a crash...

Flags: needinfo?(benc)

Digging deeper into the crash callstack, it looks to me like the crash occurs when nsLocalMailFolder::EndMessage(int) calls FinishNewLocalMessage() here:
https://searchfox.org/comm-central/source/mailnews/local/src/nsLocalMailFolder.cpp#2404

The copyState ptr is there (otherwise it would crash earlier) so I think it's crashing inside FinishNewLocalMessage(). Doesn't show up on the call stack, but it's a small function and I think it's probably being inlined.

So my money is on one of the fields in copyState being unexpectedly null. This doesn't surprise me too much - copyState is a very convoluted structure, and has loads of fields which are used in some types of copies but not others and it's all very unclear and brittle.

I think the next step would be to add in some debug code which checks that the required fields in copyState are non-null, and prints a warning identifying which ones are the offenders (just before crashing).
Wayne, I can work one up for you if you'd like, although I think it'd be a custom build (via try server maybe)? Not something I'd be aiming to land.

Flags: needinfo?(vseerror)

Yes, I'm game. Beta compatible preferred.

Oddly, no crashes in last 24 hours. That previous one late last night has a different signature [@ mozilla::dom::quota::(anonymous namespace)::PrincipalVerifier::Run ] ... Bug 1674954 - Thunderbird startup Crash in [@ mozilla::dom::quota::(anonymous namespace)::PrincipalVerifier::Run] via DoExecuteNextTaskOnlyMainThreadInternal

Flags: needinfo?(vseerror) → needinfo?(benc)

OK. I wasn't able to figure out anything that would be more suitable for mainstream landing, so it's a nasty hack instead :-)
The printf is there because I want to know which pointers are missing - there might be more than one.
It'll then assert if any of the pointers are null, even in a release build.

What's the easiest way to get this into a build for you, Wayne? I can kick off a try build with it? If so, what platform(s) would you like, and do you want both debug and release?

I did try to test this patch out manually (by nulling some of the pointers just before my hack)... but it turns out this code path isn't what I expected. I figured it'd be used for any copy into a local folder, but I couldn't seem to get that to happen. I think it might be for filter-triggered moves? Automatically moving a message from IMAP to local?

In any case I can dig into it more, but I figured it'd be good to get this patch to you to get a better idea of what in the copyState is actually broken...

Flags: needinfo?(benc) → needinfo?(vseerror)

Try build for Mac would be good. Debug is fine

Flags: needinfo?(vseerror) → needinfo?(benc)
See Also: → 1674954
Flags: needinfo?(benc)

Ben, did you have an idea that involved a code change, that I can test via debug try build? Or, will any old debug build do?

Flags: needinfo?(benc)

Sorry about that Wayne - had a few false starts at getting the try server to actually produce a mac build!
I think this one did it:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=33f2fb95e655201a0da9404a3afef9410128c935
Does that work for you?

Either release or debug should be fine.
The patch has a separate assert for each possibly-borked field, so hopefully should narrow it down.
There's also a printf beforehand which will show the values of all the fields I'm interested in (there could be more than one null, and the assert will only catch the first one). But I'm not sure what hoops you need to go through to get stdout going somewhere useful on the mac. I wouldn't worry too much about stdout - let's see if we can get any asserts to trigger first.

Good luck!

Flags: needinfo?(benc)
Flags: needinfo?(vseerror)

Debug build constantly crashes. Even in safe mode. I haven't figured out how to get stdout.

Flags: needinfo?(vseerror)

I am unable to make this crash on Windows using the 32bit artifact. It has a couple newsgroups. But it doesn't have a primary password set. (windows 32bit debug artifact flat out fails to start)

I'm wondering whether primary password on Mac has an impact. Some of the Mac crashes
https://crash-stats.thunderbird.net/report/bp-3f6efa65-1361-4a5a-8a9d-44fbc1220216
https://crash-stats.thunderbird.net/report/bp-e423fde6-f8b6-4275-9fa2-b8dc11220216
https://crash-stats.thunderbird.net/report/bp-4dea2a45-8915-44ba-a122-86c9d1220216

Flags: needinfo?(benc)

I tried one more startup not in safe mode, with mail.server.default.fetch_by_chunks set to false

Last entries

2022-02-16 19:38:42.843176 UTC - [Parent 77841: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=14b7ecc00, timeout=100s
2022-02-16 19:38:42.843178 UTC - [Parent 77841: Socket Thread]: D/nsSocketTransport remains 99s
2022-02-16 19:38:42.843180 UTC - [Parent 77841: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=148a8e400, timeout=65535s
2022-02-16 19:38:42.843182 UTC - [Parent 77841: Socket Thread]: D/nsSocketTransport not engaged
2022-02-16 19:38:42.843183 UTC - [Parent 77841: Socket Thread]: D/nsSocketTransport poll timeout: 99
2022-02-16 19:38:42.843185 UTC - [Parent 77841: Socket Thread]: D/nsSocketTransport timeout = 99873 milliseconds
JavaScript warning: resource://gre/modules/Troubleshoot.jsm, line 703: WebGL context was lost.
JavaScript warning: resource://gre/modules/Troubleshoot.jsm, line 703: WebGL context was lost.
2022-02-16 19:38:42.969318 UTC - [Parent 77841: Main Thread]: D/nsStreamPump nsInputStreamPump::OnInputStreamReady [this=14ecf9530]
2022-02-16 19:38:42.969340 UTC - [Parent 77841: Main Thread]: D/nsStreamPump OnStateStart [this=14ecf9530]
2022-02-16 19:38:42.969354 UTC - [Parent 77841: Main Thread]: D/nsStreamPump OnStateTransfer [this=14ecf9530]
2022-02-16 19:38:42.969358 UTC - [Parent 77841: Main Thread]: D/nsStreamPump Available returned [stream=13c3bbf70 rv=0 avail=575]
2022-02-16 19:38:42.969361 UTC - [Parent 77841: Main Thread]: D/nsStreamPump calling OnDataAvailable [offset=0 count=575(575)]
2022-02-16 19:38:42.969404 UTC - [Parent 77841: Main Thread]: D/nsStreamPump OnStateStop [this=14ecf9530 status=0]
2022-02-16 19:38:42.969612 UTC - [Parent 77841: Main Thread]: D/nsStreamPump nsInputStreamPump::OnInputStreamReady [this=14d43bd50]
2022-02-16 19:38:42.969616 UTC - [Parent 77841: Main Thread]: D/nsStreamPump OnStateStart [this=14d43bd50]
2022-02-16 19:38:42.969673 UTC - [Parent 77841: Main Thread]: D/nsStreamPump OnStateTransfer [this=14d43bd50]
2022-02-16 19:38:42.969678 UTC - [Parent 77841: Main Thread]: D/nsStreamPump Available returned [stream=13c3bb0d0 rv=0 avail=1041]
2022-02-16 19:38:42.969680 UTC - [Parent 77841: Main Thread]: D/nsStreamPump calling OnDataAvailable [offset=0 count=1041(1041)]
2022-02-16 19:38:42.969691 UTC - [Parent 77841: Main Thread]: D/nsStreamPump OnStateStop [this=14d43bd50 status=0]
Exiting due to channel error.

That crash report unfortunately did not submit successfully https://crash-stats.thunderbird.net/report/bp-c7672507-0bc0-48c9-a9e2-a4fcb1220216

(In reply to Wayne Mery (:wsmwk) from comment #23)

...
That crash report unfortunately did not submit successfully https://crash-stats.thunderbird.net/report/bp-c7672507-0bc0-48c9-a9e2-a4fcb1220216

https://github.com/thundernest/crash-diverter/issues/11

Ben,
jrmuizel sez the printfs should just work.
Can you create a build that has an added printf in an unconditional code path so I can confirm that at least is working?

very very overdue, sorry, but I've tweaked the patch to print to stdout in local folder constructors and destructors (which is rather illuminating in itself - there's a lot to tidy up in folder creation!). There's also a printf at the beginning of nsMsgLocalMailFolder::EndMessage() so we can tell when it's heading into the danger zone...

try build here, which you should be able to grab from:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=3e11b0876f38246e100f516e4ed159494798db57

Frustratingly, I couldn't get it to enter nsMsgLocalMailFolder::EndMessage() by just copying messages about... I haven't investigated too deeply yet, but the copy code is pretty convoluted and there are short-circuit routes that avoid EndMessage(), so I think it's just bypassing it in the cases I've tried. Obviously your crash case is hitting EndMessage().
Anyway, you should definitely be able to pick up "nsMsgLocalMailFolder::nsMsgLocalMailFolder()" and "nsMsgLocalMailFolder::~nsMsgLocalMailFolder()" lines to confirm you're seeing stdout.
And from there you should see nsMsgLocalMailFolder::EndMessage() whenever it enters there... and if it gets that far, a bogus null pointer should trigger the asserts and diagnostic printfs!

Thanks Ben. I was never seriously worried :)

I haven't tested it yet. For the record, been a long time since I crashed, but I just did @ nsMsgLocalMailFolder::EndMessage https://crash-stats.thunderbird.net/report/bp-9e3479b4-f154-4009-9e9b-8222e1220328 (I thought it was going to be a calendar crash, because I had just brought up the open invitations list.)

Hmm.. interesting... looks like the problem is during a IMAP->local folder copy, and the local folder isn't expecting the EndMessage() call.
That copy mechanism is insanely convoluted, and there are so many places where things could go wrong (Bug 1731177), so it's not tooooo surprising to me that there are odd cases where things fall over. But hard to tell for sure what's going on without solid reproduction steps for this bug :-(

Flags: needinfo?(benc)
See Also: → 1731177
Blocks: 1757101

I sent Ben my logs from using the try build.

(Forgot to NI)

Flags: needinfo?(benc)
See Also: → 1743138, 818325

I had a string of crashes this morning after I restarted TB (I had been running fine for a few days with no restarts), but I don't know that it was this bug. Beta, daily, safe mode, didn't help.

I was desperate, so deleted foldercache.json, session.json, addons.json and the extensions directory. Next startup was good. I'm not sure what caused this, but the last two things I did of any consequence is install folderflags and glodaquilla addons. Some tabs were probably open - some search results, maybe a pdf or two. some messages.

Two of the crashes

THe universe can be strange - TB crashed two consecutive days, 1:03am and 1:04am, both unattended
bp-fd6529e4-b2a9-4baf-8198-7184b0220410 nsMsgLocalMailFolder::EndMessage
bp-207c54fd-e998-41fe-8fbc-866370220410

On my Mac, I just changed mail.password_protect_local_cache back to false, so let's see if that affects the crash rate - I shouldn't think so. But I've had it set to true for a good long while to test bug 1346268 Master password prompt twice (or three times) with mail.password_protect_local_cache back is true

Gah. Wayne: It's obviously crashing reasonably reliably for you, but I just can't find the printf diagnostics in the output log you emailed me.
We're definitely seeing the printf() output - as a check I added a printf() every time a local folder is created or destroyed, and so you can see lots of "nsMsgLocalMailFolder::nsMsgLocalMailFolder()" lines in the output you sent.
(lots of them - like 454! That's just a byproduct of the crazy way folders currently come into existance just now and will be refactored into something sane one day!)

But searching for "EndMessage" in that output doesn't come up with anything, so the diagnostic printfs I put into the crashing function aren't showing up. Maybe they're being buffered and the crash causes the buffered data to be discarded? In any case, I'm going to try again, using MOZ_LOG() instead of printf(). I'll kick off a try build for you to pick up soon.

(But I suspect the real solution is to fix Bug 1731177 and refactoring the message copy mechanisms. It's just waaay too complicated at the moment, and huge numbers of ways that things can go wrong...)

OK, new build is building:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=d98aa7721add7a9cc5d96d0371e86b315976cc3a

There's a new log, just for you :-)
MOZ_LOG="WAYNE:5"

I also added an extra line to log the url of the imap message being processed, which may or may not help.

Assignee: nobody → benc
Flags: needinfo?(benc)
Status: NEW → ASSIGNED
Flags: needinfo?(vseerror)

Crashed early this morning and sent logs. Opening Mac lid, so similar cirumstances to that reported in this bug, but different signature.
bp-703a9f70-11d7-4c70-85c5-a9f6e0220415 [@ EMPTY: no crashing thread identified; MissingThreadList ]

Flags: needinfo?(vseerror) → needinfo?(benc)

Summary of email conversation with Wayne:

He caught the following lines of stdout:

failed opening offline store for mailbox://nobody@Local%20Folders/foo/bar
bad mCopyState m_fileStream=0x0 m_newHdr=0x0 m_msgStore=0x12ea40700 m_parseMsgState=0x197a9ec00

The second line is the diagnostic one I added.
The first line is a printf that's been live since early 2012

What's happening is that a copy operation is trying to open the mbox file for "Local Folders/moz/foo/bar", but failing. But rather than aborting, it stampedes on even with a null file handle as a destination!

Turns out in this case that the "bar" mbox file has the wrong permissions, and TB cannot open it.
It should fail gracefully at that point and stop the operation.
The failing function is nsMsgBrkMBoxStore::GetNewMsgOutputStream(), so there's somewhere which is not correctly handling the return code.

Flags: needinfo?(benc)

That patch hits all the places I could see where the error wasn't being checked. Hopefully it'll improve things.
I couldn't actually replicate the crash locally, but I suspect that might be because I'm running debug builds...

In any case, here's a try build with both the fix, and my hacky diagnostic printf patch included:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=388e710ad18b71953a24cdd4373dc97964d25168

That patch might help. Hard to say for sure as I can't replicate it properly here. So I'll leave the bug open for a bit and Wayne can see how it goes for him.

Target Milestone: --- → 101 Branch

Pushed by alessandro@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/7d1d815986ff
Add some missing error checking in message copy code. r=mkmelin

(In reply to Ben Campbell from comment #40)

That patch hits all the places I could see where the error wasn't being checked. Hopefully it'll improve things.
I couldn't actually replicate the crash locally, but I suspect that might be because I'm running debug builds...

In any case, here's a try build with both the fix, and my hacky diagnostic printf patch included:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=388e710ad18b71953a24cdd4373dc97964d25168

Crashed pretty quickly with the new build (after also crashing with the previous build when opening the lid)
EMPTY: no crashing thread identified; MissingThreadList
bp-41cc3dbc-b2f8-4104-9fa7-a10b20220422

But the second startup with the new build is working for 15 minutes.

Flags: needinfo?(benc)
Blocks: 1674954
See Also: 1674954

Ben, A crash from last night nsMsgLocalMailFolder::EndMessage bp-e0830f1f-5ce3-4d9e-90e3-88a720220512 - buildid 20220502202835 which is 101.0b1

OnCopyCompleted will null out mCopyState. I would suspect due to the imap thread, that happens here and nsMsgLocalMailFolder::EndMessage crashes on null mCopyState

new signature
nsImapIncomingServer::DiscoveryDone - 102.0b1 didn't start in safe mode - bp-678ca790-04cf-4348-83b8-e52a20220601
0 XUL nsImapIncomingServer::DiscoveryDone() mailnews/imap/src/nsImapIncomingServer.cpp:1429
1 XUL (anonymous namespace)::SyncRunnable0<nsIImapServerSink>::Run() mailnews/imap/src/nsSyncRunnableHelpers.cpp:78
2 XUL mozilla::RunnableTask::Run() xpcom/threads/TaskController.cpp:475
3 XUL mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) xpcom/threads/TaskController.cpp:788

I think that's something else. Filed bug 1772064.

Flags: needinfo?(benc)

I'm still crashing, actually more frequently.
bp-f5452412-ab38-47cc-a84f-9a26d0220708
bp-fcea0a41-2029-412f-90c3-f9af70220708
bp-750d7e41-105e-4123-9c2f-115cd0220708

Certainly could be imap threads. And I'm virtually certain that Filters is involved, which I know exposes the fragility of imap threads. Maybe that's where we should focus?

Are more logs, or further enhanced logging code needed?
MOZ_LOG="WAYNE:5"

Flags: needinfo?(mkmelin+mozilla)
Flags: needinfo?(benc)

The crash stack doesn't make sense. https://hg.mozilla.org/releases/comm-beta/file/tip/mailnews/local/src/nsLocalMailFolder.cpp#l2399 can't crash. Agreed it's very very likely some imap threading issue

Flags: needinfo?(mkmelin+mozilla)

A big spike on August 2 with 102.1.0, but otherwise the crash signature is rare

Severity: S2 → S3

(In reply to Ben Campbell from comment #15)

OK. I wasn't able to figure out anything that would be more suitable for mainstream landing, so it's a nasty hack instead :-)
The printf is there because I want to know which pointers are missing - there might be more than one.
It'll then assert if any of the pointers are null, even in a release build.

I did try to test this patch out manually (by nulling some of the pointers just before my hack)... but it turns out this code path isn't what I expected. I figured it'd be used for any copy into a local folder, but I couldn't seem to get that to happen. I think it might be for filter-triggered moves? Automatically moving a message from IMAP to local?

Ben,

I disabled my filters for a couple days (renamed msgfilterrules.dat) and no crashes. So yes, definitely filter related.

I do have filters that move messages to local folders. But I'm in disbelief that every crash coincides with a local folder move - I don't think I have that many incoming messages that would match such a filter.

Summary: Crash getting mail [@ nsMsgLocalMailFolder::EndMessage ] → Crash getting mail [@ nsMsgLocalMailFolder::EndMessage ] with message filter copy/move

Thanks Wayne - good to confirm it's filter-related!
I'm not looking directly at this right now, but I am picking my way through a lot of copy-related code at the moment, and I do have a bunch of failing filter-related unit tests (as part of my ever-growing uber patch set to address Bug 1719121) ,so I'll be looking at how filter-initiated copying works soon.

Flags: needinfo?(benc)
See Also: → 138208

(In reply to Ben Campbell from comment #12)

Digging deeper into the crash callstack, it looks to me like the crash occurs when nsLocalMailFolder::EndMessage(int) calls FinishNewLocalMessage() here:
https://searchfox.org/comm-central/source/mailnews/local/src/nsLocalMailFolder.cpp#2404

The copyState ptr is there (otherwise it would crash earlier) so I think it's crashing inside FinishNewLocalMessage(). Doesn't show up on the call stack, but it's a small function and I think it's probably being inlined.

So my money is on one of the fields in copyState being unexpectedly null. This doesn't surprise me too much - copyState is a very convoluted structure, and has loads of fields which are used in some types of copies but not others and it's all very unclear and brittle.

Speaking of which, FinishNewLocalMessage is the top of stack for bp-584e1a49-2c9d-496e-b5d5-63da90220928 I had a couple hours ago with 106.0b2

I don't know why, but I am crashing more frequently the last couple days - several times a day. I still think the crashes correlate to filters, but I wonder if the increase is because I unpaused my RSS feeds account - it has been paused for months. bp-3a90ce88-56af-4555-b9bc-2784f0220928 just an hour ago.

Very glad to read Benc is unravelling the backend spaghetti.

I have many filters which rely on bugzilla X headers, for example X-Bugzilla-Component: Search. Most users won't have such filters on their system. Perhaps this makes my crashes unique?

I have six crashes with new signature FinishNewLocalMessage which started with beta 106, and so far looks like I am the only one - they are all mine
https://crash-stats.mozilla.org/signature/?product=Thunderbird&signature=nsMsgLocalMailFolder%3A%3AFinishNewLocalMessage&date=%3E%3D2022-03-28T14%3A19%3A00.000Z&date=%3C2022-09-28T14%3A19%3A00.000Z&_columns=date&_columns=version&_columns=build_id&_columns=reason&_columns=address&_columns=install_time&_columns=platform_version&_sort=-date&page=1

Perhaps related? PL_strncpy | PL_strncat | nsParseMailMessageState::GetAggregateHeader bp-0bda6091-cb9a-44f5-8543-703de0220922 from the much older bug 1581079

Crash Signature: [@ nsMsgLocalMailFolder::EndMessage ] → [@ nsMsgLocalMailFolder::EndMessage ] [@ nsMsgLocalMailFolder::FinishNewLocalMessage ]
See Also: → 871622

(In reply to Wayne Mery (:wsmwk) from comment #54)

(In reply to Ben Campbell from comment #12)
[snip]

The copyState ptr is there (otherwise it would crash earlier) so I think it's crashing inside FinishNewLocalMessage(). Doesn't show up on the call stack, but it's a small function and I think it's probably being inlined.

So my money is on one of the fields in copyState being unexpectedly null. This doesn't surprise me too much - copyState is a very convoluted structure, and has loads of fields which are used in some types of copies but not others and it's all very unclear and brittle.

Speaking of which, FinishNewLocalMessage is the top of stack for bp-584e1a49-2c9d-496e-b5d5-63da90220928 I had a couple hours ago with 106.0b2

Ooh, that's a nice, crunchy crash report! It's the localfolder mCopyState->m_fileStream which is null there, and definitely shouldn't be. unfortunately the stack doesn't give us any clues on what happened before that (yay, async code!)... but it definitely seems like the classic case that should be caught by asserts, so I'll have a poke through and try and add some in in sensible places...

should be caught by asserts, so I'll have a poke through and try and add some in in sensible places...

ready freddy? :)

Flags: needinfo?(benc)

Sorry it took so long to get back to this!
I got totally bogged down trying to trace my way through the message copying code. In the end I couldn't figure out which paths it was taking to end up where it did, so I've got rather a sticking-plaster patch.
I'm hoping this'll fix the crash, even if it doesn't fix the underlying issue (the over-complexity of message copying, which really just needs massive refactoring and simplification).

Flags: needinfo?(benc)

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/0148a01c1be8
Allow for null fileStream in nsMsgLocalMailFolder::EndMessage(). r=mkmelin

(In reply to Ben Campbell from comment #59)

I'm hoping this'll fix the crash, even if it doesn't fix the underlying issue (the over-complexity of message copying, which really just needs massive refactoring and simplification).

The patch appears to have no effect on beta:

Other crash signatures:

Blocks: 818325
Depends on: 1731177
See Also: 1731177, 818325
Crash Signature: [@ nsMsgLocalMailFolder::EndMessage ] [@ nsMsgLocalMailFolder::FinishNewLocalMessage ] → [@ nsMsgLocalMailFolder::EndMessage ] [@ nsMsgLocalMailFolder::FinishNewLocalMessage ] [@ nsMsgCopyService::ClearRequest ]

I'm getting bitten by this crash in [@ nsMsgLocalMailFolder::FinishNewLocalMessage ], this time it's m_msgStore that's null.

There's a way to consistently repro, by changing the number of messages generated in the test from D183880 (browser_archive.js:35) from 2 to 5. The crash happens when trying to archive messages (browser_archive.js:65).

From spending some time investigating this specific occurrence of the crash, it looks like when copying the 3rd message, the call to nsMsgLocalMailFolder::InitCopyMsgHdrAndFileStream fails with this warning:

[Parent 13229, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, (nsresult)(((uint32_t)(1) << 31) | ((uint32_t)(16 + 69) << 16) | ((uint32_t)(22)))) failed with result 0x80004005 (NS_ERROR_FAILURE): file /[...]/mozilla-central/comm/mailnews/base/src/nsMsgDBFolder.cpp:731

The interesting bit of backtrace from the nsMsgLocalMailFolder::InitCopyMsgHdrAndFileStream that generates this warning (as per gdb) is:

#0  nsMsgLocalMailFolder::InitCopyMsgHdrAndFileStream() (this=0x7fffabc9d100) at /[...]/mozilla-central/comm/mailnews/local/src/nsLocalMailFolder.cpp:1890
#1  0x00007fffed8cfdd0 in nsMsgLocalMailFolder::StartMessage() (this=0x0) at /[...]/mozilla-central/comm/mailnews/local/src/nsLocalMailFolder.cpp:2435
#2  0x00007fffed8da3c8 in nsMailboxProtocol::OnStopRequest(nsIRequest*, nsresult) (this=0x7fffa7d144c0, request=0x7fffa8182c50, aStatus=nsresult::NS_OK)
    at /[...]/mozilla-central/comm/mailnews/local/src/nsMailboxProtocol.cpp:246
#3  0x00007fffe86e3a10 in nsInputStreamPump::OnStateStop() (this=this@entry=0x7fffa8182c50) at /[...]/mozilla-central/netwerk/base/nsInputStreamPump.cpp:695
#4  0x00007fffe86e2e6a in nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*) (this=0x7fffa8182c50, stream=<optimized out>) at /[...]/mozilla-central/netwerk/base/nsInputStreamPump.cpp:415
#5  0x00007fffe86e3bc5 in non-virtual thunk to nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*) () at /[...]/mozilla-central/netwerk/base/nsInputStreamPump.cpp:671

After this call, mCopyState->m_msgStore becomes null.

So from the look of it, something is causing retrieving the server to fail, which in turns causes m_msgStore (and maybe a few other bits) to become null (m_msgStore was definitely not null before this).

Does Brendan's comment 63 further inform the refactoring work? Or does it, "something is causing retrieving the server to fail", suggest a standalone patch or more investigation needed?

Flags: needinfo?(benc)

OK, I think I've figured out Brendan's crash in comment 63, and I'm not totally sure it's the same issue, although it definitely gives me some ideas.

./mach mochitest --headless comm/mail/base/test/browser/browser_archive.js

I'll spare the gory details (soooooo much printf debugging!), but the upshot is that the test deletes (archives) a thread, then clicks "undo". It then waits until it sees an item appear in back the thread tree.
Buuuuut... if you augment the test to have more messages in the thread, the undo (really a message copy) operation is still running when the test end conditions are met (at least one message has been re-instated, but the rest are still in-progress). The test then ends, and the test cleanup code is run. And the cleanup code deletes the account being used for the test - which includes the incomingserver and its msgStore. So it's no wonder we get catastrophic failure :-)

As a side note, the existing code in browser_archive.js uses a thread with 2 messages in it. I'd kind of expect that to suffer the same issue (the cleanup happening before the undo operation is complete).
But maybe with 2 messages it completes quickly enough that the timing doesn't become an issue?
In any case, I think it'd be worth seeing if that test can be changed to ensure all the messages have been copied back before finishing the test. The line in question is:
https://searchfox.org/comm-central/source/mail/base/test/browser/browser_archive.js#93

It seems like this'd be a prime "intermittant failure" kind of thing...

Brendon: I'm not sure if you dug far into this test, just NIing you on the off chance you know all about it :-)
Geoff: any opinion on this test? Seems to me that it should wait until all the messages of the thread re-appear in the gui, but I've no idea how you'd go about doing that...

Flags: needinfo?(geoff)
Flags: needinfo?(brendan)
Flags: needinfo?(benc)

Following on from comment 65, picking up the bits that might be relevant to this bug:

The folder implementations only hold a weak reference to the nsIMsgIncomingServer they belong to.
This means that a server can be deleted while a copy operation is still in progress, and the copy operation will go extremely pear-shaped soon afterwards.
This is nasty and brittle and I could see it causing all kinds of icky problems. Like this bug, maybe.

So one theory is that perhaps a copy operation is still in flight when a shutdown is happening. But I don't think that matches the conditions in the original reports of this bug - comment 4 for example doesn't mention anything about shutting down...
So I think this is a red herring (albeit a potentially explosive red herring :- )

Yes, that test is only checking for the existence of one row in the message list, and it really wait for the same number of rows as it had before. Just changing the getRowAtIndex argument would do the trick. Would the crash still happen? I don't know, but if we're moving on too early and deleting the server because the test is finished that's a likely reason it happens.

Flags: needinfo?(geoff)

It's a thread tree, so I don't know if the getRowAtIndex() includes nested messages or not, or only looks at the thread roots...
In any case, it's out-of-scope here, so I've split it off to Bug 1852480

I've replied on the new bug.

Flags: needinfo?(brendan)

bp-f2a78da0-0e7b-4359-bc3a-b91150240429 115.10.1 user is deleting a duplicate message, whose "size is different."

OS: Unspecified → All
Summary: Crash getting mail [@ nsMsgLocalMailFolder::EndMessage ] with message filter copy/move → Crash getting mail [@ nsMsgLocalMailFolder::EndMessage ] with message filter copy/move, because server can be deleted while a copy operation is still in progress.
See Also: → 1942876
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: