Closed Bug 1586494 Opened 5 years ago Closed 3 years ago

Crash in [@ NS_CycleCollectorSuspect3] from nsImapMockChannel.

Categories

(MailNews Core :: Networking: IMAP, defect, P1)

Unspecified
All
defect

Tracking

(thunderbird_esr78+ fixed, thunderbird84 fixed)

RESOLVED FIXED
85 Branch
Tracking Status
thunderbird_esr78 + fixed
thunderbird84 --- fixed

People

(Reporter: emilio, Assigned: benc)

References

(Depends on 1 open bug)

Details

(Keywords: crash, topcrash-thunderbird)

Crash Data

Attachments

(7 files, 1 obsolete file)

This bug is for crash report bp-6e4e9c87-b863-4b7d-812e-82d560190830.

Top 10 frames of crashing thread:

0 libxul.so NS_CycleCollectorSuspect3 xpcom/base/nsCycleCollector.cpp:3763
1 libxul.so <name omitted> uriloader/base/nsDocLoader.cpp:171
2 libxul.so nsImapMockChannel::~nsImapMockChannel comm/mailnews/imap/src/nsImapProtocol.cpp:8480
3 libxul.so nsImapMockChannel::~nsImapMockChannel comm/mailnews/imap/src/nsImapProtocol.cpp:8473
4 libxul.so <name omitted> comm/mailnews/imap/src/nsImapProtocol.cpp:8460
5 libxul.so nsImapProtocol::RetryUrl comm/mailnews/imap/src/nsImapProtocol.cpp:1896
6 libxul.so nsImapProtocol::ProcessCurrentURL comm/mailnews/imap/src/nsImapProtocol.cpp
7 libxul.so nsImapProtocol::ImapThreadMainLoop comm/mailnews/imap/src/nsImapProtocol.cpp:1420
8 libxul.so nsImapProtocol::Run comm/mailnews/imap/src/nsImapProtocol.cpp:1103
9 libxul.so non-virtual thunk to nsImapProtocol::Run comm/mailnews/imap/src/nsImapProtocol.cpp

Summary: Crash in [@ NS_CycleCollectorSuspect3] → Crash in [@ NS_CycleCollectorSuspect3] from nsImapMockChannel.

I've been crashing on imap code two or three times. It wouldn't be fun if these
assertions were the culprit, but an NS_WARNING_ASSERTION is not going to find
out.

So make these MOZ_DIAGNOSTIC_ASSERT.

Hmm, looking at the crash data, there are mostly crashes in TB 68.1. here:
0 xul.dll NS_CycleCollectorSuspect3 xpcom/base/nsCycleCollector.cpp:3761 context
1 xul.dll CompositeDataSourceImpl::Release() comm/rdf/base/nsCompositeDataSource.cpp:515 cfi
2 xul.dll nsImapMailCopyState::~nsImapMailCopyState() comm/mailnews/imap/src/nsImapMailFolder.cpp:7507 cfi

We've finally killed RDF, so that code path no longer exists. Let's see how that affects the crash situation.

Keywords: leave-open

Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/cda2408c98e8
Strengthen threading assertions in IMAP code. r=jorgk

Keywords: checkin-needed

(In reply to Jorg K (GMT+2) from comment #2)

Hmm, looking at the crash data, there are mostly crashes in TB 68.1. here:
0 xul.dll NS_CycleCollectorSuspect3 xpcom/base/nsCycleCollector.cpp:3761 context
1 xul.dll CompositeDataSourceImpl::Release() comm/rdf/base/nsCompositeDataSource.cpp:515 cfi
2 xul.dll nsImapMailCopyState::~nsImapMailCopyState() comm/mailnews/imap/src/nsImapMailFolder.cpp:7507 cfi

I don't want to discourage in any way the investigation but I want to bring to your attention:

So what is interesting about nsImapMockChannel, are there 71.0a1 crashes, and do you actually find an increase correlating to nsImapMockChannel?

Component: General → Networking: IMAP
Product: Thunderbird → MailNews Core
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/58c46e562118
Follow-up: Reformat. rs=reformat

Here's a crash id from a minute ago, crashing on this assertion: bp-345ec887-e60b-4e6f-ba56-0485b0191009

https://searchfox.org/comm-central/rev/becdd368fb58a470c60d247e235ca07da787f694/mailnews/imap/src/nsImapProtocol.cpp#8467 :-(

I wonder why this matters in the destructor. But we could proxy to the main thread. Maybe we should file a new bug for this crash since bug is about another crash, unless I misunderstand something.

Emilio, any STR that provoke this crash?

Flags: needinfo?(benc)

(In reply to Jorg K (GMT+2) from comment #8)

https://searchfox.org/comm-central/rev/becdd368fb58a470c60d247e235ca07da787f694/mailnews/imap/src/nsImapProtocol.cpp#8467 :-(

I wonder why this matters in the destructor. But we could proxy to the main thread. Maybe we should file a new bug for this crash since bug is about another crash, unless I misunderstand something.

I don't think it's just a matter of proxying to the main thread, necessarily... There's some code around that doesn't make much sense to me. I'm not familiar with it of course but:

https://searchfox.org/comm-central/rev/becdd368fb58a470c60d247e235ca07da787f694/mailnews/imap/src/nsImapProtocol.cpp#1298 is in a function which asserts that it is called off the main thread. But it calls a main-thread-only function, unless I'm holding it wrong:

https://searchfox.org/comm-central/rev/becdd368fb58a470c60d247e235ca07da787f694/mailnews/imap/src/nsImapProtocol.cpp#1138

That code looks somewhat broken.

Emilio, any STR that provoke this crash?

Nothing reliable I fear :(

Yep, as Emilio says, the code looks very confused as to what it expects to be running on :-(

There is a comment in the imap thread cleanup:

 // Release protocol object on the main thread to avoid destruction of 'this'
 // on the IMAP thread, which causes grief for weak references.

https://searchfox.org/comm-central/rev/becdd368fb58a470c60d247e235ca07da787f694/mailnews/imap/src/nsImapProtocol.cpp#1127

Maybe a similar motivation for the mockchannel destruction?

The stack frame for the latest crash capture bp-0bd5b09f-3485-4643-a6ef-2f0ac0191010:

nsImapMockChannel::~nsImapMockChannel() 	comm/mailnews/imap/src/nsImapProtocol.cpp:8467  (ASSERT)
nsImapMockChannel::~nsImapMockChannel() 	comm/mailnews/imap/src/nsImapProtocol.cpp:8463 	
<name omitted> 	comm/mailnews/imap/src/nsImapProtocol.cpp:8450 	
nsImapProtocol::RetryUrl() 	comm/mailnews/imap/src/nsImapProtocol.cpp:1894
nsImapProtocol::ProcessCurrentURL() 	comm/mailnews/imap/src/nsImapProtocol.cpp:0
nsImapProtocol::ImapThreadMainLoop() 	comm/mailnews/imap/src/nsImapProtocol.cpp:1418 	
...

It appears to me that the dtor is being called when RetryUrl() exits, and it's local saveMockChannel goes out of scope.
(which, indeed, happens in the imap thread, not the main thread).

RetryUrl():
https://searchfox.org/comm-central/rev/becdd368fb58a470c60d247e235ca07da787f694/mailnews/imap/src/nsImapProtocol.cpp#1881

If the saveMockChannel release inside RetryUrl() can be deferred to the main thread, maybe that'd help? But it seems a bit of a nasty band-aid fix without really understanding what's going on...

(Either way, I think the proper approach - for this and other IMAP bugs - is to sit down and pick through, mapping out all the IMAP code and figure out just what's happening (and on which threads), and if it's sane. Which is on my TODO list, but it's not a small job :-(

Flags: needinfo?(benc)

Uff, that call to main tread nsImapProtocol::CloseStreams() from a non-main thread function nsImapProtocol::TellThreadToDie() is really confused. The former only MOZ_ASSERTS(), but I've never seen it crash there in a debug build, so I guess that code protected by if (m_imapProtocolSink) never runs.

As for saveMockChannel going out of scope in RetryUrl() on the IMAP thread (which I haven't) checked, how about adding
NS_ReleaseOnMainThreadSystemGroup("nsImapProtocol::RetryUrl", saveMockChannel .forget());since this is how we destroy it in other places like:
https://searchfox.org/comm-central/rev/29706036071c4629c2b44512d1acecb64008cc46/mailnews/imap/src/nsImapProtocol.cpp#1011

At least that will fix Emilio's crash. Ben, can you do a patch with that. I don't want to rip off your research and it's just a two-liner.

Flags: needinfo?(benc)

Hmm. There's a patch which might address that specific crash...
But I'm really unhappy with it. There are lots of other places in RetryUrl() which should also cause an assert
eg:

  if (m_imapServerSink)
    (void)m_imapServerSink->PrepareToRetryUrl(kungFuGripImapUrl,
                                              getter_AddRefs(saveMockChannel));

PrepareToRetryUrl() calls nsImapUrl::GetMockChannel() which has a similar thread assert in it.
It's not called if m_imapServerSink is null... but then saveMockChannel isn't set either. It's a rabbithole.

I'm going to leave the NI on it, and next week I'll try and sit down and map out all the IMAP code and get some kind of understanding of it. Without that I'm just looking at an elephant through a microscope :-)

IOW: For saveMockChannel to be set PrepareToRetryUrl()
https://searchfox.org/comm-central/rev/29706036071c4629c2b44512d1acecb64008cc46/mailnews/imap/src/nsImapIncomingServer.cpp#448
and GetMockChannel() would need to run and that would crash before the channel is even set if run on the IMAP thread:
https://searchfox.org/comm-central/rev/29706036071c4629c2b44512d1acecb64008cc46/mailnews/imap/src/nsImapUrl.cpp#984

So in conclusion: The patch can't really help. It's crazy since the crash is at the end of RetryUrl() where a mock channel is being destroyed that can't even exist.

See Also: → 1132339
OS: Linux → All
See Also: → 1581075
Version: unspecified → Trunk
Flags: needinfo?(benc)
Blocks: 1517464

(bringing assignment from bug 1586494. note also bug 1175168 comment 7 )

Assignee: nobody → benc
Status: NEW → ASSIGNED
See Also: → 1616581

Magnus, as soon as we can manage it, I'd suggest we have Ben spend significant additional time to slay this, because we are approaching 1% of users affected - based on stats of 77k - 92k users affected in the past month [1] and a baseline of 10M users measured [2]. And this is just one crash signature - there are surely other signatures.

[1] https://crash-stats.mozilla.org/signature/?signature=NS_CycleCollectorSuspect3&date=%3E%3D2020-02-10T11%3A14%3A00.000Z&date=%3C2020-03-10T11%3A14%3A00.000Z
[2] https://stats.thunderbird.net/#default

Yeah we should see if we can track this down. Ben is going to spend some time on imap threading (bug 1618934). If we're lucky that could fix things. Likely not, but should give a better understanding and hopefully enough clues to figure this out.

Priority: -- → P1
Depends on: 1618934

FYI, got a crash again in 74.0b2 (32-bit)...
bp-0dc515a3-75af-4a2e-bce3-6e4a90200312
I was not even using the app when it happened... just crashed by itself in the background...

OK, so the patch I proposed back in comment #12 looked plausible, except for the fact that Jorg and I both thought it was completely bonkers and could never possibly work.

The Short Version:

I've changed my mind and think it could help. I've got a try run going here:

https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=6c9481e9ef055152f3e14800101c3102acf85a13

If no problems, we just apply the existing patch as is.

The Long Version:

(mainly just to leave myself a trail of breadcrumbs!)
This is the code in nsImapProtocol::RetryUrl() that had Jorg (comment #14) and I both scratching our heads:
(In reply to Ben Campbell from comment #13)

  if (m_imapServerSink)
    (void)m_imapServerSink->PrepareToRetryUrl(kungFuGripImapUrl,
                                              getter_AddRefs(saveMockChannel));

PrepareToRetryUrl() will assert if called on any thread other than the main thread. And the whole problem here is that we're asserting because saveMockChannel is being deleted on the IMAP thread. So the patch I posted was entirely academic - the stack trace must have been wrong, because PrepareToRetryUrl() wasn't asserting, so we must be on the main thread. The patch would never work.

Except.

After poring through the IMAP code then coming back here and taking a fresh look, I spot that m_imapServerSink isn't actually an nsIMapServerSink. It is, in fact, a ImapServerSinkProxy. This uses the magic of macros to synchronously dispatch nsIMapServerSink
calls to the main thread (see https://searchfox.org/comm-central/source/mailnews/imap/src/nsSyncRunnableHelpers.cpp#419 ).

So it looks like it's being called on the IMAP thread, but it is actually doing the call on the main thread, blocking the IMAP thread until it returns.

Phew.

Doesn't appear to have broken anything (try build), so let's check it in.

Pushed by thunderbird@calypsoblue.org:
https://hg.mozilla.org/comm-central/rev/30c25adcb3d8
Workaround for mainthread assertion in nsImapMockChannel dtor. r=jorgk

Target Milestone: --- → Thunderbird 76.0

Hmm, I can't remember reviewing this. Anyway, there's a grammar error here:

  • // (this is a workaround to try and prevent a specific crash, and
  • // does nothing clarify the threading mess!)

There's a "to" missing, right?

May be this crash can be forced by saving a draft into a nearly full (or overquota) IMAP folder.
Small mails were okay, but a mail with "corona homeschooling homework" jpg attachements crashed.
(Yes, I know SMTP is not a file transfer protocol.)

FYI on Windows 10 (64 bit) home:

bp-34432a1c-ead1-4ca1-8b8b-3d83f0200508

TB 68.8.0 (32-Bit)
Thunderbird 68.8.0 Crash Report [@ NS_CycleCollectorSuspect3 ]

Crashing Thread (53)
Frame Module Signature Source Trust
0 xul.dll NS_CycleCollectorSuspect3 xpcom/base/nsCycleCollector.cpp:3761 context
1 xul.dll CompositeDataSourceImpl::Release() comm/rdf/base/nsCompositeDataSource.cpp:515 cfi
2 xul.dll nsImapMailCopyState::~nsImapMailCopyState() comm/mailnews/imap/src/nsImapMailFolder.cpp:7506 cfi
3 xul.dll void nsImapMailCopyState::~nsImapMailCopyState() comm/mailnews/imap/src/nsImapMailFolder.cpp:7496 cfi
4 xul.dll mozilla::TransceiverImpl::Release() comm/mailnews/imap/src/nsImapMailFolder.cpp:7508 cfi
5 xul.dll nsImapUrl::~nsImapUrl() comm/mailnews/imap/src/nsImapUrl.cpp:76 cfi
6 xul.dll [thunk]:nsImapUrl::vector deleting destructor'adjustor{8}' (unsigned int) cfi
7 xul.dll nsMsgMailNewsUrl::Release() comm/mailnews/base/util/nsMsgMailNewsUrl.cpp:81 frame_pointer
8 xul.dll nsImapUrl::Release() comm/mailnews/compose/src/nsSmtpUrl.cpp:583 cfi
9 xul.dll nsImapProtocol::~nsImapProtocol() comm/mailnews/imap/src/nsImapProtocol.cpp:662 cfi
10 xul.dll [thunk]:nsImapProtocol::vector deleting destructor'adjustor{24}' (unsigned int) cfi
11 xul.dll nsDSURIContentListener::Release() comm/mailnews/base/util/nsMsgProtocol.cpp:50 frame_pointer
12 xul.dll [thunk]:nsImapProtocol::Release`adjustor{4}' () cfi
13 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1222 frame_pointer
14 xul.dll NS_ProcessNextEvent(nsIThread*, bool) xpcom/threads/nsThreadUtils.cpp:486 cfi
15 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp:303 cfi
16 xul.dll MessageLoop::RunHandler() ipc/chromium/src/base/message_loop.cc:308 cfi
17 xul.dll MessageLoop::Run() ipc/chromium/src/base/message_loop.cc:290 cfi
18 xul.dll nsThread::ThreadFunc(void*) xpcom/threads/nsThread.cpp:454 cfi
19 nss3.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c:397 cfi
20 nss3.dll static unsigned int pr_root(void*) nsprpub/pr/src/md/windows/w95thred.c:137 cfi
21 ucrtbase.dll thread_start<unsigned int (__stdcall*)(void*), 1> cfi
22 kernel32.dll BaseThreadInitThunk cfi
23 mozglue.dll static void patched_BaseThreadInitThunk(int, void*, void*) mozglue/build/WindowsDllBlocklist.cpp:625 cfi
24 ntdll.dll _RtlUserThreadStart cfi
25 ntdll.dll _RtlUserThreadStart cfi

(In reply to Ben Campbell from comment #20)

...
So it looks like it's being called on the IMAP thread, but it is actually doing the call on the main thread, blocking the IMAP thread until it returns.

Phew.

Holy cow!

Ben, these graphs for beta for signatures of bug 1517464 look promising :

Even though the crash is not totally gone from betas

  • bp-b21bd9fb-517d-4e4a-9e97-8512f0200516 nsXPCWrappedJS::Release 77 beta
  • 642842a1-925d-482a-b623-bb2050200516 NS_CycleCollectorSuspect3 77 beta
    it has been on the beta channel since 76 without apparent ill effects, so I'd like to suggest we take the current patch on a point release for ESR to help us understand the effect on that channel - and we can be cautious about rolling it out. Unless you see a downside? What do you think?

[1] https://crash-stats.mozilla.org/signature/?product=Thunderbird&release_channel=beta&signature=xul.dll%20%7C%20_PR_NativeRunThread%20%7C%20pr_root&date=%3E%3D2020-02-16T15%3A55%3A00.000Z&date=%3C2020-05-16T15%3A55%3A00.000Z#graphs

Flags: needinfo?(benc)

Comment on attachment 9100325 [details]
Bug 1586494 - Workaround for mainthread assertion in nsImapMockChannel dtor. r=jorgk

it has been on the beta channel since 76 without apparent ill effects, so I'd like to suggest we take the current patch on a point release for ESR to help us understand the effect on that channel - and we can be cautious about rolling it out. Unless you see a downside? What do you think?

[Approval Request Comment]
Regression caused by (bug #): na
User impact if declined: many crashes
Testing completed (on c-c, etc.): c-b
Risk to taking this patch (and alternatives if risky): Magnus deems this to be low risk and ready to go

[Triage Comment]
Approved for ESR

Flags: needinfo?(benc)
Attachment #9100325 - Flags: approval-comm-esr68+

Should know in a couple more days the effect of this patch. So far, it is unclear because of the weekend and yesterday's Monday US holiday https://crash-stats.mozilla.org/signature/?release_channel=release&signature=NS_CycleCollectorSuspect3&date=%3E%3D2020-05-12T00%3A00%3A00.000Z&date=%3C2020-05-26T23%3A59%3A00.000Z#graphs

I hate to say it, but unlike beta it doesn't look like release has substantially changed. At best maybe 5-10%.
Unless for release, other crash signatures have been reduced.

I mean, that's probably explained by most of the crashes being a diagnostic assert isn't it? That isn't compiled in release mode.

I've been thrashing my way clumsily through the crash-stats, and it looks to me that the vast bulk of the crashes are in the 68.x versions.
(go onto the "Graphs" tab, click on "Crashes per day.." and add "version" to get the stats broken down by version).
78.x does show up, but in much smaller numbers.
Wayne: Is this just a reflection of vastly more people still running the 68.x series, or an actual improvement?
(I'm kind of guessing the former, but we can hope ;-)

A lot of crashes in earlier 68.x versions seemed to ultimately occur in the RDF code, which (I think) was removed at the point of transition to 78.x. So it'd be nice and neat if that was the culprit.
Except that it seems unlikely as a lot of later 68.x version signatures don't seem to have crashed in RDF code. sigh.

Flags: needinfo?(vseerror)

(In reply to Ben Campbell from comment #32)

I've been thrashing my way clumsily through the crash-stats, and it looks to me that the vast bulk of the crashes are in the 68.x versions.
(go onto the "Graphs" tab, click on "Crashes per day.." and add "version" to get the stats broken down by version).
78.x does show up, but in much smaller numbers.
Wayne: Is this just a reflection of vastly more people still running the 68.x series, or an actual improvement?
(I'm kind of guessing the former, but we can hope ;-)

Hard to say this early in the 78 cycle exactly where this signature is headed. We will know better when we hit 78.2 and we turn on updates. But yes, the crash numbers for 78 are affected by the relatively small number of users on that version. Approximately 6% on 78 vs 85% on 68 according to the top graph of https://stats.thunderbird.net/#version

To break down within the signature I prefer to look at the summary, order by version and then math - roughly 4% on 78.x vs 93% 68.x (and the rest other products)

Based on these two sets of numbers I'd say there is no significant change in the crash rate of 78.x.

Flags: needinfo?(vseerror)

Not sure if is affects the crash though it's on the crash stack, but what is this code supposed to do?
https://searchfox.org/comm-central/rev/f8f4a035ed094b3f70d054b1e2fa90bc9e79408b/mailnews/imap/src/nsImapMailFolder.cpp#7599-7603

It doesn't look like it's doing anything useful. And even less useful since it's in the DTOR.

(In reply to Magnus Melin [:mkmelin] from comment #34)

https://searchfox.org/comm-central/rev/f8f4a035ed094b3f70d054b1e2fa90bc9e79408b/mailnews/imap/src/nsImapMailFolder.cpp#7599-7603

It doesn't look like it's doing anything useful. And even less useful since it's in the DTOR.

Yep - seems pretty redundant to me! (It does call nsIMsgFolder.GetUriForMsg(), but there's only one implementation and it doesn't have any side-effects).

I don't think it'll have any bearing on the crash - the redundant code doesn't create any extra references held beyond the dtor scope.

Patch attached, try run here:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=4e25a85d6b23818364430948aa55ec8d5fc327a2

More generally,
There are a whole bunch of different classes which show up in the various crash logs. Without a replicable case (where the diagnostics could identify the cycle) I think it's just a matter of continuing to plod through all the member variables of the involved classes and trying to work out their lifetimes...
(It's about this point where someone suggests rewriting everything in rust ;- )

Attachment #9169007 - Flags: review?(mkmelin+mozilla)
Attachment #9169007 - Flags: review?(mkmelin+mozilla) → review+

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/e56a57fb95a6
Remove redundant code in nsImapMailCopyState dtor. r=mkmelin

To break down within the signature I prefer to look at the summary, order by version and then math - roughly 4% on 78.x vs 93% 68.x (and the rest other products)
Based on these two sets of numbers I'd say there is no significant change in the crash rate of 78.x.

Still the case.

(In reply to Robert Hartmann from comment #24)

May be this crash can be forced by saving a draft into a nearly full (or overquota) IMAP folder.
Small mails were okay, but a mail with "corona homeschooling homework" jpg attachements crashed.

Robert, are you saying you have a reproducible scenario?

Flags: needinfo?(Robert_Hartmann)

(In reply to Wayne Mery (:wsmwk) from comment #38)

(In reply to Robert Hartmann from comment #24)

May be this crash can be forced by saving a draft into a nearly full (or overquota) IMAP folder.
Small mails were okay, but a mail with "corona homeschooling homework" jpg attachements crashed.

Robert, are you saying you have a reproducible scenario?

Yes the scenario reproducible created crashes for me until I cleaned up may GMX-IMAP-Space... so from practical point of view I cannnot reproduce it ...

But let assume we can reproduce it without having to waste GB space on some mail servers...

We need a IMAP-Server which does not tell Thunderbird any quota information (like GMX-Server doesn't do that).
We need a test account on that server with a minimum of IMAP user space.
We should write a mail with attachment and save to draft or let copying the sent mail to sent-folder. That should be tested in two cases
a) using encryption while sending or saving the draft
b) using no encryption

I think than either the stack trace of this bugreport or the stack trace of the stack trace of other bug 1517464 comment 36 ( https://bugzilla.mozilla.org/show_bug.cgi?id=1517464#c36 ) will be reproducible with described secenario .

Flags: needinfo?(Robert_Hartmann)

Ben are you able to utilize Robert's comment 39?

Flags: needinfo?(benc)

(In reply to Wayne Mery (:wsmwk) from comment #41)

Ben are you able to utilize Robert's comment 39?

I haven't yet, but I plan to - it looks like an excellent lead...

Flags: needinfo?(benc)

This bug seems to have shifted a bit - I don't see any nsImapMockChannel involvement any more in the stack traces.

I haven't been able to replicate the exact crash yet, but my current working theory is that it happens when a shutdown occurs while an nsImapMailCopyState object is still outstanding - either because the app was shutdown mid-copy (so larger messages would be more prone to this), or because the copy is failing and the nsImapMailCopyState isn't being cleaned up (an IMAP server rejecting the copy for going over-quota, say...).
nsImapMailCopyState is used only for some kinds of copy - copying an email from a local folder into an IMAP folder, for instance.

There's a degree of speculation involved in this patch :-)

From the crash logs, it looks like it all falls apart because of an nsImapUrl which is holding onto a pointer to an nsImapMailCopyState object and being released the IMAP thread. (the nsImapUrl in turn is being held by the nsImapProtocol object which inhabits the IMAP thread).

I think it happens because some references (possibly cyclic) are being held on past the end of the copy operation. And this patch helps by disconnecting the nsImapMailCopyState from the nsImapUrl after a copy.

It's all a bit twisty - the copy-related code is fantastically convoluted. This patch it clears out the copyState attribute on the nsImapUrl, which is sometimes an nsImapMailCopyState, but sometimes not. The fix in a slightly ugly place, but it was the only place I was confident of being inside an operation where copyState is an nsImapMailCopyState and not something else!

try run here:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=1f8cfbfb63c3ceafd8d75b17e3c2a13f8aff5603

Attachment #9188703 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9188703 [details] [diff] [review]
1586494-release-nsImapMailCopyState-after-copy-1.patch

Review of attachment 9188703 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.  I think we should add the explanation you wrote into the patch description so it's always more easily accessible.
Attachment #9188703 - Flags: review?(mkmelin+mozilla) → review+

(In reply to Magnus Melin [:mkmelin] from comment #45)

Looks good. I think we should add the explanation you wrote into the patch
description so it's always more easily accessible.

Like this you mean?

Attachment #9188703 - Attachment is obsolete: true
Attachment #9189081 - Flags: review+

Yes. Once this lands lets close it for uplifting purposed in particular. We can clone it to another bug if the patch didn't fix it.

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/b4c0fee502df
Release nsImapMailCopyState directly after IMAP copy to fix shutdown crash. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: Thunderbird 76.0 → 85 Branch

Comment on attachment 9189081 [details] [diff] [review]
1586494-release-nsImapMailCopyState-after-copy-2.patch

[Approval Request Comment]
Regression caused by (bug #):
unsure

User impact if declined:
We need more crash logs to gauge the improvement (if any) provided by this fix.

Testing completed (on c-c, etc.):
No in-depth testing - we don't have a concrete set of steps to reproduce.

Risk to taking this patch (and alternatives if risky):
I think this is a low risk (it's a one-liner, clearing a pointer which is no longer required).

Attachment #9189081 - Flags: approval-comm-beta?

Comment on attachment 9189081 [details] [diff] [review]
1586494-release-nsImapMailCopyState-after-copy-2.patch

[Triage Comment]
Approved for beta

Attachment #9189081 - Flags: approval-comm-beta? → approval-comm-beta+

Crash rate today is about 20% lower than 6 months ago https://crash-stats.mozilla.org/signature/?product=Thunderbird&signature=NS_CycleCollectorSuspect3&date=%3E%3D2020-05-23T11%3A30%3A00.000Z&date=%3C2020-11-23T11%3A30%3A00.000Z#graphs

One might conclude the crash rate for 78.x must be lower than 68.x, but it's hard to construct a set of crash queries to tease that out because the time period covers summer and the erratic transition from 68 to 78.

(In reply to Wayne Mery (:wsmwk) from comment #51)

Created attachment 9189473 [details]
Screen Shot 2020-11-23 at 6.35.42 AM.png

Crash rate today is about 20% lower than 6 months ago https://crash-stats.mozilla.org/signature/?product=Thunderbird&signature=NS_CycleCollectorSuspect3&date=%3E%3D2020-05-23T11%3A30%3A00.000Z&date=%3C2020-11-23T11%3A30%3A00.000Z#graphs

One might conclude the crash rate for 78.x must be lower than 68.x, but it's hard to construct a set of crash queries to tease that out because the time period covers summer and the erratic transition from 68 to 78.

Well, I didn't realize this bug produces a few thousands crashes a day(!). This is way too high, isn't it?
(Actually, evern since I noticed TB tended to crash when I shut it down, I have been using TB without shutting down all the time.)

Maybe after December 11, I may be able to test overflowing Imap server or something.
It seems to me that the failure to check for low-level I/O errors may be also to blame (or at least complicate the analysis here).

Comment on attachment 9189081 [details] [diff] [review]
1586494-release-nsImapMailCopyState-after-copy-2.patch

[Approval Request Comment]
Crash fix.

Attachment #9189081 - Flags: approval-comm-esr78?

To set expectations, beta https://crash-stats.mozilla.org/signature/?release_channel=beta&product=Thunderbird&signature=NS_CycleCollectorSuspect3&date=%3E%3D2020-09-10T15%3A59%3A00.000Z&date=%3C2020-12-10T15%3A59%3A00.000Z#graphs does not show a significant decrease in crash rate, if any as a result of this being on 84.0b2 buildid 20201123211438. So it will be interesting to see if uplifting has a different impact on release channel

Comment on attachment 9189081 [details] [diff] [review]
1586494-release-nsImapMailCopyState-after-copy-2.patch

[Triage Comment]
Approved for esr78

Attachment #9189081 - Flags: approval-comm-esr78? → approval-comm-esr78+

Uplifting note: The change that landed in comment 36 (Aug 2020) is not present on c-esr78 so it is going along with the approved patch in comment 56.

Good news....

  • crash reports down ~60% for NS_CycleCollectorSuspect3 (this bug) - but still not gone
  • crash reports down ~40% for nsXPCWrappedJS::Release (bug 1517464) - but still not gone
    That's with 22% drop in ADI compared to two weeks ago, and full uptake of 78.6.0 (85% of users) per https://stats.thunderbird.net/

I get roughly the same reduction if I limit stats to only 78.5.0, 78.5.1, 78.6.0.

Two examples from 78.6.0 bp-3fdb7008-cc7f-4bc9-9b3f-ca1f50210101 bp-5520680c-c1e4-4e53-86e1-ee74d0210101

Attachment #9189473 - Attachment description: Screen Shot 2020-11-23 at 6.35.42 AM.png → NS_CycleCollectorSuspect3 graph 6 months

Thunderbird 78.9.0 Crash Report [@ NS_CycleCollectorSuspect3 ] / 32bit TB on 64bit Win 10. / master password active / crash while auto saving to drafts. I thing tha draft wanted to become stored crypted via TB internal PGP - imho that failed, because attachement 25 MB or more were to big,,,
Crashing Thread (72), Name: IMAP
bp-09bed46d-a84e-4e60-b072-3a51b0210406

Frame 	Module 	Signature 	Source 	Trust
0 	xul.dll 	NS_CycleCollectorSuspect3(void*, nsCycleCollectionParticipant*, nsCycleCollectingAutoRefCnt*, bool*) 	xpcom/base/nsCycleCollector.cpp:3759 	context
1 	xul.dll 	XULContentSinkImpl::Release() 	dom/xul/nsXULContentSink.cpp:165 	cfi
2 	xul.dll 	nsMsgComposeAndSend::~nsMsgComposeAndSend() 	comm/mailnews/compose/src/nsMsgSend.cpp:0 	cfi
3 	xul.dll 	nsMsgComposeAndSend::~nsMsgComposeAndSend() 	comm/mailnews/compose/src/nsMsgSend.cpp:283 	cfi
4 	xul.dll 	nsMsgComposeAndSend::Release() 	comm/mailnews/compose/src/nsMsgSend.cpp:244 	cfi
5 	xul.dll 	CopyListener::~CopyListener() 	comm/mailnews/compose/src/nsMsgCopy.cpp:41 	cfi
6 	xul.dll 	ArrayBufferInputStream::Release() 	netwerk/base/BackgroundFileSaver.cpp:1077 	cfi
7 	xul.dll 	nsImapMailCopyState::~nsImapMailCopyState() 	comm/mailnews/imap/src/nsImapMailFolder.cpp:7611 	cfi
8 	xul.dll 	nsImapMailCopyState::~nsImapMailCopyState() 	comm/mailnews/imap/src/nsImapMailFolder.cpp:7608 	cfi
9 	xul.dll 	mozilla::dom::`anonymous namespace'::FileCreationHandler::Release() 	dom/base/BodyConsumer.cpp:436 	cfi

10 xul.dll nsImapUrl::~nsImapUrl() comm/mailnews/imap/src/nsImapUrl.cpp:76 cfi
11 xul.dll [thunk]: virtual void* __thiscall nsImapUrl::`vector deleting destructor'(unsigned int) cfi
12 xul.dll nsMsgMailNewsUrl::Release() comm/mailnews/base/src/nsMsgMailNewsUrl.cpp:80 frame_pointer

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: