crash in NS_CycleCollector(Suspect|Forget)2_P using non-threadsafe object from a thread other than the main thread

VERIFIED WORKSFORME

Status

--
critical
VERIFIED WORKSFORME
7 years ago
2 years ago

People

(Reporter: wsmwk, Unassigned)

Tracking

(Depends on: 1 bug, 4 keywords)

Trunk
x86
Windows 7
crash, qawanted, regression, testcase-wanted
Dependency tree / graph

Thunderbird Tracking Flags

(thunderbird11-, thunderbird12-, thunderbird13-)

Details

(Whiteboard: [no str][3+ stacks?][gs], crash signature, URL)

(Reporter)

Description

7 years ago
This bug was filed from the Socorro interface 
============================================================= 

#1 rank currently for version 8.0. versus #5 for version 7 and #7 for version 6.  

In firefox it had been highly ranked for version 5, but is currently well below 100. Bug 633445 - OOM crash in nsCycleCollectingAutoRefCnt::decr(nsISupports*) or AbortIfOffMainThreadIfCheckFast (malware-related)


bp-30683f32-3185-47a7-bf4b-649b82111109 (otosaxel)
0	mozalloc.dll	mozalloc_abort	memory/mozalloc/mozalloc_abort.cpp:77
1	xul.dll	NS_DebugBreak_P	xpcom/base/nsDebugImpl.cpp:345
2	xul.dll	AbortIfOffMainThreadIfCheckFast	xpcom/base/nsCycleCollector.cpp:1267
3	xul.dll	nsCycleCollector::Forget2	xpcom/base/nsCycleCollector.cpp:2508
4	xul.dll	NS_CycleCollectorForget2_P	xpcom/base/nsCycleCollector.cpp:3411
5	xul.dll	nsCycleCollectingAutoRefCnt::incr	objdir-tb/mozilla/dist/include/nsISupportsImpl.h:161
6	xul.dll	nsEventListenerInfo::AddRef	content/svg/content/src/nsSVGLength2.cpp:58
7	xul.dll	nsRefPtr<nsDocShellEnumerator>::assign_with_AddRef	objdir-tb/mozilla/dist/include/nsAutoPtr.h:940
8	xul.dll	nsDOMEvent::nsDOMEvent	content/events/src/nsDOMEvent.cpp:121
9	xul.dll	nsDOMPageTransitionEvent::nsDOMPageTransitionEvent	content/events/src/nsDOMPageTransitionEvent.h:50
10	xul.dll	NS_NewDOMPageTransitionEvent	content/events/src/nsDOMPageTransitionEvent.cpp:77
11	xul.dll	nsEventDispatcher::CreateEvent	content/events/src/nsEventDispatcher.cpp:878
12	xul.dll	nsDocument::CreateEvent	content/base/src/nsDocument.cpp:6273
13	xul.dll	nsDocument::DispatchPageTransition	content/base/src/nsDocument.cpp:7282
14	xul.dll	nsDocument::OnPageHide	content/base/src/nsDocument.cpp:7404
15	xul.dll	DocumentViewerImpl::PageHide	layout/base/nsDocumentViewer.cpp:1284
(Reporter)

Comment 1

7 years ago
Half of the crashes have lightning installed, if my searching is correct.

thus half do not, including 
bp-45c91420-18ea-40f7-bb70-12b772111111 (Paul)
bp-07ad904c-04f8-4d7b-a41f-e468a2111112 (rorscan)
(Reporter)

Comment 2

7 years ago
continues to be #1 crash
hopefully the following will be helpful. 

1. Some users report that they crash ONLY on first startup with a new release. However, afaict looking at small sample of crashes it is very tiny, perhaps insignificant % of crashes. But can't tell - socorro doesn't aggregate that info in the summary

2. *** Digging deeper in the stacks, we see 90+% of the crashes have this on the stack
mozHunspellDirProvider::AppendingEnumerator::~AppendingEnumerator
nsInterfaceRequestorAgg::Release
but a high percentage of people say they were doing nothing, eg no composing, when crash occurred

3. if one ignores the above two frames, there may be correlation to the stacks for Bug 692981 - crash for pop user in nsInterfaceRequestorAgg::Release

4. distribution of uptime - 30% less than 10 minutes,  10min < 30% < 6hr

5. having looked at 20+ stacks, only one is malware related bp-4961f34b-5ab8-41fa-9cef-6a1fc2111128 - so perhaps no strong correlation to FF bug 633445 - OOM crash in nsCycleCollectingAutoRefCnt::decr(nsISupports*) or AbortIfOffMainThreadIfCheckFast (malware-related)

6. Some significant percentage of users have no extensions or very few

7. socorro offers no correlations 

8. some reporters say they had just gotten new mail
Keywords: qawanted
Summary: crash mozalloc_abort → crash mozalloc_abort with mozHunspellDirProvider::AppendingEnumerator::~AppendingEnumerator and nsInterfaceRequestorAgg::Release on stack
(Reporter)

Comment 3

7 years ago
1. usul, can we eliminate malware or AV as causing most of these?  

2. bienvenu, looking back at several releases, the stacks have changed somewhat over time - earlier stacks not having mozHunspellDirProvider for example. But the majority [1] do have a frame of nsNSSSocketInfo::SetNotificationCallbacks, which gives more correlation to bug 692981. 
  v6 bp-3b607353-8e4c-434e-b7eb-d895a2111029
  v5 bp-f6ab4ddb-6491-4535-bceb-8ba9e2111030
  [1] an example exception to the majority is bp-a20d8c29-92d1-43f3-a289-479cf2111029

3. I may be jumping the gun to call this a regression, but the signature occurs in version 5 and it's ancestors. examples
 v5.0b1    bp-f8f552d3-fbaa-457d-b0e3-0006b2110603
 v3.3a4pre bp-29effd5b-6230-40f4-8614-9279a2110602 
 v3.3a3pre bp-fddbe1eb-11f9-4bac-9a96-04a282110301
 (again their are exceptions, of the imap variety like bp-400f1d6e-dc99-40c2-9c76-843ca2110524 and bp-17b77683-0585-4bd8-87ed-3493d2110322
Keywords: regression

Comment 4

7 years ago
(In reply to Wayne Mery (:wsmwk) from comment #3)
> 1. usul, can we eliminate malware or AV as causing most of these?  
> 
> 2. bienvenu, looking back at several releases, the stacks have changed
> somewhat over time - earlier stacks not having mozHunspellDirProvider for
> example. But the majority [1] do have a frame of
> nsNSSSocketInfo::SetNotificationCallbacks, which gives more correlation to
> bug 692981. 

Yes, that was my feeling as well. I believe mozHunspellDirProvider is a complete red herring.
(In reply to Wayne Mery (:wsmwk) from comment #3)
> 1. usul, can we eliminate malware or AV as causing most of these?  

I'd think so I don't see how AV would interact here. For malware it could be but something would probably be visible in the extensions.

Comment 6

7 years ago
Is this reproducible?
(Reporter)

Comment 7

7 years ago
(In reply to Ehsan Akhgari [:ehsan] from comment #6)
> Is this reproducible?

we don't have steps. 
and no users who crash are cc on the bug
(hence qawanted)
Whiteboard: [no str]

Comment 8

7 years ago
Wayne (wsmwk)I followed your link because as requested I added the add-0n View About to Thunderbird yesterday and this morning Windows7 again did not start normally. It went into its troubleshooting mode and turned the system back to last know good boot. When I checked my e-mail again in the afternoon, I found that all the work I had done previously had been undone. I am not able to send you any additional information but if I allow Thunderbird to update, the next start up will send everything back to the 'last good boot'. Send me instructions on what you need and I will see if I can send it to you.
(Reporter)

Comment 9

7 years ago
bienvenu, many of the reporters are pop, perhaps disproportionately so (but I am still investigating that angle). Perhaps I mentioned this before - getting a handle on bug 692981 / bug 673438 may help narrow the scope of this crash signature. 

pappotautos for example using version 12 crashes with nsCOMPtr_base::~nsCOMPtr_base() | nsInterfaceRequestorAgg::Release() bp-8126607a-81c1-472c-b5e8-28dfc2120108 bug 673438, and AbortIfOffMainThreadIfCheckFast bp-ec2c1cf2-1493-4a99-8f8c-95ed62120105 (nsInterfaceRequestorAgg not on stack - but cycle collector is)

however, the crash rate of mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast far exceeds that of the other signatures (tenfold in version 8)

looking briefly at current crashes, I don't see any that have mozHunspellDirProvider::AppendingEnumerator stack.
tracking-thunderbird11: --- → ?
Depends on: 692981, 673438, 676826
See Also: → bug 633445
Summary: crash mozalloc_abort with mozHunspellDirProvider::AppendingEnumerator::~AppendingEnumerator and nsInterfaceRequestorAgg::Release on stack → crash mozalloc_abort with nsInterfaceRequestorAgg::Release on stack
(Reporter)

Comment 10

7 years ago
(In reply to Wayne Mery (:wsmwk) from comment #9)
> ...
> pappotautos for example using version 12 crashes with
> nsCOMPtr_base::~nsCOMPtr_base() | nsInterfaceRequestorAgg::Release()
> bp-8126607a-81c1-472c-b5e8-28dfc2120108 bug 673438, and
> AbortIfOffMainThreadIfCheckFast bp-ec2c1cf2-1493-4a99-8f8c-95ed62120105
> (nsInterfaceRequestorAgg not on stack - but cycle collector is)

specifically, nsCycleCollectingAutoRefCnt::incr as mentioned in bug 633445
(Reporter)

Comment 11

7 years ago
Noel, had you posted in getsatisfaction?   Do you have a URL to that topic?
Are your accounts all pop?

Comment 12

7 years ago
reply to comment 11. I sent you all the information I was able to get from the system. I had to clean out my computer of dust and animal hair as it was not working properly and on restarting I found I had lost everything pertaining to this problem. I am currently in the process of reinstalling Windows7 and will see what happens. The accounts are with verizon as POP and Gmail is IMAP. Sorry I can't be of more help.
Hard to say how bad this stack actually is - there seems to be several different stacks with the same top.
tracking-thunderbird11: ? → -
tracking-thunderbird12: --- → ?
Assignee: nobody → dbienvenu
(Reporter)

Comment 14

7 years ago
(In reply to Mark Banner (:standard8) from comment #13)
> Hard to say how bad this stack actually is - there seems to be several
> different stacks with the same top.

yes. and very few crash comments - only 8 comments in 2 weeks
Keywords: testcase-wanted
Whiteboard: [no str] → [no str][3+ stacks?]
We can track this for 13, but I think this needs more analysis on the individual stacks that are going on.
tracking-thunderbird12: ? → -
tracking-thunderbird13: --- → +
tracking-thunderbird13: + → -
(Reporter)

Updated

6 years ago
Whiteboard: [no str][3+ stacks?] → [no str][3+ stacks?][gs]
(Reporter)

Comment 16

6 years ago
topcrash also for 10.0.6esr. some examples:
bp-f19240ce-d26c-4d17-8e9e-a07582120811 (has email address)
bp-5f175215-5a9e-46c3-8497-df22e2120815 (has ldap on stack)
bp-25e26300-27be-4d49-a464-bc4162120810
(Reporter)

Comment 17

6 years ago
irving, could you have a look at this and the related crash bugs?  Perhaps you can see a path to resolution.

I've emailed you Larry's conversation with me.

Comment 18

6 years ago
We are hitting an abort added in bug 549743 by Jonas Sicking with this description:

"To avoid race conditions that cause leaks and crashes sometimes, we should make sure that people that use cycle collected objects from threads other than the main thread know about it.

The easiest way to do that is to abort as soon as someone does it.

This needs to happen early in a release cycle so that developers have time to notice and fix their code."

Perhaps Jonas can give us some hints of what to do about this, now that we have had "time to notice".
The short of what's happening here is that someone is trying to use an object which isn't threadsafe from a thread other than the main thread.

The fact that we intentionally abort is likely not actually causing any more crashes since if we didn't abort here we'd likely be messing up the hash tables used by the cycle collector which means that we'd be crashing in a random addref or release elsewhere.

So that said, the stack in [0] is looking somewhat weird.

Somehow we are, off the main thread, calling nsDocLoader::Release which is calling the nsDocShell destructor. All off the main thread. This eventually ends up firing a page transition event, probably "pagehide" which is crashing somewhere trying to addref a cycle collected object.

Unfortunately I can't tell why we end up calling nsDocLoader::Release. It's looking like something has been registered as a notification callback on a nsNSSSocketInfo. When that something is getting released it's causing a call to nsDocLoader::Release. This is happening in frames 21-26 in [0].

Unfortunately I suspect there's a good amount of function collapsing happening here, so it's probably not the functions in frame 21 and 22 which are actually called, but rather something else.

Really what we should be doing here is to make the nsDocLoader (which is the base class for nsDocShell) crash even earlier if someone tries to use it off the main thread. That might provide a more helpful callstack.

Unfortunately I can't provide a whole lot more useful info other than "someone is using the docshell of the main thread and that's just not safe". Unfortunately we are using threadsafe macros to implement addref/releae for the docshell which is hiding this bug for a bit.

[0] https://crash-stats.mozilla.com/report/index/30683f32-3185-47a7-bf4b-649b82111109
(Reporter)

Comment 20

6 years ago
(In reply to Jonas Sicking (:sicking) from comment #19)
> The short of what's happening here is that someone is trying to use an
> object which isn't threadsafe from a thread other than the main thread.
> 
> The fact that we intentionally abort is likely not actually causing any more
> crashes since if we didn't abort here we'd likely be messing up the hash
> tables used by the cycle collector which means that we'd be crashing in a
> random addref or release elsewhere.
> 
...

Thanks Sicking

Note: a small spot check indicates most crashes no longer correlate strongly to nsInterfaceRequestorAgg. Instead, stacks tend to have nsDocShell::FirePageHideNotification, imap, or involve message compose.
examples:
bp-f0fe9057-9e4e-4a10-8510-b8abf2121113 nsImapMailCopyState::~nsImapMailCopyState
bp-91dc1faf-918f-4d64-8430-4bd042121114 nsEditor::Release | nsMsgComposeAndSend::~nsMsgComposeAndSend
bp-f19240ce-d26c-4d17-8e9e-a07582120811
bp-c6c8a57b-10c0-46be-85b1-d42952121117
Summary: crash mozalloc_abort with nsInterfaceRequestorAgg::Release on stack → crash using non-threadsafe object from a thread other than the main thread. mozalloc_abort ... AbortIfOffMainThreadIfCheckFast with nsInterfaceRequestorAgg::Release on stack
(Reporter)

Comment 21

6 years ago
mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady() looks lika a new manifestation https://crash-stats.mozilla.com/report/list?product=Thunderbird&query_search=signature&query_type=exact&query=mozilla%3A%3Awidget%3A%3AAsyncFaviconDataReady%3A%3A~AsyncFaviconDataReady%28%29&reason_type=contains&date=11%2F30%2F2012%2015%3A38%3A00&range_value=4&range_unit=weeks&hang_type=any&process_type=all&do_query=1&admin=1&signature=mozilla%3A%3Awidget%3A%3AAsyncFaviconDataReady%3A%3A~AsyncFaviconDataReady%28%29

#50 crash for TB17. signature starts in TB17.
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast] → [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast] [@ mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady()]
Component: General → Networking: POP
Product: Thunderbird → MailNews Core
(Reporter)

Comment 22

6 years ago
#4 crash for TB17 (probably #1 , #1 crash for TB19 beta, #1 for TB21

TB21 bp-ad67e393-02ee-4127-a1d1-cffff2130120 (sergio)
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast] [@ mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady()] → [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast] [@ mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady()] [@ moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfac…
Keywords: topcrash → topcrash+
(Reporter)

Comment 23

6 years ago
(In reply to Jonas Sicking (:sicking) from comment #19)
>Really what we should be doing here is to make the nsDocLoader (which is the base
>class for nsDocShell) crash even earlier if someone tries to use it off the main 
> thread. That might provide a more helpful callstack.
> Unfortunately I can't provide a whole lot more useful info other than
> "someone is using the docshell of the main thread and that's just not safe".
> Unfortunately we are using threadsafe macros to implement addref/releae for
> the docshell which is hiding this bug for a bit.

rkent or irving, could you make a try build with such a change?
Then we can unleash this to a user who reproduces the crash.
Flags: needinfo?(irving)
We're assigning the value that ends up resulting in the crash here: http://mxr.mozilla.org/comm-central/source/mailnews/imap/src/nsImapProtocol.cpp#632

I'm concerned about the comment at https://bugzilla.mozilla.org/show_bug.cgi?id=243591#c16, which asserts that the callbacks aren't used off the main thread; at least in Thunderbird, they are added and removed from the socket (and thus reference counted and in the case of our crash freed) on the socket thread.
Flags: needinfo?(irving) → needinfo?(jonas)

Updated

6 years ago
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast] [@ mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady()] [@ moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfac… → [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast] [@ mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady()] [@ moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfac…
(Reporter)

Comment 26

6 years ago
(In reply to :Irving Reid from comment #24)
> We're assigning the value that ends up resulting in the crash here:
> http://mxr.mozilla.org/comm-central/source/mailnews/imap/src/nsImapProtocol.
> cpp#632
> 
> I'm concerned about the comment at
> https://bugzilla.mozilla.org/show_bug.cgi?id=243591#c16, which asserts that
> the callbacks aren't used off the main thread; at least in Thunderbird, they
> are added and removed from the socket (and thus reference counted and in the
> case of our crash freed) on the socket thread.

(In reply to Jonas Sicking (:sicking) from comment #25)
> See comment 19

Can you make a try build with changes related to nsDocLoader suggested in comment 19?
Flags: needinfo?(irving)
(In reply to Wayne Mery (:wsmwk) from comment #26)

> (In reply to Jonas Sicking (:sicking) from comment #25)
> > See comment 19
> 
> Can you make a try build with changes related to nsDocLoader suggested in
> comment 19?

I think nsDocLoader is not related to this crash because there is no nsDocShell symbols in recent crash logs. (An example in comment22)
(Reporter)

Comment 28

5 years ago
Appreciate the feedback everyone. 

Let's remember, this is literally our #1 crash. With a crash rate (of the combined signatures) roughly double the #2 and #3 ranked signatures.


(In reply to Hiroyuki Ikezoe (:hiro) from comment #27)
> (In reply to Wayne Mery (:wsmwk) from comment #26)
> 
> > (In reply to Jonas Sicking (:sicking) from comment #25)
> > > See comment 19
> > 
> > Can you make a try build with changes related to nsDocLoader suggested in
> > comment 19?
> 
> I think nsDocLoader is not related to this crash because there is no
> nsDocShell symbols in recent crash logs. (An example in comment22)

If so, can you suggest another approach where we can make it crash sooner/differently?

If not, the crash has been going on over a year without solid progress, shouldn't we try something, even it's a long shot?
Flags: needinfo?(jonas)
Flags: needinfo?(hiikezoe)
The whole purpose of the AbortIfOffMainThreadIfCheckFast function is to crash sooner. I.e. we are crashing as soon as someone does an addref or release from the wrong thread. When the crash happens the buggy code should be visible directly in the call stack.

I don't think there is any way to crash any sooner, because generally we should be crashing right at the right time. The only way I can think of crashing sooner is to make sure that we don't have main-thread-only objects that are using the THREADSAFE nsISupports macros. I.e. a full audit of all classes that use the THREADSAFE macros might turn up something. I think nsDocShell used to use THREADSAFE macros though it shouldn't, but that appears to be fixed now.

Last I looked at this bug, which was in comment 19, the stack seemed a bit messed up. So the problem wasn't crashing at the wrong time, but rather that we weren't getting a proper stack when the crash happened. If we have more incidents now then I can have a look. Though so can probably most other Gecko developers.
Flags: needinfo?(jonas)

Comment 30

5 years ago
(In reply to Wayne Mery (:wsmwk) from comment #22)
> #4 crash for TB17 (probably #1 , #1 crash for TB19 beta, #1 for TB21 
> TB21 bp-ad67e393-02ee-4127-a1d1-cffff2130120 (sergio)
Wayne, are you sure the moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfaceRequestorAgg::Release signature with no abort message (see https://crash-stats.mozilla.com/report/list?product=Thunderbird&signature=moz_abort+|+arena_run_reg_dalloc+|+arena_dalloc_small+|+arena_dalloc+|+je_free+|+nsInterfaceRequestorAgg%3A%3ARelease%28%29) is related (bug 781049 with a similar signature and still no abort message is not)? If it isn't, this bug is fixed.
Flags: needinfo?(vseerror)

Comment 31

5 years ago
Comment 30 contains the question and the answer. I filed bug 902158 for the moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfaceRequestorAgg::Release signature.

There are currently no crashes starting with mozalloc_abort in 23.0 Beta, 24.0a2 and 25.0a1 with the "Main-thread-only object used off the main thread" abort message.
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast] [@ mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady()] [@ moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfac… → [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast] [@ mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady()] [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast | nsC…
status-thunderbird-esr17: --- → affected
Flags: needinfo?(vseerror)
Summary: crash using non-threadsafe object from a thread other than the main thread. mozalloc_abort ... AbortIfOffMainThreadIfCheckFast with nsInterfaceRequestorAgg::Release on stack → crash in NS_CycleCollector(Suspect|Forget)2_P using non-threadsafe object from a thread other than the main thread
(Reporter)

Comment 32

5 years ago
Scoobidiver - thanks very much for the help

(In reply to Scoobidiver from comment #30)
> (In reply to Wayne Mery (:wsmwk) from comment #22)
> > #4 crash for TB17 (probably #1 , #1 crash for TB19 beta, #1 for TB21 
> > TB21 bp-ad67e393-02ee-4127-a1d1-cffff2130120 (sergio)
> [1] Wayne, are you sure the moz_abort | arena_run_reg_dalloc |
> arena_dalloc_small | arena_dalloc | je_free |
> nsInterfaceRequestorAgg::Release signature with no abort message (see
> https://crash-stats.mozilla.com/report/
> list?product=Thunderbird&signature=moz_abort+|+arena_run_reg_dalloc+|+arena_d
> alloc_small+|+arena_dalloc+|+je_free+|+nsInterfaceRequestorAgg%3A%3ARelease%2
> 8%29) is related (bug 781049 with a similar signature and still no abort
> message is not)? 

Unsure - I'd have to do some digging. (or anyone could)


> [2] If it isn't, this bug is fixed.

ditto.  

If fixed, I'd be most interested to know fixed when, and by what?

Note, crash reports for this collection of bugs have highly correlated to pop users, and there have been no major Thunderbird pop bugs (or networking for that matter) fixed in recent months. https://bugzilla.mozilla.org/buglist.cgi?f1=OP&o3=anywordssubstr&list_id=7508976&resolution=FIXED&classification=Client%20Software&classification=Components&o2=anywordssubstr&f4=CP&chfieldto=Now&chfield=resolution&query_format=advanced&j1=OR&chfieldfrom=5m&f3=keywords&f2=short_desc&chfieldvalue=fixed&component=Backend&component=Networking&component=Networking%3A%20POP&product=MailNews%20Core&product=Thunderbird

That said there could have been patch checkins via bugs in other components, or core.

Note also, what we see in non-release versions of thunderbird doesn't always correlate nicely to what we see in release.
I don't think the bug is fixed. I think the stack trace has just changed.
Flags: needinfo?(hiikezoe)

Comment 34

5 years ago
(In reply to Wayne Mery (:wsmwk) from comment #32)
> Note also, what we see in non-release versions of thunderbird doesn't always
> correlate nicely to what we see in release.
I think SeaMonkey can give you a tendency and I don't see crashes with a signature containing AbortIfOffMainThread after SeaMonkey 2.17.1 (Gecko 20).

> If fixed, I'd be most interested to know fixed when, and by what?
It's fixed between Gecko 20 and 22 (no SM 2.18).
Looking at crash-stats, it looks like all the recent crashes are on Windows; not sure if that's a helpful clue yet.

I took a quick look at the most recent twelve stack traces for 17.0.8 and there are two distinct cases; ten of them trace back through mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady() and the other two through nsImapProtocol::TryToLogon(); these are almost certainly two distinct bugs, with the AsyncFaviconDataReady the most common.

That said, in both cases we're releasing the last reference to a refcounted object off the main thread, which is triggering an attempt to manipulate UI elements from inside the destructors; those UI elements then trigger the non-threadsafe crash.

... Turns out AsyncFaviconDataReady is Windows-only code, which matches up with the crash-stats observation ...

I'm not having any luck getting a Windows debug build to invoke AsyncFaviconDataReady. I spent some time studying the code, and it looks like it's related to drag and drop actions on URL desktop objects, but I can't get it to trap in the debugger.

My impression of what's going on in the stack trace is that something has created an async I/O operation, and used an nsDocLoader as the event handler for the async I/O. I suspect that then the main thread references to the nsDocLoader are dropped *without cancelling the async I/O*, leaving the socket callback array holding the last reference to the nsDocLoader. When the socket releases its callbacks (off the main thread), everything else gets destroyed.

My guess is that the variety of stacks we're seeing represent different paths that trigger background I/O and then don't cancel it before releasing the DocShell.
status-thunderbird-esr17: affected → ---
Flags: needinfo?(irving)
(Reporter)

Comment 36

5 years ago
I haven't checked whether any of these match

linux...

bp-ef143b59-1aa8-4443-a70f-aa52b2130805 17.0.7  nsCycleCollector::Suspect2 
bp-cdc4eb7d-d7c4-4726-8f15-a94772130817 17.0.8  mozalloc_abort | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast | nsCycleCollector::Suspect2 
bp-56637ddc-85db-4453-9592-a90b22130818 17.0.8  mozalloc_abort | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast | nsCycleCollector::Suspect2 
bp-2bb7f37a-ca52-470c-8a42-b39312130730 17.0.7  mozalloc_abort | NS_DebugBreak_P | nsCycleCollector::Suspect2
(Reporter)

Comment 37

5 years ago
Based on similar topcrash rank, and the fact that it involves non-main thread, it seems the Thunderbird 24 equivalent is Bug 917961 - crash in nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) accessing nsXPCWrappedJS off-main-thread after nsMsgFolderNotificationService::NotifyFolderDeleted

Hopefully Bug 917961 has a more useful stack :)


(In reply to Scoobidiver from comment #31)
> Comment 30 contains the question and the answer. I filed bug 902158 for the
> moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc |
> je_free | nsInterfaceRequestorAgg::Release signature.
> 
> There are currently no crashes starting with mozalloc_abort in 23.0 Beta,
> 24.0a2 and 25.0a1 with the "Main-thread-only object used off the main
> thread" abort message.

Indeed signature does not exist in TB24 release.  However, it's unfortunately  proven again that it is near impossible to trust Thunderbird betas to predict new or changed topcrashes in release. 

0 crashes in betas - https://crash-stats.mozilla.com/query/?product=Thunderbird&version=ALL%3AALL&range_value=8&range_unit=weeks&date=09%2F21%2F2013+23%3A00%3A00&query_search=signature&query_type=is_exactly&query=nsXPCWrappedJS%3A%3ACallMethod%28unsigned+short%2C+XPTMethodDescriptor+const*%2C+nsXPTCMiniVariant*%29&reason=beta&release_channels=&build_id=&process_type=any&hang_type=any
Depends on: 917961

Updated

5 years ago
Keywords: topcrash+ → topcrash-thunderbird
(Reporter)

Updated

4 years ago
Assignee: mozilla → nobody
See Also: → bug 902158

Updated

3 years ago
Depends on: 1175055

Updated

3 years ago
Depends on: 1176748

Updated

3 years ago
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast] [@ mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady()] [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast | nsC… → [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast] [@ mozilla::widget::AsyncFaviconDataReady::~AsyncFaviconDataReady()] [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast | nsC…
(Reporter)

Comment 38

2 years ago
No longer a topcrash
Keywords: topcrash-thunderbird
I'm marking this bug as WORKSFORME as bug crashlog signature didn't appear from a long time (over half year) [except some obsolete <18 versions, no crashes starting since 18 version].
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME
(Reporter)

Comment 40

2 years ago
(In reply to Virtual_ManPL [:Virtual] - (please needinfo? me) from comment #39)
> crashlog signature didn't appear from a long time (over half year) [except some obsolete <18 versions, no
> crashes starting since 18 version].

Correct. https://crash-stats.mozilla.com/search/?signature=~AbortIfOffMainThreadIfCheckFast&product=Thunderbird&date=%3E%3D2016-12-27T12%3A55%3A54.000Z&date=%3C2017-03-27T12%3A55%3A54.000Z&_sort=-date&_facets=signature&_facets=version&_columns=date&_columns=signature&_columns=version&_columns=platform&_columns=user_comments#crash-reports

Import, which was a major sources of crashes, is mostly been disabled since version 17 and explains much of the reduction. 

Nothing landed in bug 673438 and unclear if anything from https://mzl.la/2n9FhKv helped.  Crash report comments don't help, and 90+% of version 17 and older crashes are one off where the user doesn't crash more than once in a 3 month period. Perhaps a patch in https://hg.mozilla.org/comm-central/log/tip/mailnews/local/src/nsPop3Protocol.cpp helped but that would take time to determine.

To summarize, this can stay WFM unless/until a testcase appears
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.