1606347 - Crash in [@ mozilla::DOMEventTargetHelper::Release]

Reporter

Description

•

5 years ago

This bug is for crash report bp-d811269a-12d0-4682-b1d1-040d80191220.

Seen while looking at nightly crash stats - Linux crashes started in 20191202220401 and were joined by some macOS crashes shortly thereafter: https://bit.ly/2ZMB2sF. 10 crashes/10 installs so far so rather low volume. There are some Linux and Windows crashes before 73 nightly with the same signature, but I am not sure if they are the same issue.

Possible nightly regression range based on build id: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8504d70d827261346737af1cbe9b96acf6756b6d&tochange=9420b5dc27e0cdf30a65b1b733400a970b3cb708

Top 9 frames of crashing thread:

0 XUL mozilla::DOMEventTargetHelper::Release dom/events/DOMEventTargetHelper.cpp:87
1 XUL mozilla::UniquePtr<mozilla::extensions:: toolkit/components/extensions/webrequest/ChannelWrapper.cpp:1144
2 libsystem_c.dylib __cxa_finalize_ranges 
3 libsystem_c.dylib exit 
4 XUL google_breakpad::ExceptionHandler::WaitForMessage toolkit/crashreporter/breakpad-client/mac/handler/exception_handler.cc:631
5 libsystem_pthread.dylib _pthread_body 
6 libsystem_pthread.dylib _pthread_start 
7 libsystem_pthread.dylib thread_start 
8 XUL XUL@0x41e737f

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 1

•

5 years ago

ChannelWrapper just extends DOMEventTargetHelper

Component: DOM: Events → Request Handling

Product: Core → WebExtensions

Marcia Knous [:marcia]

Reporter

Comment 2

•

5 years ago

Last crash was in 1-02 build and one crash in 73.0b1. Jim any ideas what might have caused this small volume regression?

Flags: needinfo?(jmathies)

Shane Caraveo (:mixedpuppy)

Comment 3

•

5 years ago

From the crash data reports above, there seems to be a lot of different reasons for a crash on mozilla::DOMEventTargetHelper::Release. Looking at the potential regression range, I don't see anything that sticks out to me.

a) is there a way to limit the crash data report to those including ChannelWrapper?
b) probably will need someone more fluent in c++ than me to look into this, but if I can get a tighter report, I might be able to see something.

Flags: needinfo?(mozillamarcia.knous)

Marcia Knous [:marcia]

Reporter

Comment 4

•

5 years ago

There is a small subset of crashes that have ChannelWrapper (https://bit.ly/303QosK), but given the small overall volume on release I don't think we should probably devote that much time to figure out what might be causing them. In terms of the nightly crashes, since I filed this there haven't been any crashes since 1-02. I thought maybe this was a nightly only regression, but it seems as Shane notes in Comment 3 there may be lots of causes for this crash.

Flags: needinfo?(mozillamarcia.knous)

Flags: needinfo?(jmathies)

Shane Caraveo (:mixedpuppy)

Comment 5

•

5 years ago

Thanks for that...Another link:

https://crash-stats.mozilla.com/signature/?product=Firefox&proto_signature=~mozilla%3A%3Aextensions&signature=mozilla%3A%3ADOMEventTargetHelper%3A%3ARelease&date=%3E%3D2019-06-30T18%3A23%3A00.000Z&date=%3C2020-01-07T18%3A23%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-date&page=1#reports

One thing I notice on a random sampling of these, with Linux, the crash seems to happen during a run_exit_handlers call. As well, earlier in the stack for all of those is some webrender stuff.

Ryan VanderMeulen [:RyanVM]

Updated

•

5 years ago

status-firefox72: --- → wontfix

status-firefox74: --- → affected

status-firefox-esr68: --- → wontfix

BugBot [:suhaib / :marco/ :calixte]

Comment 6

•

5 years ago

The priority flag is not set for this bug.
:jimm, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)

Jim Mathies [:jimm]

Updated

•

5 years ago

status-firefox73: affected → fix-optional

status-firefox74: affected → fix-optional

Flags: needinfo?(jmathies)

Priority: -- → P2

Gabriele Svelto [:gsvelto]

Comment 7

•

4 years ago

Just bumped on this because our new WER-based crash reporting logic picked up this crash on Windows:

https://crash-stats.mozilla.org/report/index/d40073e9-5cec-4b11-9550-ee8500210408

Looking at the stacks on Linux and macOS this always seems to be happening when running exit handlers. This is a shortened stack on Windows:

0 mozilla::DOMEventTargetHelper::Release()
1 mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder, mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder> >::reset(mozilla::extensions::`anonymous namespace'::ChannelListHolder*)
...
4 execute_onexit_table
...
11 exit_or_terminate_process

And this is Linux:

0 mozilla::DOMEventTargetHelper::Release()
1 mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder, mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder> >::reset(mozilla::extensions::(anonymous namespace)::ChannelListHolder*)
2 __run_exit_handlers
3 exit

And macOS:

0 mozilla::DOMEventTargetHelper::Release()
1 mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder, mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder> >::~UniquePtr()
2 __cxa_finalize_ranges
3 exit

Gabriele Svelto [:gsvelto]

Comment 8

•

4 years ago

The work I've done in bug 1682507 seems to be capturing quite a few crashes on Windows under this signature that we were missing before. I've opened a few minidumps in Visual Studio to analyze the issue and it's probably more complicated than it appears. First of all this is the stack trace as reported by VS:

1.  xul.dll!mozilla::dom::Animation::Release() Line 42	C++
2.  [Inline Frame] xul.dll!mozilla::extensions::ChannelWrapper::Release() Line 1201	C++
3.  [Inline Frame] xul.dll!mozilla::RefPtrTraits<mozilla::extensions::ChannelWrapper>::Release(mozilla::extensions::ChannelWrapper * aPtr) Line 50	C++
4.  [Inline Frame] xul.dll!RefPtr<mozilla::extensions::ChannelWrapper>::ConstRemovingRefPtrTraits<mozilla::extensions::ChannelWrapper>::Release(mozilla::extensions::ChannelWrapper * aPtr) Line 381	C++
5.  [Inline Frame] xul.dll!RefPtr<mozilla::extensions::ChannelWrapper>::assign_assuming_AddRef(mozilla::extensions::ChannelWrapper * aNewPtr) Line 69	C++
6.  [Inline Frame] xul.dll!RefPtr<mozilla::extensions::ChannelWrapper>::operator=(void *) Line 168	C++
7.  [Inline Frame] xul.dll!mozilla::extensions::ChannelWrapper::Die() Line 137	C++
8.  [Inline Frame] xul.dll!mozilla::extensions::`anonymous namespace'::ChannelListHolder::~ChannelListHolder() Line 95	C++
9.  [Inline Frame] xul.dll!mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder>::operator()(mozilla::extensions::`anonymous namespace'::ChannelListHolder *) Line 463	C++
10. xul.dll!mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder,mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder>>::reset(mozilla::extensions::`anonymous namespace'::ChannelListHolder * aPtr) Line 307	C++
11. [External Code]

The crash reason is a NULL pointer access, but it's not in our code. What's happening is that somehow in the Release() call we're trying to access thread local storage and the _tls_end symbol is NULL. To understand what's going on a bit of background on how Windows implements TLS is needed: what's going on is that in every DLL that uses TLS variables those are stored in an area bound by two variables called _tls_start and _tls_end. When the DLL is loaded those symbols become available and are used at runtime by the TLS access code. This crash is happening very late during shutdown, somewhere in an atexit() callback from what I can tell. At that point it's likely that whatever DLL contains the TLS variables we're trying to access (probably XUL) is being unloaded - or has already been unloaded, so the _tls_start and _tls_end symbols aren't available anymore and the code trying to access them fails with the reported NULL pointer access.

This might prove tricky to fix: we probably have to avoid accessing TLS variables inside atexit() handlers. On the upside this crash only happens very, very late during shutdown so it shouldn't be too annoying for users.

Gabriele Svelto [:gsvelto]

Updated

•

4 years ago

OS: macOS → All

Pascal Chevrel:pascalc

Comment 9

•

4 years ago

This bug is spiking on nightly 89 the last days

status-firefox73: fix-optional → wontfix

status-firefox74: fix-optional → wontfix

status-firefox87: --- → affected

status-firefox89: --- → affected

status-firefox-esr78: --- → affected

BugBot [:suhaib / :marco/ :calixte]

Comment 10

•

4 years ago

Change the status for beta to have the same as nightly and release.
For more information, please visit auto_nag documentation.

status-firefox88: --- → affected

Gabriele Svelto [:gsvelto]

Comment 11

•

4 years ago

(In reply to Pascal Chevrel:pascalc from comment #9)

This bug is spiking on nightly 89 the last days

Note that the spike is "artificial" in that these crashes were already happening but we didn't catch them; they'd go straight to Microsoft health dashboards. What's happening is that we're now seeing their actual volume.

Pascal Chevrel:pascalc

Comment 12

•

4 years ago

Thanks Gabriele, this is what I suspected. We should probably revisit the priority of this bug given that our users are crashing a lot more than we thought with this signature.

Ryan VanderMeulen [:RyanVM]

Updated

•

4 years ago

status-firefox87: affected → wontfix

status-firefox88: affected → wontfix

Andrew McCreight [:mccr8]

Comment 13

•

4 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #8)

The crash reason is a NULL pointer access, but it's not in our code. What's happening is that somehow in the Release() call we're trying to access thread local storage and the _tls_end symbol is NULL.

The addref and release methods of cycle collected classes call into TLS (to get the cycle collector object for the current thread). This could be the same underlying issue as the spike in crashes seen with the signature NS_CycleCollectorSuspect3, as that is the method that AddRef and Release call into.

Looking at the types, it seems like we must be destroying sChannelList. Bug 1495748 added some cleanup code that runs at XPCOMShutdown, but maybe that is failing to run for some reason, or maybe new things are being added to the list after shutdown has occurred. Maybe we could just not add things to sChannelList after XPCOMShutdown has occurred somehow? In fact, this whole mechanism seems like something added to make the leak checker happy, so maybe we could simply not use this list at all except in builds where we care about leaks.

Comment 14

•

4 years ago

Attached file Bug 1606347: Don't create a new ChannelWrapper list late in shutdown. r=mccr8 — Details

If we do, we create a new instance after its ClearOnShutdown handler has
run, and it gets destroyed from a static destructor after the cycle collector
has shutdown.

Phabricator Automation

Updated

•

4 years ago

Assignee: nobody → kmaglione+bmo

Status: NEW → ASSIGNED

Kris Maglione [:kmag]

Comment 15

•

4 years ago

(In reply to Andrew McCreight [:mccr8] from comment #13)

In fact, this whole mechanism seems like something added to make the leak checker happy, so maybe we could simply not use this list at all except in builds where we care about leaks.

I think we do need to run this even in cases where we don't care about leaks, since if we don't, the ChannelWrappers will still get destroyed possibly too late in shutdown to be safe in those cases. It won't matter when we do early exits, but when we don't, it could potentially lead to weird crashes.

Pascal Chevrel:pascalc

Comment 16

•

4 years ago

I am tracking this bug for 89 as this has a high volume of crashes on nightly.

tracking-firefox89: --- → +

Pulsebot

Comment 17

•

4 years ago

Pushed by maglione.k@gmail.com: https://hg.mozilla.org/integration/autoland/rev/16f37cbabda8 Don't create a new ChannelWrapper list late in shutdown. r=mccr8

Bogdan Tara[:bogdan_tara | bogdant]

Comment 18

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/16f37cbabda8

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox90: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 90 Branch

Ryan VanderMeulen [:RyanVM]

Comment 19

•

4 years ago

This grafts cleanly as far back as ESR78 also. Under the assumption that the frequency is likely being under-reported there due to the aforementioned WER spike, I'm thinking we probably want to uplift this fix there too.

Kris Maglione [:kmag]

Comment 20

•

4 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)

This grafts cleanly as far back as ESR78 also. Under the assumption that the frequency is likely being under-reported there due to the aforementioned WER spike, I'm thinking we probably want to uplift this fix there too.

The PastShutdownPhase function doesn't exist before 83.

Ryan VanderMeulen [:RyanVM]

Comment 21

•

4 years ago

Well that's certainly a problem :). Guess we'll live with it until ESR91 then!

status-firefox-esr78: affected → wontfix

Kris Maglione [:kmag]

Comment 22

•

4 years ago

Comment on attachment 9216484 [details]
Bug 1606347: Don't create a new ChannelWrapper list late in shutdown. r=mccr8

Beta/Release Uplift Approval Request

User impact if declined: Shutdown crashes
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: No
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Adds a simple check so we don't reinstantiate things during shutdown.
String changes made/needed:

Attachment #9216484 - Flags: approval-mozilla-beta?

Andrew McCreight [:mccr8]

Comment 23

•

4 years ago

Doug, are you working on the fast shutdown stuff? Is it expected that we're running static destructors like this on Nightly?

Flags: needinfo?(dothayer)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 24

•

4 years ago

Crash volume is unchanged in builds with the fix.

Status: RESOLVED → REOPENED

status-firefox90: fixed → affected

Flags: needinfo?(kmaglione+bmo)

Resolution: FIXED → ---

Target Milestone: 90 Branch → ---

Andrew McCreight [:mccr8]

Comment 25

•

4 years ago

This issue is old enough that it isn't really a regression at this point. The surge on crash-stats is just due to increased reporting.

Maybe we need to move ahead with a solution that makes the cycle collector suspect not crash when called after CC shutdown. It would be good if it still continued to crash when called from a thread where the CC was never started (before shutdown).

Keywords: regression

Pascal Chevrel:pascalc

Comment 26

•

4 years ago

Comment on attachment 9216484 [details]
Bug 1606347: Don't create a new ChannelWrapper list late in shutdown. r=mccr8

No change on nightly crashes so no need to uplift that patch at the moment, thanks

Attachment #9216484 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

Pascal Chevrel:pascalc

Updated

•

4 years ago

status-firefox89: affected → wontfix

tracking-firefox89: + → ---

Kris Maglione [:kmag]

Updated

•

4 years ago

Flags: needinfo?(kmaglione+bmo)

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

Andrew McCreight [:mccr8]

Updated

•

2 years ago

Flags: needinfo?(dothayer)

Ryan VanderMeulen [:RyanVM]

Updated

•

2 years ago

status-firefox90: affected → wontfix

BugBot [:suhaib / :marco/ :calixte]

Comment 27

•

1 year ago

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: kmaglione+bmo → nobody

Andrew McCreight [:mccr8]

Comment 28

•

1 year ago

There aren't many recent crashes, and they look very different than the initial crash, and a patch has landed that should have helped with the initial issue, so let's close this.

Status: REOPENED → RESOLVED

Closed: 4 years ago → 1 year ago

Resolution: --- → WORKSFORME