Closed Bug 1606347 Opened 4 years ago Closed 6 months ago

Crash in [@ mozilla::DOMEventTargetHelper::Release]

Categories

(WebExtensions :: Request Handling, defect, P2)

Unspecified
All
defect

Tracking

(firefox-esr68 wontfix, firefox-esr78 wontfix, firefox72 wontfix, firefox73 wontfix, firefox74 wontfix, firefox87 wontfix, firefox88 wontfix, firefox89 wontfix, firefox90 wontfix)

RESOLVED WORKSFORME
Tracking Status
firefox-esr68 --- wontfix
firefox-esr78 --- wontfix
firefox72 --- wontfix
firefox73 --- wontfix
firefox74 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 --- wontfix
firefox90 --- wontfix

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-d811269a-12d0-4682-b1d1-040d80191220.

Seen while looking at nightly crash stats - Linux crashes started in 20191202220401 and were joined by some macOS crashes shortly thereafter: https://bit.ly/2ZMB2sF. 10 crashes/10 installs so far so rather low volume. There are some Linux and Windows crashes before 73 nightly with the same signature, but I am not sure if they are the same issue.

Possible nightly regression range based on build id: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8504d70d827261346737af1cbe9b96acf6756b6d&tochange=9420b5dc27e0cdf30a65b1b733400a970b3cb708

Top 9 frames of crashing thread:

0 XUL mozilla::DOMEventTargetHelper::Release dom/events/DOMEventTargetHelper.cpp:87
1 XUL mozilla::UniquePtr<mozilla::extensions:: toolkit/components/extensions/webrequest/ChannelWrapper.cpp:1144
2 libsystem_c.dylib __cxa_finalize_ranges 
3 libsystem_c.dylib exit 
4 XUL google_breakpad::ExceptionHandler::WaitForMessage toolkit/crashreporter/breakpad-client/mac/handler/exception_handler.cc:631
5 libsystem_pthread.dylib _pthread_body 
6 libsystem_pthread.dylib _pthread_start 
7 libsystem_pthread.dylib thread_start 
8 XUL XUL@0x41e737f 

ChannelWrapper just extends DOMEventTargetHelper

Component: DOM: Events → Request Handling
Product: Core → WebExtensions

Last crash was in 1-02 build and one crash in 73.0b1. Jim any ideas what might have caused this small volume regression?

Flags: needinfo?(jmathies)

From the crash data reports above, there seems to be a lot of different reasons for a crash on mozilla::DOMEventTargetHelper::Release. Looking at the potential regression range, I don't see anything that sticks out to me.

a) is there a way to limit the crash data report to those including ChannelWrapper?
b) probably will need someone more fluent in c++ than me to look into this, but if I can get a tighter report, I might be able to see something.

Flags: needinfo?(mozillamarcia.knous)

There is a small subset of crashes that have ChannelWrapper (https://bit.ly/303QosK), but given the small overall volume on release I don't think we should probably devote that much time to figure out what might be causing them. In terms of the nightly crashes, since I filed this there haven't been any crashes since 1-02. I thought maybe this was a nightly only regression, but it seems as Shane notes in Comment 3 there may be lots of causes for this crash.

Flags: needinfo?(mozillamarcia.knous)
Flags: needinfo?(jmathies)

The priority flag is not set for this bug.
:jimm, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)
Flags: needinfo?(jmathies)
Priority: -- → P2

Just bumped on this because our new WER-based crash reporting logic picked up this crash on Windows:

https://crash-stats.mozilla.org/report/index/d40073e9-5cec-4b11-9550-ee8500210408

Looking at the stacks on Linux and macOS this always seems to be happening when running exit handlers. This is a shortened stack on Windows:

0 mozilla::DOMEventTargetHelper::Release()
1 mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder, mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder> >::reset(mozilla::extensions::`anonymous namespace'::ChannelListHolder*)
...
4 execute_onexit_table
...
11 exit_or_terminate_process

And this is Linux:

0 mozilla::DOMEventTargetHelper::Release()
1 mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder, mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder> >::reset(mozilla::extensions::(anonymous namespace)::ChannelListHolder*)
2 __run_exit_handlers
3 exit

And macOS:

0 mozilla::DOMEventTargetHelper::Release()
1 mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder, mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder> >::~UniquePtr()
2 __cxa_finalize_ranges
3 exit

The work I've done in bug 1682507 seems to be capturing quite a few crashes on Windows under this signature that we were missing before. I've opened a few minidumps in Visual Studio to analyze the issue and it's probably more complicated than it appears. First of all this is the stack trace as reported by VS:

1.  xul.dll!mozilla::dom::Animation::Release() Line 42	C++
2.  [Inline Frame] xul.dll!mozilla::extensions::ChannelWrapper::Release() Line 1201	C++
3.  [Inline Frame] xul.dll!mozilla::RefPtrTraits<mozilla::extensions::ChannelWrapper>::Release(mozilla::extensions::ChannelWrapper * aPtr) Line 50	C++
4.  [Inline Frame] xul.dll!RefPtr<mozilla::extensions::ChannelWrapper>::ConstRemovingRefPtrTraits<mozilla::extensions::ChannelWrapper>::Release(mozilla::extensions::ChannelWrapper * aPtr) Line 381	C++
5.  [Inline Frame] xul.dll!RefPtr<mozilla::extensions::ChannelWrapper>::assign_assuming_AddRef(mozilla::extensions::ChannelWrapper * aNewPtr) Line 69	C++
6.  [Inline Frame] xul.dll!RefPtr<mozilla::extensions::ChannelWrapper>::operator=(void *) Line 168	C++
7.  [Inline Frame] xul.dll!mozilla::extensions::ChannelWrapper::Die() Line 137	C++
8.  [Inline Frame] xul.dll!mozilla::extensions::`anonymous namespace'::ChannelListHolder::~ChannelListHolder() Line 95	C++
9.  [Inline Frame] xul.dll!mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder>::operator()(mozilla::extensions::`anonymous namespace'::ChannelListHolder *) Line 463	C++
10. xul.dll!mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder,mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder>>::reset(mozilla::extensions::`anonymous namespace'::ChannelListHolder * aPtr) Line 307	C++
11. [External Code]

The crash reason is a NULL pointer access, but it's not in our code. What's happening is that somehow in the Release() call we're trying to access thread local storage and the _tls_end symbol is NULL. To understand what's going on a bit of background on how Windows implements TLS is needed: what's going on is that in every DLL that uses TLS variables those are stored in an area bound by two variables called _tls_start and _tls_end. When the DLL is loaded those symbols become available and are used at runtime by the TLS access code. This crash is happening very late during shutdown, somewhere in an atexit() callback from what I can tell. At that point it's likely that whatever DLL contains the TLS variables we're trying to access (probably XUL) is being unloaded - or has already been unloaded, so the _tls_start and _tls_end symbols aren't available anymore and the code trying to access them fails with the reported NULL pointer access.

This might prove tricky to fix: we probably have to avoid accessing TLS variables inside atexit() handlers. On the upside this crash only happens very, very late during shutdown so it shouldn't be too annoying for users.

OS: macOS → All

Change the status for beta to have the same as nightly and release.
For more information, please visit auto_nag documentation.

(In reply to Pascal Chevrel:pascalc from comment #9)

This bug is spiking on nightly 89 the last days

Note that the spike is "artificial" in that these crashes were already happening but we didn't catch them; they'd go straight to Microsoft health dashboards. What's happening is that we're now seeing their actual volume.

Thanks Gabriele, this is what I suspected. We should probably revisit the priority of this bug given that our users are crashing a lot more than we thought with this signature.

(In reply to Gabriele Svelto [:gsvelto] from comment #8)

The crash reason is a NULL pointer access, but it's not in our code. What's happening is that somehow in the Release() call we're trying to access thread local storage and the _tls_end symbol is NULL.

The addref and release methods of cycle collected classes call into TLS (to get the cycle collector object for the current thread). This could be the same underlying issue as the spike in crashes seen with the signature NS_CycleCollectorSuspect3, as that is the method that AddRef and Release call into.

Looking at the types, it seems like we must be destroying sChannelList. Bug 1495748 added some cleanup code that runs at XPCOMShutdown, but maybe that is failing to run for some reason, or maybe new things are being added to the list after shutdown has occurred. Maybe we could just not add things to sChannelList after XPCOMShutdown has occurred somehow? In fact, this whole mechanism seems like something added to make the leak checker happy, so maybe we could simply not use this list at all except in builds where we care about leaks.

See Also: → 1151643

If we do, we create a new instance after its ClearOnShutdown handler has
run, and it gets destroyed from a static destructor after the cycle collector
has shutdown.

Assignee: nobody → kmaglione+bmo
Status: NEW → ASSIGNED

(In reply to Andrew McCreight [:mccr8] from comment #13)

In fact, this whole mechanism seems like something added to make the leak checker happy, so maybe we could simply not use this list at all except in builds where we care about leaks.

I think we do need to run this even in cases where we don't care about leaks, since if we don't, the ChannelWrappers will still get destroyed possibly too late in shutdown to be safe in those cases. It won't matter when we do early exits, but when we don't, it could potentially lead to weird crashes.

I am tracking this bug for 89 as this has a high volume of crashes on nightly.

Pushed by maglione.k@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/16f37cbabda8
Don't create a new ChannelWrapper list late in shutdown. r=mccr8
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch

This grafts cleanly as far back as ESR78 also. Under the assumption that the frequency is likely being under-reported there due to the aforementioned WER spike, I'm thinking we probably want to uplift this fix there too.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)

This grafts cleanly as far back as ESR78 also. Under the assumption that the frequency is likely being under-reported there due to the aforementioned WER spike, I'm thinking we probably want to uplift this fix there too.

The PastShutdownPhase function doesn't exist before 83.

Well that's certainly a problem :). Guess we'll live with it until ESR91 then!

Comment on attachment 9216484 [details]
Bug 1606347: Don't create a new ChannelWrapper list late in shutdown. r=mccr8

Beta/Release Uplift Approval Request

  • User impact if declined: Shutdown crashes
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Adds a simple check so we don't reinstantiate things during shutdown.
  • String changes made/needed:
Attachment #9216484 - Flags: approval-mozilla-beta?

Doug, are you working on the fast shutdown stuff? Is it expected that we're running static destructors like this on Nightly?

Flags: needinfo?(dothayer)

Crash volume is unchanged in builds with the fix.

Status: RESOLVED → REOPENED
Flags: needinfo?(kmaglione+bmo)
Resolution: FIXED → ---
Target Milestone: 90 Branch → ---

This issue is old enough that it isn't really a regression at this point. The surge on crash-stats is just due to increased reporting.

Maybe we need to move ahead with a solution that makes the cycle collector suspect not crash when called after CC shutdown. It would be good if it still continued to crash when called from a thread where the CC was never started (before shutdown).

Keywords: regression

Comment on attachment 9216484 [details]
Bug 1606347: Don't create a new ChannelWrapper list late in shutdown. r=mccr8

No change on nightly crashes so no need to uplift that patch at the moment, thanks

Attachment #9216484 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Flags: needinfo?(kmaglione+bmo)
Severity: normal → S3
Flags: needinfo?(dothayer)

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: kmaglione+bmo → nobody

There aren't many recent crashes, and they look very different than the initial crash, and a patch has landed that should have helped with the initial issue, so let's close this.

Status: REOPENED → RESOLVED
Closed: 3 years ago6 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: