1525411 - ###!!! ASSERTION: 1 dynamic atom(s) with non-zero refcount: [System Principal]: 'nonZeroRefcountAtomsCount == 0', file /Users/kats/zspace/gecko-wr/xpcom/ds/nsAtomTable.cpp, line 445

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Description

•

7 years ago

I get this assertion every time I try to run reftests on a local debug macOS build.

###!!! ASSERTION: 1 dynamic atom(s) with non-zero refcount: [System Principal]: 'nonZeroRefcountAtomsCount == 0', file /Users/kats/zspace/gecko-wr/xpcom/ds/nsAtomTable.cpp, line 445

It causes the browser to crash on startup and I have to comment out the assertion to actually run the reftests.

Andrew McCreight [:mccr8]

Assignee

Comment 1

•

7 years ago

This is an assertion related to a leak. It is strange that it is happening on startup. Maybe the test harness is running a XPCShell that is running then shutting down? I also don't understand why an NS_ASSERTION is fatal.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Comment 2

•

7 years ago

"on startup" was poorly worded. I really meant "before the reftests actually run". But the reftest harness does a bunch of stuff during that time including open and closing some windows so really it could be triggered by some of that.

Is there a way to debug this?

Also of note is that if I comment out the assertion, I hit the assertion in bug 1523944 which I have to comment out as well. Maybe they are related?

Andrew McCreight [:mccr8]

Assignee

Comment 3

•

7 years ago

Do you have XPCOM_DEBUG_BREAK set to something in your environment? That could make the assertion fatal. So that part is weird.

But yeah, it is probably the same thing. I wouldn't be surprised if XPCWJS being held alive a long time would hit both assertions.

Andrew McCreight [:mccr8]

Assignee

Comment 4

•

7 years ago

I'm going to mark this as a variant of bug 1523944, with some weird additional twist that XPCOM_DEBUG_BREAK is being set to fatal. But maybe that's normal for the reftest harness.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → DUPLICATE

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Comment 5

•

7 years ago

(In reply to Andrew McCreight [:mccr8] from comment #3)

Do you have XPCOM_DEBUG_BREAK set to something in your environment? That could make the assertion fatal. So that part is weird.

I think the fatal-ness of the problem might be related to lldb (I've been having issues with lldb which are interfering). When I run without lldb it just takes a really long time (many minutes) to emit a stack for the NS_ASSERTION. But yeah I'm fine with duping this.

Andrew McCreight [:mccr8]

Assignee

Comment 6

•

7 years ago

erahm pointed out that runxpcshelltests.py includes this line which would explain the crash:
self.env["XPCOM_DEBUG_BREAK"] = "stack-and-abort"

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Comment 7

•

7 years ago

This is still happening for me, even with the fix for bug 1523944. What can I do to debug this?

Status: RESOLVED → REOPENED

Resolution: DUPLICATE → ---

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Comment 8

•

7 years ago

To be clear, I don't think this is a fatal error, but it just takes so long (many minutes) to dump the stack that it might as well be.

Andrew McCreight [:mccr8]

Assignee

Comment 9

•

7 years ago

Taking multiple minutes to dump a stack seems like a bug by itself. Is it taking that long to dump a single stack, or are there a ton of stacks being dumped?

Andrew McCreight [:mccr8]

Assignee

Comment 10

•

7 years ago

Hmm I guess "stack and abort" implies it only dumps a single stack.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Comment 11

•

7 years ago

It takes about 6 minutes to dump this single stack:

[Parent 91369, Main Thread] ###!!! ASSERTION: 1 dynamic atom(s) with non-zero refcount: [System Principal]: 'nonZeroRefcountAtomsCount == 0', file /Users/kats/zspace/gecko-wr/xpcom/ds/nsAtomTable.cpp, line 445
#01: nsAtomTable::GC(GCKind) (Mutex.h:168, in XUL)
#02: NS_ShutdownAtomTable() (nsAtomTable.cpp:484, in XUL)
#03: mozilla::ShutdownXPCOM(nsIServiceManager*) (XPCOMInit.cpp:779, in XUL)
#04: ScopedXPCOMStartup::~ScopedXPCOMStartup() (nsAppRunner.cpp:1314, in XUL)
#05: XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (mozalloc.h:151, in XUL)
#06: XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:4926, in XUL)
#07: main (nsBrowserApp.cpp:214, in firefox)

Running ps in a separate terminal shows that most of the time is spent in:

/Applications/Xcode.app/Contents/Developer/usr/bin/atos -arch x86_64 -o /Users/kats/zspace/gecko-wr/obj-host-debug/dist/NightlyDebug.app/Contents/MacOS/XUL

Andrew McCreight [:mccr8]

Assignee

Comment 12

•

7 years ago

Glandium, do you have time to look at this? It seems bad that a single call to MozStackWalk could take 6 minutes.

Andrew McCreight [:mccr8]

Assignee

Updated

•

7 years ago

Flags: needinfo?(mh+mozilla)

Nathan Froyd [:froydnj]

Comment 13

•

7 years ago

atos is just slow (at least on the first invocation; subsequent invocations--on the same binary?--seem to be much faster). Now that we've gone more all-in on clang, we could just use whatever the LLVM equivalent of atos is, assuming it has the same functionality...

Andrew McCreight [:mccr8]

Assignee

Comment 14

•

7 years ago

There's llvm-symbolizer, which we use for ASan.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Comment 15

•

7 years ago

llvm-symbolizer does appear to be much faster. I hacked fix_macosx_stack.py to call that instead of atos and it returns results in seconds instead of minutes. Although the output format is different and I haven't done the work to parse that properly so it's giving me mostly garbage at the moment. But it does so quickly!

I can spin off another bug to track this switch. It won't actually fix the assertion I originally filed this bug about, although it will make it less painful for me.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Updated

•

7 years ago

Depends on: 1529381

Nicholas Nethercote [inactive]

Comment 16

•

7 years ago

fix_linux_stack.py uses addr2line, which is also horribly slow. (In fact, my experience with DMD stacks that addr2line is much slower than atos!) So if a faster clang alternative would work there too, that would be great.

Mike Hommey [:glandium]

Updated

•

7 years ago

Flags: needinfo?(mh+mozilla)

Brad Werth [:bradwerth]

Comment 17

•

7 years ago

This is now occurring for me on a newly setup macOS build machine.

Brad Werth [:bradwerth]

Comment 18

•

7 years ago

•

Edited

Trying to find the relevant place where XPCOM_DEBUG_BREAK is set to "stack-and-abort"...

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Comment 19

•

7 years ago

Ignoring the distraction of the slow stack dump, how is one supposed to actually debug and fix the actual assertion? The [System Principal] string only appears a handful of places in the code, but it's not clear to me how it ends up in the dynamic atom table or why it has a non-zero refcount on shutdown.

Andrew McCreight [:mccr8]

Assignee

Comment 20

•

7 years ago

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #19)

Ignoring the distraction of the slow stack dump, how is one supposed to actually debug and fix the actual assertion? The [System Principal] string only appears a handful of places in the code, but it's not clear to me how it ends up in the dynamic atom table or why it has a non-zero refcount on shutdown.

Odds are this is just the symptom of a leak in a test suite where we ignore leaks.

You can see the rest of the stuff that is leaking if you run with:
XPCOM_MEM_BLOAT_LOG=1

I'm not a fan of the various "assert on leaks" that we have in the code base, because they cause issues like this.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Comment 21

•

7 years ago

Attached file Tail of mochitest output with XPCOM_MEM_BLOAT_LOG=1 — Details

I get the attached output at the end of the test run, which points to a bunch of XPCOM stuff getting leaked.

Andrew McCreight [:mccr8]

Assignee

Comment 22

•

7 years ago

That's actually a lot less bad than I'd have thought. It could just be something like XPCShell keeping alive a backstage pass on the stack or something.

Boris Chiou [:boris]

Comment 23

•

7 years ago

I also have this problem when running ./mach reftest layout/reftests/xxxx.html every time on mac debug build for some weeks ago. This is really annoying.

Marcos Caceres [:marcosc]

Comment 24

•

7 years ago

Also affected by this:

./mach mochitest dom/manifest/test/browser_ManifestObtainer_obtain.js

Andrew McCreight [:mccr8]

Assignee

Comment 25

•

7 years ago

I'll try to figure out something here.

Assignee: nobody → continuation

Depends on: 1523944

Andrew McCreight [:mccr8]

Assignee

Updated

•

7 years ago

Depends on: 1542854

Andrew McCreight [:mccr8]

Assignee

Updated

•

7 years ago

Depends on: 1542944

Andrew McCreight [:mccr8]

Assignee

Comment 26

•

7 years ago

Bug 1542944 and bug 1542854 should both fix the shutdown hangs, by avoiding the assertion. I tried the STR in comment 24 and it fixed that. I'm going to see if I can do something about stack walking being so slow.

Andrew McCreight [:mccr8]

Assignee

Updated

•

7 years ago

Blocks: mach-busted

Andrew McCreight [:mccr8]

Assignee

Comment 27

•

7 years ago

The immediate issue with not being able to run reftests or mochitests at all without a hang is fixed, so I'll close this and file another bug about the slow stacks.

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

Resolution: --- → FIXED

Andrew McCreight [:mccr8]

Assignee

Updated

•

7 years ago