Closed Bug 1525411 Opened 8 months ago Closed 6 months ago

###!!! ASSERTION: 1 dynamic atom(s) with non-zero refcount: [System Principal]: 'nonZeroRefcountAtomsCount == 0', file /Users/kats/zspace/gecko-wr/xpcom/ds/nsAtomTable.cpp, line 445

Categories

(Core :: XPCOM, enhancement)

Other Branch
enhancement
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: kats, Assigned: mccr8)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(1 file)

I get this assertion every time I try to run reftests on a local debug macOS build.

###!!! ASSERTION: 1 dynamic atom(s) with non-zero refcount: [System Principal]: 'nonZeroRefcountAtomsCount == 0', file /Users/kats/zspace/gecko-wr/xpcom/ds/nsAtomTable.cpp, line 445

It causes the browser to crash on startup and I have to comment out the assertion to actually run the reftests.

This is an assertion related to a leak. It is strange that it is happening on startup. Maybe the test harness is running a XPCShell that is running then shutting down? I also don't understand why an NS_ASSERTION is fatal.

"on startup" was poorly worded. I really meant "before the reftests actually run". But the reftest harness does a bunch of stuff during that time including open and closing some windows so really it could be triggered by some of that.

Is there a way to debug this?

Also of note is that if I comment out the assertion, I hit the assertion in bug 1523944 which I have to comment out as well. Maybe they are related?

Do you have XPCOM_DEBUG_BREAK set to something in your environment? That could make the assertion fatal. So that part is weird.

But yeah, it is probably the same thing. I wouldn't be surprised if XPCWJS being held alive a long time would hit both assertions.

I'm going to mark this as a variant of bug 1523944, with some weird additional twist that XPCOM_DEBUG_BREAK is being set to fatal. But maybe that's normal for the reftest harness.

Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1523944

(In reply to Andrew McCreight [:mccr8] from comment #3)

Do you have XPCOM_DEBUG_BREAK set to something in your environment? That could make the assertion fatal. So that part is weird.

I think the fatal-ness of the problem might be related to lldb (I've been having issues with lldb which are interfering). When I run without lldb it just takes a really long time (many minutes) to emit a stack for the NS_ASSERTION. But yeah I'm fine with duping this.

erahm pointed out that runxpcshelltests.py includes this line which would explain the crash:
self.env["XPCOM_DEBUG_BREAK"] = "stack-and-abort"

This is still happening for me, even with the fix for bug 1523944. What can I do to debug this?

Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---

To be clear, I don't think this is a fatal error, but it just takes so long (many minutes) to dump the stack that it might as well be.

Taking multiple minutes to dump a stack seems like a bug by itself. Is it taking that long to dump a single stack, or are there a ton of stacks being dumped?

Hmm I guess "stack and abort" implies it only dumps a single stack.

It takes about 6 minutes to dump this single stack:

[Parent 91369, Main Thread] ###!!! ASSERTION: 1 dynamic atom(s) with non-zero refcount: [System Principal]: 'nonZeroRefcountAtomsCount == 0', file /Users/kats/zspace/gecko-wr/xpcom/ds/nsAtomTable.cpp, line 445
#01: nsAtomTable::GC(GCKind) (Mutex.h:168, in XUL)
#02: NS_ShutdownAtomTable() (nsAtomTable.cpp:484, in XUL)
#03: mozilla::ShutdownXPCOM(nsIServiceManager*) (XPCOMInit.cpp:779, in XUL)
#04: ScopedXPCOMStartup::~ScopedXPCOMStartup() (nsAppRunner.cpp:1314, in XUL)
#05: XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (mozalloc.h:151, in XUL)
#06: XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:4926, in XUL)
#07: main (nsBrowserApp.cpp:214, in firefox)

Running ps in a separate terminal shows that most of the time is spent in:

/Applications/Xcode.app/Contents/Developer/usr/bin/atos -arch x86_64 -o /Users/kats/zspace/gecko-wr/obj-host-debug/dist/NightlyDebug.app/Contents/MacOS/XUL

Glandium, do you have time to look at this? It seems bad that a single call to MozStackWalk could take 6 minutes.

Flags: needinfo?(mh+mozilla)

atos is just slow (at least on the first invocation; subsequent invocations--on the same binary?--seem to be much faster). Now that we've gone more all-in on clang, we could just use whatever the LLVM equivalent of atos is, assuming it has the same functionality...

There's llvm-symbolizer, which we use for ASan.

llvm-symbolizer does appear to be much faster. I hacked fix_macosx_stack.py to call that instead of atos and it returns results in seconds instead of minutes. Although the output format is different and I haven't done the work to parse that properly so it's giving me mostly garbage at the moment. But it does so quickly!

I can spin off another bug to track this switch. It won't actually fix the assertion I originally filed this bug about, although it will make it less painful for me.

fix_linux_stack.py uses addr2line, which is also horribly slow. (In fact, my experience with DMD stacks that addr2line is much slower than atos!) So if a faster clang alternative would work there too, that would be great.

Flags: needinfo?(mh+mozilla)

This is now occurring for me on a newly setup macOS build machine.

Trying to find the relevant place where XPCOM_DEBUG_BREAK is set to "stack-and-abort"...

Ignoring the distraction of the slow stack dump, how is one supposed to actually debug and fix the actual assertion? The [System Principal] string only appears a handful of places in the code, but it's not clear to me how it ends up in the dynamic atom table or why it has a non-zero refcount on shutdown.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #19)

Ignoring the distraction of the slow stack dump, how is one supposed to actually debug and fix the actual assertion? The [System Principal] string only appears a handful of places in the code, but it's not clear to me how it ends up in the dynamic atom table or why it has a non-zero refcount on shutdown.

Odds are this is just the symptom of a leak in a test suite where we ignore leaks.

You can see the rest of the stuff that is leaking if you run with:
XPCOM_MEM_BLOAT_LOG=1

I'm not a fan of the various "assert on leaks" that we have in the code base, because they cause issues like this.

I get the attached output at the end of the test run, which points to a bunch of XPCOM stuff getting leaked.

That's actually a lot less bad than I'd have thought. It could just be something like XPCShell keeping alive a backstage pass on the stack or something.

I also have this problem when running ./mach reftest layout/reftests/xxxx.html every time on mac debug build for some weeks ago. This is really annoying.

Also affected by this:

./mach mochitest dom/manifest/test/browser_ManifestObtainer_obtain.js

I'll try to figure out something here.

Assignee: nobody → continuation
Depends on: 1523944
Depends on: 1542854
Depends on: 1542944

Bug 1542944 and bug 1542854 should both fix the shutdown hangs, by avoiding the assertion. I tried the STR in comment 24 and it fixed that. I'm going to see if I can do something about stack walking being so slow.

Blocks: mach-busted

The immediate issue with not being able to run reftests or mochitests at all without a hang is fixed, so I'll close this and file another bug about the slow stacks.

Status: REOPENED → RESOLVED
Closed: 8 months ago6 months ago
Resolution: --- → FIXED
See Also: → 1543447
You need to log in before you can comment on or make changes to this bug.