While it's true this is most likely followed by a crash, in some cases we don't get a decent stack from it, for example if this happens on tinderbox and causes a segfault there is no way to get a stack. Thus, I suggest we convert these assertions to MOZ_ASSERT so we abort and get a stack in all cases. An example of a tinderbox failure: ###!!! ASSERTION: You can't dereference a NULL nsCOMPtr with operator->().: 'mRawPtr != 0', file ../../../../../dist/include/nsCOMPtr.h, line 809 NS_DebugBreak+0x00000026 [/builds/slave/m-aurora-lnx-dbg/build/obj-firefox/dist/bin/libxpcom.so +0x0000203C] UNKNOWN [../../../../../dist/bin/test_IHistory +0x000025AA] UNKNOWN [../../../../../dist/bin/test_IHistory +0x00003A3B] UNKNOWN [../../../../../dist/bin/test_IHistory +0x00004F7F] UNKNOWN [../../../../../dist/bin/test_IHistory +0x0000211C] UNKNOWN [/builds/slave/m-aurora-lnx-dbg/build/obj-firefox/dist/bin/libxul.so +0x010F7C37] UNKNOWN [../../../../../dist/bin/test_IHistory +0x00008B2C] UNKNOWN [../../../../../dist/bin/test_IHistory +0x00006211] __libc_start_main+0x000000DC [/lib/libc.so.6 +0x00015DEC] ###!!! ASSERTION: You can't dereference a NULL nsCOMPtr with operator->().: 'mRawPtr != 0', file ../../../../../dist/include/nsCOMPtr.h, line 809 /bin/sh: line 1: 32579 Segmentation fault XPCOM_DEBUG_BREAK=stack-and-abort /builds/slave/m-aurora-lnx-dbg/build/obj-firefox/dist/bin/run-mozilla.sh ../../../../../dist/bin/$f make: *** [check] Error 139 make: Leaving directory `/builds/slave/m-aurora-lnx-dbg/build/obj-firefox/toolkit/components/places/tests/cpp' make: *** [check] Error 2 make: Leaving directory `/builds/slave/m-aurora-lnx-dbg/build/obj-firefox/toolkit/components/places/tests' make: *** [check] Error 2 make: Leaving directory `/builds/slave/m-aurora-lnx-dbg/build/obj-firefox/toolkit/components/places' make: *** [check] Error 2 make: Leaving directory `/builds/slave/m-aurora-lnx-dbg/build/obj-firefox/toolkit/components' make: *** [check] Error 2 make: Leaving directory `/builds/slave/m-aurora-lnx-dbg/build/obj-firefox/toolkit' make: *** [check] Error 2 program finished with exit code 2 elapsedTime=782.101013 TinderboxPrint: check<br/>16834/0 Unknown Error: command finished with exit code: 2
As a side note, we can't use MOZ_ASSERT here, due to bug 717540. So either I use NS_ABORT_IF_FALSE or I have to wait for that bug.
I don't understand how using MOZ_ASSERT will help: this test is already using stack-and-abort, which called mozalloc_abort. MOZ_ASSERT forwards to CrashInJS, which uses the same basic technique, so I don't see how we'd get any better stack... I think this is just an issue where binary test programs aren't hooked up to breakpad while running tests, so we don't get the good crashreport stacks, just whatever stacks get printed by the program itself.
well, we'd get a stack on the xpcom deref rather than later, plus the benefit of having an aborting assertion rather than a weak one, the error is pretty important to catch. Even if, according to what you say, looks like even aborting at the deref we'd not get a stack due to how binary tests are built? Would that be hard to fix?
I don't oppose the change, but it won't make this particular case any better. Each of the binary tests would have to enable the crash reporter and we'd have to run them in a test environment which is aware of and processes the resulting minidumps. For things that don't go through XRE_main that's a real PITA, and it's a pretty large chunk of work.
Created attachment 589512 [details] [diff] [review] patch I see, it's a pity we can't figure these failures, being unable to reproduce locally :( Btw, in case we want to take an abort_if_false patch, here it is.
Why aren't we getting good stacks out of these crashes? They're running on the build machines as part of "make check", so if they're hitting assertions they should get good stacks on debug builds. (The binaries should be unstripped, so NS_Stackwalk should work just fine.) Are we not hitting these on debug builds, only release builds, or what?
Sorry, I don't know the details of how our stackwalk works, though I can say that these failures happen really late in the shutdown process, much after xpcom-shutdown, so aborting earlier may help, or not.
just as a reference, the above has been copied from this log https://tbpl.mozilla.org/php/getParsedLog.php?id=8628442&tree=Mozilla-Aurora
Okay, after talking to glandium on IRC, I think the problem here is that NS_StackWalk just hasn't ever given us good symbols on Linux (because we're just using dladdr, and it doesn't read debug info). fix-linux-stack.pl exists to fix this very problem, so we should just fix our C++ unit tests to pipe their output through that. Filed bug 719120 on that.
Thank you, fixing that will be really useful.
This change makes my life harder :/ I'll have to add some fragile code to http://www.squarefree.com/2010/11/21/how-my-dom-fuzzer-ignores-known-bugs/ in order to ignore specific assertions with specific stack traces.
The scope was to make easier for people to detect null dereferences and to get their stacks from automated tests (once bug 719120 is fixed), though, if this makes things harder for the fuzzer we use to debug, we can backout. My assumption was that this kind of failure should be reported and fixed immediately.
Fuzzer bugs often don't get fixed immediately. The DOM fuzzer's ignore list currently has signatures for 32 crash bugs.
I don't understand the problem, Jesse: isn't this assertion *always* immediately followed by a crash? There's really no continuing from a nullptr deref.
The problem is that when the assertion is changed to an abort, the crash doesn't happen. My fuzzer scripts use the entire stack to decide whether a crash is a known bug, but only the assertion message to decide whether an assertion failure is a known bug.
That doesn't seem like the right way to do it, given that we have JS_ASSERTs at highly centralized pinch points, and a failure at such a pinch point may well alias more than one issue...