Open Bug 1543447 Opened 6 months ago Updated 2 months ago

Multiple minutes to dump a stack from NS_ASSERTION in mochitests and reftests on OS X - Set MOZ_DISABLE_STACK_FIX=1 to disable stack fixer

Categories

(Core :: XPCOM, defect)

Unspecified
macOS
defect
Not set

Tracking

()

People

(Reporter: mccr8, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: leave-open)

Attachments

(2 files)

In bug 1525411, developers were experiencing multiple minute hangs on OS X when running reftests and mochitests because of a leak assertion during shutdown. I fixed the leaks, but the slow stack issue is still a problem. Bug 1529381 is open for switching from atos to llvm-symbolizer, but I'm not sure that's the main issue.

The weird thing is that this appears to only be a problem when running mochitests or reftests and not when using mach run.

I added NS_ASSERTION(false, "beep boop"); to nsGlobalWindowOuter's ctor.

If I run Firefox like this then everything is fine:
XPCOM_DEBUG_BREAK=stack ./mach run
the console is being spammed with a ton of symbolicated stacks like you'd expect, flying by as fast as the console can print them. (stack-and-abort also works, though it exits immediately, of course.)

However, if you run a Mochitest or a reftest, then you hit a hang. For instance:
./mach mochitest dom/manifest/test/browser_ManifestObtainer_obtain.js
./mach reftest layout/reftests/flexbox/flexbox-align-self-baseline-horiz-2.xhtml

The test runs just fine, but the stack does not print out for multiple minutes. There's a program CrashReporter that is taking over 100% CPU for the entire duration.

In IRC, mgaudet mentioned that he'd experienced a very similar issue:
https://www.mgaudet.ca/technical/2018/10/4/fixing-long-delays-when-a-program-crashes-while-running-in-the-vscode-terminal

Summary: Multiple minutes to dump a stack from NS_ASSERTION in mochitests and reftests → Multiple minutes to dump a stack from NS_ASSERTION in mochitests and reftests on OS X

I see this too, but all the time is spent running atos on my machine.

(In reply to Matt Woodrow (:mattwoodrow) from comment #1)

I see this too, but all the time is spent running atos on my machine.

I have disabled CrashReporter and also see atos is taking 100% the CPU on my machine. It finishes in about 20 minutes on MBP.

Easily reproducable with ./mach reftest layout/reftests/bugs/368155-negative-margins-1.html

Looking at the sample and open files of atos process, it seems that most of the time is spent opening/reading/closing x86_64-apple-darwin/debug/libgkrust.a. In the Python file calling atos 1, I saw references to bug 429963, with some discussion about slow buffering, maybe this is related?

I dug into this a bit more, since it bites me frequently.

The normal stacks that gecko outputs (with XPCOM_DEBUG_BREAK=stack) have demangled names, but no source file/line information (just the binary and offset).

When running one of the test harnesses, we use fix_macosx_stack.py to run atos, which discards the demangled name, and uses the binary+offset to get a new one, with source file information.

Running atos on XUL takes 15 minutes on my machine, all in startup time. We keep a single instance running, so only the first symbol takes that long, and it's close to instant after that.

I tried playing with llvm-symbolizer (see bug 1529381), but it doesn't seem to be able to get source file/line information, so running it doesn't achieve anything. It also has a 3.5 minute startup time, but due to a silly bug that we could get fixed if we wanted.

Running ./mach buildsymbols generates the breakpad symbol information and will cause us to use that information for symbolication, instead of atos. I tried to test this, but gave up after 20 minutes and 5TB of virtual memory usage (running buildsymbols).

The shortest path to success here might be just adding an env var to disable fix_macosx_stack.py, since instant stack traces without line numbers is likely more useful than waiting 15 minutes.

I also see endless opening and closing of libgkrust.a when atos is running, and this problem has got significantly worse over the last year, which suggests that something to do with rust compilation is emitting symbols that atos struggles with. We can't modify atos, but maybe we can look at our symbol generation to see if there's a more 'standard' way to do things?
Anthony, is this something your team could look at?

Flags: needinfo?(ajones)
Flags: needinfo?(ajones) → needinfo?(erahm)

Interesting. So in automation we should be using fix_stack_using_bpsyms.py which should sidestep this issue. Locally we'll use that too if you set BREAKPAD_SYMBOLS_PATH, but of course you'd need to actually have breakpad symbols for that to work. If you need an immediate work around my guess is doing something like the following should work:

BREAKPAD_SYMBOLS_PATH=/tmp ./mach reftest layout/reftests/bugs/368155-negative-margins-1.html

atos being trash is harder to fix, I don't have a problem with using llvm-symbolizer when available (we should have it in our ~/.mozbuild/clang dir from mach bootstrap). As I noted in Bug 1529381, comment 4 there's already a reasonably useful script that just tries everything (including llvm-symbolizer) that we could possibly fork.

Matt if you have more info about the "silly bug" we could certainly look into fixing that. It seems like we should be able to get the source and line info, but maybe it needs to have a dSYM file which can take a while to build.

Flags: needinfo?(erahm) → needinfo?(matt.woodrow)
Attached patch stack-fix.diffSplinter Review

Using a fake breakpad symbols path results in errors, and the printed stack ends up containing just the raw addresses, which is worse than the original input (demangled names + binary/offset).

I'm attaching a patch that I'm using locally that just disables the stack fixer entirely if you set an env var.

The silly bug is here: https://github.com/llvm-mirror/llvm/blob/bf3672857eab1e9dc38ed20f78129e691de646ff/lib/Object/SymbolSize.cpp#L81

That loops scans forward when it detects duplicate symbols to find the next unique address. Unfortunately it repeats the scan forward for every duplicate symbol, which results in N^2 behaviour, and sucks if you have a lot of duplicate symbols (which we apparently do).

Hoisting the NextI variable outside the loop such that we only scan forward once makes the runtime go from ~4.5minutes to 2 seconds for me. This may be a sign that our symbol data is just insane, unsure.

I haven't dug into why it doesn't have source/line information, since the env var workaround is generally sufficient for my needs.

Flags: needinfo?(matt.woodrow) → needinfo?(erahm)

Matt, for the time being I'm fine accepting your patch as-is, I'd also r+ something that turns off stack fixing for local osx builds until we can find a better solution. Our team is going to be working on a crash reporter rewrite over the next 2 quarters, we expect a cross-platform stack fixer to come out of that effort.

Flags: needinfo?(erahm)

How about we land the current patch, and leave the bug open so that it's still discoverable via ./mach busted?

We can close it properly once the longer term fix is landed.

Pushed by mwoodrow@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c0c38c9ed153
Add an env var to disable stack fixing scripts in mozrunner. r=erahm
Summary: Multiple minutes to dump a stack from NS_ASSERTION in mochitests and reftests on OS X → Multiple minutes to dump a stack from NS_ASSERTION in mochitests and reftests on OS X - Set MOZ_DISABLE_STACK_FIX=1 to disable stack fixer
You need to log in before you can comment on or make changes to this bug.