Closed Bug 1543447 Opened 1 year ago Closed 3 months ago

Multiple minutes to dump a stack from NS_ASSERTION in mochitests and reftests on OS X - Set MOZ_DISABLE_STACK_FIX=1 to disable stack fixer

Categories

(Core :: XPCOM, defect)

Unspecified
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: mccr8, Assigned: mattwoodrow)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

In bug 1525411, developers were experiencing multiple minute hangs on OS X when running reftests and mochitests because of a leak assertion during shutdown. I fixed the leaks, but the slow stack issue is still a problem. Bug 1529381 is open for switching from atos to llvm-symbolizer, but I'm not sure that's the main issue.

The weird thing is that this appears to only be a problem when running mochitests or reftests and not when using mach run.

I added NS_ASSERTION(false, "beep boop"); to nsGlobalWindowOuter's ctor.

If I run Firefox like this then everything is fine:
XPCOM_DEBUG_BREAK=stack ./mach run
the console is being spammed with a ton of symbolicated stacks like you'd expect, flying by as fast as the console can print them. (stack-and-abort also works, though it exits immediately, of course.)

However, if you run a Mochitest or a reftest, then you hit a hang. For instance:
./mach mochitest dom/manifest/test/browser_ManifestObtainer_obtain.js
./mach reftest layout/reftests/flexbox/flexbox-align-self-baseline-horiz-2.xhtml

The test runs just fine, but the stack does not print out for multiple minutes. There's a program CrashReporter that is taking over 100% CPU for the entire duration.

In IRC, mgaudet mentioned that he'd experienced a very similar issue:
https://www.mgaudet.ca/technical/2018/10/4/fixing-long-delays-when-a-program-crashes-while-running-in-the-vscode-terminal

Summary: Multiple minutes to dump a stack from NS_ASSERTION in mochitests and reftests → Multiple minutes to dump a stack from NS_ASSERTION in mochitests and reftests on OS X

I see this too, but all the time is spent running atos on my machine.

(In reply to Matt Woodrow (:mattwoodrow) from comment #1)

I see this too, but all the time is spent running atos on my machine.

I have disabled CrashReporter and also see atos is taking 100% the CPU on my machine. It finishes in about 20 minutes on MBP.

Easily reproducable with ./mach reftest layout/reftests/bugs/368155-negative-margins-1.html

Looking at the sample and open files of atos process, it seems that most of the time is spent opening/reading/closing x86_64-apple-darwin/debug/libgkrust.a. In the Python file calling atos 1, I saw references to bug 429963, with some discussion about slow buffering, maybe this is related?

I dug into this a bit more, since it bites me frequently.

The normal stacks that gecko outputs (with XPCOM_DEBUG_BREAK=stack) have demangled names, but no source file/line information (just the binary and offset).

When running one of the test harnesses, we use fix_macosx_stack.py to run atos, which discards the demangled name, and uses the binary+offset to get a new one, with source file information.

Running atos on XUL takes 15 minutes on my machine, all in startup time. We keep a single instance running, so only the first symbol takes that long, and it's close to instant after that.

I tried playing with llvm-symbolizer (see bug 1529381), but it doesn't seem to be able to get source file/line information, so running it doesn't achieve anything. It also has a 3.5 minute startup time, but due to a silly bug that we could get fixed if we wanted.

Running ./mach buildsymbols generates the breakpad symbol information and will cause us to use that information for symbolication, instead of atos. I tried to test this, but gave up after 20 minutes and 5TB of virtual memory usage (running buildsymbols).

The shortest path to success here might be just adding an env var to disable fix_macosx_stack.py, since instant stack traces without line numbers is likely more useful than waiting 15 minutes.

I also see endless opening and closing of libgkrust.a when atos is running, and this problem has got significantly worse over the last year, which suggests that something to do with rust compilation is emitting symbols that atos struggles with. We can't modify atos, but maybe we can look at our symbol generation to see if there's a more 'standard' way to do things?
Anthony, is this something your team could look at?

Flags: needinfo?(ajones)
Flags: needinfo?(ajones) → needinfo?(erahm)

Interesting. So in automation we should be using fix_stack_using_bpsyms.py which should sidestep this issue. Locally we'll use that too if you set BREAKPAD_SYMBOLS_PATH, but of course you'd need to actually have breakpad symbols for that to work. If you need an immediate work around my guess is doing something like the following should work:

BREAKPAD_SYMBOLS_PATH=/tmp ./mach reftest layout/reftests/bugs/368155-negative-margins-1.html

atos being trash is harder to fix, I don't have a problem with using llvm-symbolizer when available (we should have it in our ~/.mozbuild/clang dir from mach bootstrap). As I noted in Bug 1529381, comment 4 there's already a reasonably useful script that just tries everything (including llvm-symbolizer) that we could possibly fork.

Matt if you have more info about the "silly bug" we could certainly look into fixing that. It seems like we should be able to get the source and line info, but maybe it needs to have a dSYM file which can take a while to build.

Flags: needinfo?(erahm) → needinfo?(matt.woodrow)
Attached patch stack-fix.diffSplinter Review

Using a fake breakpad symbols path results in errors, and the printed stack ends up containing just the raw addresses, which is worse than the original input (demangled names + binary/offset).

I'm attaching a patch that I'm using locally that just disables the stack fixer entirely if you set an env var.

The silly bug is here: https://github.com/llvm-mirror/llvm/blob/bf3672857eab1e9dc38ed20f78129e691de646ff/lib/Object/SymbolSize.cpp#L81

That loops scans forward when it detects duplicate symbols to find the next unique address. Unfortunately it repeats the scan forward for every duplicate symbol, which results in N^2 behaviour, and sucks if you have a lot of duplicate symbols (which we apparently do).

Hoisting the NextI variable outside the loop such that we only scan forward once makes the runtime go from ~4.5minutes to 2 seconds for me. This may be a sign that our symbol data is just insane, unsure.

I haven't dug into why it doesn't have source/line information, since the env var workaround is generally sufficient for my needs.

Flags: needinfo?(matt.woodrow) → needinfo?(erahm)

Matt, for the time being I'm fine accepting your patch as-is, I'd also r+ something that turns off stack fixing for local osx builds until we can find a better solution. Our team is going to be working on a crash reporter rewrite over the next 2 quarters, we expect a cross-platform stack fixer to come out of that effort.

Flags: needinfo?(erahm)

How about we land the current patch, and leave the bug open so that it's still discoverable via ./mach busted?

We can close it properly once the longer term fix is landed.

Keywords: leave-open
Pushed by mwoodrow@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c0c38c9ed153
Add an env var to disable stack fixing scripts in mozrunner. r=erahm
Summary: Multiple minutes to dump a stack from NS_ASSERTION in mochitests and reftests on OS X → Multiple minutes to dump a stack from NS_ASSERTION in mochitests and reftests on OS X - Set MOZ_DISABLE_STACK_FIX=1 to disable stack fixer
Depends on: 1596292
Duplicate of this bug: 1599583

Morgan ran into this after an update to Catalina, too, like Mossop in the dupe.

Do we know what in the Catalina update might be triggering this suddenly becoming much worse, and/or can we consider turning this on by default on mac? It's disorientating for people and basically breaks tests at random moments on debug builds, for 5-15 minutes at a time.

Flags: needinfo?(erahm)

(In reply to :Gijs (he/him) from comment #13)

Morgan ran into this after an update to Catalina, too, like Mossop in the dupe.

Do we know what in the Catalina update might be triggering this suddenly becoming much worse

Probably a new version of atos or libunwind.

can we consider turning this on by default on mac? It's disorientating for people and basically breaks tests at random moments on debug builds, for 5-15 minutes at a time.

njn just landed mac support to our new rust-based stack fixing tool, ideally we'll have something a replacement in tree by the end of 2019. There are potential interim solutions, but I think our efforts are best served focusing on a replacement.

If someone wants to pick this up a few options are:

  1. Switch stack fixing to use breakpad symbols in automation (I'm not sure if this will help with try), this sidesteps using atos which is slow
  2. Delay stack fixing to a post-processing step so the test doesn't timeout and fail
  3. ...and the simplest: Make MOZ_DISABLE_STACK_FIX=1 the default for local builds

I'd r+ 3 for now if folks are happy with having potentially useless stacks (IIRC the stacks are often actually okay w/o stack fixing on mac debug builds).

Flags: needinfo?(erahm)

I also get really poor performance from other system tools that try to get symbols from libxul, especially Instruments. That makes it rather annoying to do performance profiling (for issues where the gecko profiler is less useful).

Writing our own tools that don't have this problem won't help for cases like this :(

(In reply to Matt Woodrow (:mattwoodrow) from comment #15)

I also get really poor performance from other system tools that try to get symbols from libxul, especially Instruments. That makes it rather annoying to do performance profiling (for issues where the gecko profiler is less useful).

Writing our own tools that don't have this problem won't help for cases like this :(

Do we have contacts at Apple we can flag this kind of thing up to, or checks we can run on our symbols / paths / whatever to figure out if we're doing something we're not supposed to that could explain the slowness? I guess we can't very well instrument Instruments...

Flags: needinfo?(spohl.mozilla.bugs)

We have contacts, but we would need to have this type of problem analyzed and distilled in great detail before it would make sense to reach out to them, including easy steps to reproduce or some type of sample application that would trigger this behavior.

Flags: needinfo?(spohl.mozilla.bugs)

(In reply to :Gijs (he/him) from comment #16)

figure out if we're doing something we're not supposed to that could explain the slowness? I guess we can't very well instrument Instruments...

A couple notes:

  • There's a rust bug about including way too much debug info (like gigabytes) that might be related to poor performance (that's just a hunch). Basically the more crate dependencies we have the worse this will get.
  • Apple seems to be using a private Symbolication framework for atos, presumably that's what's changed over time.

(In reply to Stephen A Pohl [:spohl] from comment #17)

We have contacts, but we would need to have this type of problem analyzed and distilled in great detail before it would make sense to reach out to them, including easy steps to reproduce or some type of sample application that would trigger this behavior.

It's pretty basic, just run atos on libxul.

(In reply to Eric Rahm [:erahm] from comment #18)

(In reply to Stephen A Pohl [:spohl] from comment #17)

We have contacts, but we would need to have this type of problem analyzed and distilled in great detail before it would make sense to reach out to them, including easy steps to reproduce or some type of sample application that would trigger this behavior.

It's pretty basic, just run atos on libxul.

I meant: Is running this on 10.15 sufficient to reproduce the issue? Does it only occur after an update to Catalina (comment 13)? Does it happen on all systems that have been updated, or are there any other factors that need to be in place before this reproduces? Basically, once we have someone at Apple look into this we want to be sure that they can reproduce the issue.

(In reply to Stephen A Pohl [:spohl] from comment #19)

(In reply to Eric Rahm [:erahm] from comment #18)

(In reply to Stephen A Pohl [:spohl] from comment #17)

We have contacts, but we would need to have this type of problem analyzed and distilled in great detail before it would make sense to reach out to them, including easy steps to reproduce or some type of sample application that would trigger this behavior.

It's pretty basic, just run atos on libxul.

I meant: Is running this on 10.15 sufficient to reproduce the issue? Does it only occur after an update to Catalina (comment 13)? Does it happen on all systems that have been updated, or are there any other factors that need to be in place before this reproduces? Basically, once we have someone at Apple look into this we want to be sure that they can reproduce the issue.

It's been bad for a long time (this was filed 8 months ago, a related bug 10 months ago, I looked at perf a few years ago as well), I don't think upgrading to 10.15 is required. It just might be that the tools are worse now.

Assignee: nobody → gijskruitbosch+bugs
Assignee: gijskruitbosch+bugs → nobody
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/72fe5f25acc3
disable stack walking on mac for non-automation uses of mozrunner because of perf issues, r=erahm
Depends on: 1602717
No longer depends on: 1596292

Fixed by bug 1602717.

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Assignee: nobody → matt.woodrow
You need to log in before you can comment on or make changes to this bug.