Closed Bug 1602717 Opened 4 years ago Closed 4 years ago

Re-enable stackwalking on local macOS builds when rust-based stackwalking is available

Categories

(Core :: XPCOM, task)

task
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla76
Tracking Status
firefox73 --- wontfix
firefox76 --- fixed

People

(Reporter: Gijs, Assigned: n.nethercote)

References

Details

Attachments

(1 file)

As per summary - in bug 1543447 we're landing a change to disable stackwalking on local macOS usage because of performance issues. This bug is here to re-enable it when we've fixed the perf issues by switching to rust.

Depends on: 1619837
No longer depends on: 1596292

Now that fix_stacks.py is being used instead of fix_macosx_stack.py,
stack-fixing time has dropped from about 14 minutes to about 30 seconds on my
new MacBook Pro.

Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED

Now that fix_stacks.py is being used instead of fix_macosx_stack.py,
stack-fixing time has dropped from about 14 minutes to about 30 seconds on my
new MacBook Pro.

This can't land until bug 1619837 does, but while we're waiting: does this seem like a reasonable length of time to wait for a stack trace? I'm needinfo'ing various people who hit this problem in or around bug 1543447.

Flags: needinfo?(mikokm)
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(dtownsend)
Flags: needinfo?(continuation)

I also wonder if we still need the MOZ_DISABLE_STACK_FIX environment variable that Matt added in bug 1543447.

(In reply to Nicholas Nethercote [:njn] from comment #2)

Now that fix_stacks.py is being used instead of fix_macosx_stack.py,
stack-fixing time has dropped from about 14 minutes to about 30 seconds on my
new MacBook Pro.

This can't land until bug 1619837 does, but while we're waiting: does this seem like a reasonable length of time to wait for a stack trace? I'm needinfo'ing various people who hit this problem in or around bug 1543447.

30s is better, but it'd be helpful if the system could immediately output something that indicated that that was what was happening for the next 30-odd seconds. I also imagine that the 30s is a number on some types of machines, and that number will differ on slower / different machines...

(In reply to Nicholas Nethercote [:njn] from comment #3)

I also wonder if we still need the MOZ_DISABLE_STACK_FIX environment variable that Matt added in bug 1543447.

I think we should keep it (perhaps renaming it if we want to opt people back in to having them by default) to make it possible to not have this happen locally when e.g. running tests, given that when debugging test failures, timing can be very important. In any case, if the assertions for which we gather stacks are unrelated to what you're doing, you hit a few of them for every run, and you're trying to run verify, now a 1-2 minute task has become a 5-10 minute task, and that'd still be annoying.

Flags: needinfo?(gijskruitbosch+bugs)

I also imagine that the 30s is a number on some types of machines, and that number will differ on slower / different machines...

That's a safe bet.

you hit a few of them for every run

Fortunately the ~30 second cost should be a one-time hit for the first stack trace, and later stack traces will be almost instantaneous, because a single fix-stacks process will be used for the entire test run.

(In reply to Nicholas Nethercote [:njn] from comment #2)

Now that fix_stacks.py is being used instead of fix_macosx_stack.py,
stack-fixing time has dropped from about 14 minutes to about 30 seconds on my
new MacBook Pro.

This can't land until bug 1619837 does, but while we're waiting: does this seem like a reasonable length of time to wait for a stack trace? I'm needinfo'ing various people who hit this problem in or around bug 1543447.

30s is still a surprising amount of time to see no output on a test run, but I think if we can do what Gijs suggested (output a message saying that a stack is being processed and there will be a short delay) then I think at least for me I'd be happy with having this on as a default.

(In reply to Nicholas Nethercote [:njn] from comment #5)

Fortunately the ~30 second cost should be a one-time hit for the first stack trace, and later stack traces will be almost instantaneous, because a single fix-stacks process will be used for the entire test run.

Is there any way to pre-cache the data that fix-stacks needs?

Flags: needinfo?(dtownsend)

Pre-caching is difficult. fix-stacks needs to slurp up a lot of debuginfo, and it does this on a binary-by-binary basis. If your stack trace involves libxul, there will be a wait, because libxul is enormous. If your stack trace only involves other libraries it'll be much faster.) Also, on Linux and Windows all the debuginfo for each binary is in a single file (on Linux it's typically in the binary itself, on Windows it's in a separate PDB file), but on Mac the debuginfo is spread across all the .o files that fed into the binary. So there's just a lot of disk access, which is why it takes 10s of seconds, and Mac tends to be the slowest platform.

We could package that data into a more convenient form ahead of time, but it would still be a large amount of data and so the start-up time would likely be better but still significant. Also, the cost of the packaging step would presumably be added to build time, which would hurt every build. Or we'd have a separate packaging step that you'd have to run before running tests, which would be annoying. (The standard Mac approach is to do exactly this kind of packaging via dsymutil, but unfortunately it is extremely slow, which is why we read directly from .o files.)

So in short, the current approach of doing things on demand seems like the least worst option. But a warning message should be easy and will help a lot.

30 seconds sounds reasonable, if there's a message, though I do worry that this is the sort of thing that could increase further over time.

I'm also worried about platform tools (like Instruments) that can't use fix-stacks, and currently hang for 15+ minutes when used. My initial testing in bug 1543447 suggested that the slowness was specific to libgkrust.a (which also explains why this has regressed so much over the last few years). Has there been any investigation as to why that is, and if it's possible to output symbols from rust in a way that isn't as slow?

It'd also be really nice to have a mozconfig/mozbuild option to run a persistent fix-stack process (spawned when a build completes, and prodded to pre-cache XUL symbols), so that it'd be ready for test runs.

Flags: needinfo?(matt.woodrow)

(In reply to Matt Woodrow (:mattwoodrow) from comment #8)

It'd also be really nice to have a mozconfig/mozbuild option to run a persistent fix-stack process (spawned when a build completes, and prodded to pre-cache XUL symbols), so that it'd be ready for test runs.

Yeah this would be great. I've frequently run the same test multiple times without a rebuild in-between and so every time I'd see the slow first run to figure out the symbols

30 seconds seems fine, though I'd still like for a way to disable it. I guess leaving the existing env var does have the disadvantage that anybody who did so will continue to have broken stacks until they change something.

Flags: needinfo?(continuation)

(In reply to Andrew McCreight [:mccr8] from comment #10)

30 seconds seems fine, though I'd still like for a way to disable it. I guess leaving the existing env var does have the disadvantage that anybody who did so will continue to have broken stacks until they change something.

To that end it would be nice to also add a message when this is disabled telling folks how to re-enable it.

Flags: needinfo?(mikokm)

The new patch prints both requested warnings: one if you have MOZ_DISABLE_STACK_FIX set, and one when the first stack frame gets fixed. Other suggestions (fixing Instruments, magic pre-caching, persistent fix-stacks processes) are out of scope for this bug.

(In reply to Nicholas Nethercote [:njn] from comment #12)

The new patch prints both requested warnings: one if you have MOZ_DISABLE_STACK_FIX set, and one when the first stack frame gets fixed. Other suggestions (fixing Instruments, magic pre-caching, persistent fix-stacks processes) are out of scope for this bug.

This is awesome, thanks!

I've frequently run the same test multiple times without a rebuild in-between and so every time I'd see the slow first run to figure out the symbols

If you know in advance that you'll be running the same test multiple times, you can do ./mach buildsymbols first. It takes a while to run, but stack-fixing using breakpad symbols is much faster on my Mac than native stack fixing.

Pushed by nnethercote@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9d6c366e7cde
Re-enable stackwalking on local macOS builds. r=erahm
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla76
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: