Closed Bug 1743983 Opened 4 years ago Closed 11 months ago

Rewrite the minidump-analyzer in Rust

Categories

(Toolkit :: Crash Reporting, task)

task

Tracking

()

RESOLVED FIXED
132 Branch
Tracking Status
firefox132 --- fixed

People

(Reporter: gsvelto, Assigned: afranchuk)

References

(Blocks 2 open bugs)

Details

Attachments

(8 files, 2 obsolete files)

169.19 KB, patch
Details | Diff | Splinter Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

As per title. The current minidump-analyzer takes the JSON annotations file as generated in a Firefox crash report plus the minidump and creates a stack trace that will be added to the JSON annotations. While doing so it also retrieves module signatures and adds them to the annotations.

We only ever implemented reading unwinding information on x86-64/Windows in bug 1333126, for all the other platforms we only used frame pointers or stack scanning.

A rewrite could be done in several steps:

  • Build a version that only uses stack scanning & frame pointers but which is otherwise compatible with the existing implementation
  • Add unwinding support for Windows (by using the symbolic-debuginfo crate to read the unwinding information from PE files)
  • Add unwinding support macOS (likewise but for Mach-O ones)
  • Add unwinding support for Linux/Android. This is trickier as I believe that symbolic-debuginfo has no way of surfacing this data - yet - so we might have to gimli/goblin directly to parse the data
Assignee: nobody → afranchuk

Once the stackwalking functionality is split out of the processor we can move forward, as we don't want to audit/vendor the huge dependencies of minidump-processor.

This needs the minidump-walk-stack crate to be vendored. It builds when minidump and
minidump-walk-stack are changed to paths in the split-processor-crate branch of rust-minidump
(see https://github.com/rust-minidump/rust-minidump/pull/825).

I'm now just waiting on a new release of rust-minidump. Unfortunately the way vendoring works, I can't make progress or do it ahead of time without the crates being in crates.io (I tried directing them to both git urls and local paths as a means of working on it prior to publishing, but that seems to cause the packages to not be pulled in by vendoring properly).

There's a bit more work that needs to be done. Steven Fackler did some testing and noticed that the logic that finds symbols has some issues. We also don't demangle symbols when using local debug information. Neither of these issues would preclude us from using the existing code for minidump-analyze because we won't be symbolicating the stacks on the user machine.

However the symbol information also contains the inlining information and with Steven's patch applied I get incoherent stacks. The latter needs to be fixed before we roll out the new minidump-analyzer as we need consistent stacks with what we get on Socorro.

I've just published version 0.17.0 of the rust-minidump crate including the new minidump-unwind crate which we need here.

This is almost there. When vendoring dependencies, some files hit the size limit. We've reached out to gimli developers wrt the fixtures in that crate (which don't need to be vendored). Some of the other large files are necessary so we'll have to accept them.

:glandium brought up slight concern with vendoring symbolic since we hope to replace it with wholesym in the future (i.e. symbolic will live on in the repository forever). I researched a bit and came to the conclusion that it will be a non-trivial effort to update wholesym for our needs. We will want to expose CFI and probably add some crate features to be able to turn off networked stuff (which brings along unnecessary crates for our purposes). This is somewhat straightforward for macos and linux as macho-unwind-info and gimli provide CFI support. Windows (PE) will be a little more effort. But at this time I think it makes more sense to go ahead with using symbolic. It's entirely possible we'll use it for other things in the future, so I don't feel too bad about vendoring it in the repository. In total the source of the crates are about 1.1MB.

Attachment #9328160 - Attachment description: WIP: Bug 1743983 - Rewrite the minidump-analyzer in Rust r=gsvelto → Bug 1743983 - Rewrite the minidump-analyzer in Rust r=gsvelto
Blocks: syn-2

Depends on D175252

Due to gsvelto's discovery that the new minidump-analyzer uses an unacceptable amount of memory (with a large portion being from internal symbolic data structures), we need to rework the approach here. We cannot use symbolic to enable stack walking.

Instead, I am planning to use framehop. As of now, minidump-analyzer only needs the return addresses (no symbolication). It would be possible to use framehop alone with minidump to read the stack/registers, without using minidump-unwind at all. However, framehop doesn't do stack scanning, so this would not work for platforms without CFI. So we should likely use framehop with minidump-unwind, and ensure that will work without requiring symbolication.

framehop also doesn't support PE unwind info, so that will need to be added.

gsvelto, how does this approach sound to you?

Flags: needinfo?(gsvelto)

(In reply to Alex Franchuk from comment #8)

Due to gsvelto's discovery that the new minidump-analyzer uses an unacceptable amount of memory (with a large portion being from internal symbolic data structures), we need to rework the approach here. We cannot use symbolic to enable stack walking.

Instead, I am planning to use framehop. As of now, minidump-analyzer only needs the return addresses (no symbolication). It would be possible to use framehop alone with minidump to read the stack/registers, without using minidump-unwind at all. However, framehop doesn't do stack scanning, so this would not work for platforms without CFI. So we should likely use framehop with minidump-unwind, and ensure that will work without requiring symbolication.

framehop also doesn't support PE unwind info, so that will need to be added.

gsvelto, how does this approach sound to you?

This SGTM but before adding functionality to framehop I think it would be better to start with a minimum viable implementation on Linux - where it should already work well. We'll use that to validate if the approach works and then jump into adding whatever we need for the other platforms.

Flags: needinfo?(gsvelto)

Excellent idea, I will try Linux first.

I've confirmed that this works using framehop (and wholesym for completeness, which actually was able to get some symbols that symbolic couldn't). framehop sometimes performed functionally better than symbolic (getting a few extra frames, usually at the beginning). As far as performance, in debug builds framehop was about 10x faster than symbolic, and had very low memory usage.

There's one drawback right now, which is that framehop doesn't restore all registers for each unwound frame. This can be added in the future; I'm fine with adding a note to the documentation of the minidump-unwind DebugInfoSymbolProvider to that effect.

I'll move forward with adding the other platforms we need to framehop.

EDIT
I just did a release build comparison.
framehop + wholesym: 146MB peak memory usage, ~170ms usr time
symbolic: 322MB peak memory usage, 1.44s usr time

Debug builds of the framehop solution run faster than release builds of symbolic :)

Note that the memory usage is just for the simple case of an about:crashcontent minidump. This only loads libc.so and libxul.so (on linux). However the framehop solution works differently than the old way which lazily loaded modules: it loads all modules up-front, so it shouldn't use much more memory than that for any firefox minidumps.

No longer blocks: syn-2
Depends on: 1882578
Attachment #9328160 - Attachment is obsolete: true
Attachment #9341287 - Attachment is obsolete: true
Attachment #9398215 - Attachment description: Bug 1743983 pt3 - Remove and replace the old minidump analyzer r=gsvelto → Bug 1743983 pt3 - Remove the old minidump analyzer r=gsvelto

infinite_code_chain, infinite_entry_chain, invalid_exception_rva,
and not_a_pe all used a hacky --force-use-module flag to test the
unwind code directly. This is no longer possible since unwinding is done
in a third-party crate, and such tests would be more appropriate as unit
tests in minidump-unwind or framehop.

unknown_op relied on our unwind code not supporting some unwind
operations. The new minidump analyzer now supports all unwind
operations, so we can't test this anymore.

Attachment #9398213 - Attachment description: Bug 1743983 pt1 - Audits and vendoring for minidump-analyzer r=gsvelto → WIP: Bug 1743983 pt1 - Audits and vendoring for minidump-analyzer r=gsvelto
Blocks: 1372830
Attachment #9398213 - Attachment description: WIP: Bug 1743983 pt1 - Audits and vendoring for minidump-analyzer r=gsvelto → Bug 1743983 pt1 - Audits and vendoring for minidump-analyzer r=gsvelto
Pushed by afranchuk@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b476928598ab pt1 - Audits and vendoring for minidump-analyzer r=glandium,supply-chain-reviewers,profiler-reviewers,aabh https://hg.mozilla.org/integration/autoland/rev/7c90923fce11 pt2 - Rewrite the minidump-analyzer in Rust r=gsvelto https://hg.mozilla.org/integration/autoland/rev/5da5022a4f08 pt3 - Remove the old minidump analyzer r=gsvelto https://hg.mozilla.org/integration/autoland/rev/ace3ac2e2e26 pt4 - Remove inapplicable tests r=gsvelto
Regressions: 1911680

It seems that mingw is missing some symbols. That's fairly inconvenient, but we can dynamically load them.

Flags: needinfo?(afranchuk)

Ironically an earlier version of this code did dynamically load all symbols from wincrypt.dll wintrust.dll (as the old version does). It should not be too difficult to resurrect and update that, though moving to windows-sys originally offered a nice advantage in not needing to mess around with dynamically loading the functions.

As for the other failures:

  • The gv-junit failure seems unrelated to this change, as it's on android (minidump-analyzer isn't on android yet).
  • The bc failure is because minidump-unwind doesn't support fat dylibs. This was an unintentional omission.
  • The xpcshell failures on test_crash_exc_guard.js looks related to the minidump crates giving more detail (so the test can probably be updated).
  • The xpcshell crash failures look to be due to minidump_writer, which had to be updated when vendoring dependencies to not have duplicate crate versions.

So we need to:

  1. Decide whether to dynamically load wincrypt wintrust functions again or simply disable support for that fallback code path on mingw.
  2. Update minidump-unwind to support fat dylibs.
  3. Update toolkit/crashreporter/test/browser/browser_sandbox_crash.js with the new value.
  4. Potentially fix the issue in minidump_writer and bump the version.

(In reply to Alex Franchuk [:afranchuk] from comment #20)

Ironically an earlier version of this code did dynamically load all symbols from wincrypt.dll (as the old version does). It should not be too difficult to resurrect and update that, though moving to windows-sys originally offered a nice advantage in not needing to mess around with dynamically loading the functions.

The missing symbols are in wintrust.dll, not wincrypt.dll. They're probably only missing from the mingw import libs. Patching that is in the realm of possibilities. The functions are actually already in mscat.h, they're just missing from wintrust.def.

(In reply to Mike Hommey [:glandium] from comment #22)

(In reply to Alex Franchuk [:afranchuk] from comment #20)

Ironically an earlier version of this code did dynamically load all symbols from wincrypt.dll (as the old version does). It should not be too difficult to resurrect and update that, though moving to windows-sys originally offered a nice advantage in not needing to mess around with dynamically loading the functions.

The missing symbols are in wintrust.dll, not wincrypt.dll. They're probably only missing from the mingw import libs. Patching that is in the realm of possibilities. The functions are actually already in mscat.h, they're just missing from wintrust.def.

Sorry, that was a [repeated] typo (the function names throw me off). If we could patch it that'd be great, I'd prefer that over going to the old version of the code.

https://github.com/rust-minidump/rust-minidump/pull/1019 has the changes to minidump-unwind to support fat dylibs.

Can we land the minidump-writer in a separate bug before we land this to reduce the scope here?

It may be more complicated than that because there may need to be a number of other crate updates at the same time, however it's worth trying.

https://github.com/rust-minidump/minidump-writer/pull/129 fixes the minidump_writer crash. That was probably occurring because Android binaries are a bit weird sometimes, and we may have read something with an invalid start address or the derived data offset was gibberish.

Depends on: 1912131

(In reply to Alex Franchuk [:afranchuk] from comment #24)

https://github.com/rust-minidump/rust-minidump/pull/1019 has the changes to minidump-unwind to support fat dylibs.

The test failures exposed this issue, but I believe those failures are otherwise not due to these patches (there is an associated intermittent failure bug). But I was able to confirm that the rust-minidump PR works.

This test used to check for the exact string "SIGSYS", however the new
minidump-analyzer gives more context in the crash cause: because we are
using a call to chroot to crash the process which is sandboxed by
seccomp, the cause is "SIGSYS / SYS_SECCOMP".

Blocks: 1917685
Pushed by afranchuk@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a46c460ecb3a pt1 - Audits and vendoring for minidump-analyzer r=glandium,supply-chain-reviewers,profiler-reviewers,aabh https://hg.mozilla.org/integration/autoland/rev/60b418a167d7 pt2 - Rewrite the minidump-analyzer in Rust r=gsvelto https://hg.mozilla.org/integration/autoland/rev/777d22bbe805 pt3 - Remove the old minidump analyzer r=gsvelto https://hg.mozilla.org/integration/autoland/rev/85fe08447584 pt4 - Remove inapplicable tests r=gsvelto https://hg.mozilla.org/integration/autoland/rev/d752fefce387 pt5 - Update sigsys check in unit test r=gsvelto https://hg.mozilla.org/integration/autoland/rev/4c18ba39bd29 pt6 - Add missing CryptCATAdmin symbols to mingw wintrust.def r=glandium

Backed out for causing xpcshell failures on test_crash_exc_guard.js

Backout link

Push with failures

Failure log

Flags: needinfo?(afranchuk)

The new minidump-analyzer adds more context to the crash cause/type.

Pushed by afranchuk@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b533615923ff pt1 - Audits and vendoring for minidump-analyzer r=glandium,supply-chain-reviewers,profiler-reviewers,aabh https://hg.mozilla.org/integration/autoland/rev/86fd9a8022d5 pt2 - Rewrite the minidump-analyzer in Rust r=gsvelto https://hg.mozilla.org/integration/autoland/rev/1e5b92251183 pt3 - Remove the old minidump analyzer r=gsvelto https://hg.mozilla.org/integration/autoland/rev/98825aca0eff pt4 - Remove inapplicable tests r=gsvelto https://hg.mozilla.org/integration/autoland/rev/2161233cb06e pt5 - Update sigsys check in unit test r=gsvelto https://hg.mozilla.org/integration/autoland/rev/99bd98d4f69b pt6 - Add missing CryptCATAdmin symbols to mingw wintrust.def r=glandium https://hg.mozilla.org/integration/autoland/rev/60b40c542ce3 pt7 - Fix test_crash_exc_guard.js on macos r=gsvelto
Flags: needinfo?(afranchuk)
Regressions: 1921420
Blocks: 1923657
Regressions: 1931237
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: