Rewrite the minidump-analyzer in Rust
Categories
(Toolkit :: Crash Reporting, task)
Tracking
()
Tracking | Status | |
---|---|---|
firefox132 | --- | fixed |
People
(Reporter: gsvelto, Assigned: afranchuk)
References
(Blocks 2 open bugs)
Details
Attachments
(8 files, 2 obsolete files)
169.19 KB,
patch
|
Details | Diff | Splinter Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
As per title. The current minidump-analyzer takes the JSON annotations file as generated in a Firefox crash report plus the minidump and creates a stack trace that will be added to the JSON annotations. While doing so it also retrieves module signatures and adds them to the annotations.
We only ever implemented reading unwinding information on x86-64/Windows in bug 1333126, for all the other platforms we only used frame pointers or stack scanning.
A rewrite could be done in several steps:
- Build a version that only uses stack scanning & frame pointers but which is otherwise compatible with the existing implementation
- Add unwinding support for Windows (by using the symbolic-debuginfo crate to read the unwinding information from PE files)
- Add unwinding support macOS (likewise but for Mach-O ones)
- Add unwinding support for Linux/Android. This is trickier as I believe that symbolic-debuginfo has no way of surfacing this data - yet - so we might have to gimli/goblin directly to parse the data
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 1•2 years ago
|
||
Once the stackwalking functionality is split out of the processor we can move forward, as we don't want to audit/vendor the huge dependencies of minidump-processor
.
Assignee | ||
Comment 2•2 years ago
|
||
This needs the minidump-walk-stack
crate to be vendored. It builds when minidump
and
minidump-walk-stack
are changed to paths in the split-processor-crate
branch of rust-minidump
(see https://github.com/rust-minidump/rust-minidump/pull/825).
Assignee | ||
Comment 3•2 years ago
|
||
I'm now just waiting on a new release of rust-minidump
. Unfortunately the way vendoring works, I can't make progress or do it ahead of time without the crates being in crates.io
(I tried directing them to both git urls and local paths as a means of working on it prior to publishing, but that seems to cause the packages to not be pulled in by vendoring properly).
Reporter | ||
Comment 4•2 years ago
|
||
There's a bit more work that needs to be done. Steven Fackler did some testing and noticed that the logic that finds symbols has some issues. We also don't demangle symbols when using local debug information. Neither of these issues would preclude us from using the existing code for minidump-analyze because we won't be symbolicating the stacks on the user machine.
However the symbol information also contains the inlining information and with Steven's patch applied I get incoherent stacks. The latter needs to be fixed before we roll out the new minidump-analyzer as we need consistent stacks with what we get on Socorro.
Reporter | ||
Comment 5•2 years ago
|
||
I've just published version 0.17.0 of the rust-minidump
crate including the new minidump-unwind
crate which we need here.
Assignee | ||
Comment 6•2 years ago
|
||
This is almost there. When vendoring dependencies, some files hit the size limit. We've reached out to gimli developers wrt the fixtures in that crate (which don't need to be vendored). Some of the other large files are necessary so we'll have to accept them.
:glandium brought up slight concern with vendoring symbolic since we hope to replace it with wholesym in the future (i.e. symbolic will live on in the repository forever). I researched a bit and came to the conclusion that it will be a non-trivial effort to update wholesym for our needs. We will want to expose CFI and probably add some crate features to be able to turn off networked stuff (which brings along unnecessary crates for our purposes). This is somewhat straightforward for macos and linux as macho-unwind-info
and gimli
provide CFI support. Windows (PE) will be a little more effort. But at this time I think it makes more sense to go ahead with using symbolic. It's entirely possible we'll use it for other things in the future, so I don't feel too bad about vendoring it in the repository. In total the source of the crates are about 1.1MB.
Updated•2 years ago
|
Assignee | ||
Comment 7•2 years ago
|
||
Depends on D175252
Assignee | ||
Comment 8•2 years ago
|
||
Due to gsvelto's discovery that the new minidump-analyzer uses an unacceptable amount of memory (with a large portion being from internal symbolic
data structures), we need to rework the approach here. We cannot use symbolic
to enable stack walking.
Instead, I am planning to use framehop
. As of now, minidump-analyzer
only needs the return addresses (no symbolication). It would be possible to use framehop
alone with minidump
to read the stack/registers, without using minidump-unwind
at all. However, framehop
doesn't do stack scanning, so this would not work for platforms without CFI. So we should likely use framehop
with minidump-unwind
, and ensure that will work without requiring symbolication.
framehop
also doesn't support PE unwind info, so that will need to be added.
gsvelto, how does this approach sound to you?
Reporter | ||
Comment 9•2 years ago
|
||
(In reply to Alex Franchuk from comment #8)
Due to gsvelto's discovery that the new minidump-analyzer uses an unacceptable amount of memory (with a large portion being from internal
symbolic
data structures), we need to rework the approach here. We cannot usesymbolic
to enable stack walking.Instead, I am planning to use
framehop
. As of now,minidump-analyzer
only needs the return addresses (no symbolication). It would be possible to useframehop
alone withminidump
to read the stack/registers, without usingminidump-unwind
at all. However,framehop
doesn't do stack scanning, so this would not work for platforms without CFI. So we should likely useframehop
withminidump-unwind
, and ensure that will work without requiring symbolication.
framehop
also doesn't support PE unwind info, so that will need to be added.gsvelto, how does this approach sound to you?
This SGTM but before adding functionality to framehop
I think it would be better to start with a minimum viable implementation on Linux - where it should already work well. We'll use that to validate if the approach works and then jump into adding whatever we need for the other platforms.
Assignee | ||
Comment 10•2 years ago
|
||
Excellent idea, I will try Linux first.
Assignee | ||
Comment 11•2 years ago
•
|
||
I've confirmed that this works using framehop
(and wholesym
for completeness, which actually was able to get some symbols that symbolic
couldn't). framehop
sometimes performed functionally better than symbolic
(getting a few extra frames, usually at the beginning). As far as performance, in debug builds framehop
was about 10x faster than symbolic
, and had very low memory usage.
There's one drawback right now, which is that framehop
doesn't restore all registers for each unwound frame. This can be added in the future; I'm fine with adding a note to the documentation of the minidump-unwind
DebugInfoSymbolProvider
to that effect.
I'll move forward with adding the other platforms we need to framehop
.
EDIT
I just did a release build comparison.
framehop + wholesym
: 146MB peak memory usage, ~170ms usr time
symbolic
: 322MB peak memory usage, 1.44s usr time
Debug builds of the framehop
solution run faster than release builds of symbolic
:)
Note that the memory usage is just for the simple case of an about:crashcontent
minidump. This only loads libc.so
and libxul.so
(on linux). However the framehop
solution works differently than the old way which lazily loaded modules: it loads all modules up-front, so it shouldn't use much more memory than that for any firefox minidumps.
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 13•1 year ago
|
||
Assignee | ||
Comment 14•1 year ago
|
||
Assignee | ||
Comment 15•1 year ago
|
||
Updated•1 year ago
|
Assignee | ||
Comment 16•1 year ago
|
||
infinite_code_chain
, infinite_entry_chain
, invalid_exception_rva
,
and not_a_pe
all used a hacky --force-use-module
flag to test the
unwind code directly. This is no longer possible since unwinding is done
in a third-party crate, and such tests would be more appropriate as unit
tests in minidump-unwind
or framehop
.
unknown_op
relied on our unwind code not supporting some unwind
operations. The new minidump analyzer now supports all unwind
operations, so we can't test this anymore.
Updated•1 year ago
|
Updated•1 year ago
|
Comment 17•1 year ago
|
||
Assignee | ||
Comment 19•1 year ago
|
||
It seems that mingw is missing some symbols. That's fairly inconvenient, but we can dynamically load them.
Assignee | ||
Comment 20•1 year ago
•
|
||
Ironically an earlier version of this code did dynamically load all symbols from wincrypt.dll wintrust.dll (as the old version does). It should not be too difficult to resurrect and update that, though moving to windows-sys
originally offered a nice advantage in not needing to mess around with dynamically loading the functions.
Assignee | ||
Comment 21•1 year ago
•
|
||
As for the other failures:
- The gv-junit failure seems unrelated to this change, as it's on android (
minidump-analyzer
isn't on android yet). - The bc failure is because
minidump-unwind
doesn't support fat dylibs. This was an unintentional omission. - The xpcshell failures on test_crash_exc_guard.js looks related to the
minidump
crates giving more detail (so the test can probably be updated). - The xpcshell crash failures look to be due to
minidump_writer
, which had to be updated when vendoring dependencies to not have duplicate crate versions.
So we need to:
- Decide whether to dynamically load
wincrypt
wintrust
functions again or simply disable support for that fallback code path on mingw. - Update
minidump-unwind
to support fat dylibs. - Update
toolkit/crashreporter/test/browser/browser_sandbox_crash.js
with the new value. - Potentially fix the issue in
minidump_writer
and bump the version.
Comment 22•1 year ago
|
||
(In reply to Alex Franchuk [:afranchuk] from comment #20)
Ironically an earlier version of this code did dynamically load all symbols from wincrypt.dll (as the old version does). It should not be too difficult to resurrect and update that, though moving to
windows-sys
originally offered a nice advantage in not needing to mess around with dynamically loading the functions.
The missing symbols are in wintrust.dll, not wincrypt.dll. They're probably only missing from the mingw import libs. Patching that is in the realm of possibilities. The functions are actually already in mscat.h, they're just missing from wintrust.def.
Assignee | ||
Comment 23•1 year ago
•
|
||
(In reply to Mike Hommey [:glandium] from comment #22)
(In reply to Alex Franchuk [:afranchuk] from comment #20)
Ironically an earlier version of this code did dynamically load all symbols from wincrypt.dll (as the old version does). It should not be too difficult to resurrect and update that, though moving to
windows-sys
originally offered a nice advantage in not needing to mess around with dynamically loading the functions.The missing symbols are in wintrust.dll, not wincrypt.dll. They're probably only missing from the mingw import libs. Patching that is in the realm of possibilities. The functions are actually already in mscat.h, they're just missing from wintrust.def.
Sorry, that was a [repeated] typo (the function names throw me off). If we could patch it that'd be great, I'd prefer that over going to the old version of the code.
Assignee | ||
Comment 24•1 year ago
|
||
https://github.com/rust-minidump/rust-minidump/pull/1019 has the changes to minidump-unwind
to support fat dylibs.
Reporter | ||
Comment 25•1 year ago
|
||
Can we land the minidump-writer in a separate bug before we land this to reduce the scope here?
Assignee | ||
Comment 26•1 year ago
|
||
It may be more complicated than that because there may need to be a number of other crate updates at the same time, however it's worth trying.
Assignee | ||
Comment 27•1 year ago
|
||
https://github.com/rust-minidump/minidump-writer/pull/129 fixes the minidump_writer
crash. That was probably occurring because Android binaries are a bit weird sometimes, and we may have read something with an invalid start address or the derived data offset was gibberish.
Assignee | ||
Comment 28•1 year ago
|
||
(In reply to Alex Franchuk [:afranchuk] from comment #24)
https://github.com/rust-minidump/rust-minidump/pull/1019 has the changes to
minidump-unwind
to support fat dylibs.
The test failures exposed this issue, but I believe those failures are otherwise not due to these patches (there is an associated intermittent failure bug). But I was able to confirm that the rust-minidump PR works.
Assignee | ||
Comment 29•1 year ago
|
||
This test used to check for the exact string "SIGSYS", however the new
minidump-analyzer gives more context in the crash cause: because we are
using a call to chroot
to crash the process which is sandboxed by
seccomp, the cause is "SIGSYS / SYS_SECCOMP".
Assignee | ||
Comment 30•1 year ago
|
||
Comment 31•11 months ago
|
||
Comment 32•11 months ago
|
||
Backed out for causing xpcshell failures on test_crash_exc_guard.js
Assignee | ||
Comment 33•11 months ago
|
||
The new minidump-analyzer adds more context to the crash cause/type.
Comment 34•11 months ago
|
||
Assignee | ||
Updated•11 months ago
|
Comment 35•11 months ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/b533615923ff
https://hg.mozilla.org/mozilla-central/rev/86fd9a8022d5
https://hg.mozilla.org/mozilla-central/rev/1e5b92251183
https://hg.mozilla.org/mozilla-central/rev/98825aca0eff
https://hg.mozilla.org/mozilla-central/rev/2161233cb06e
https://hg.mozilla.org/mozilla-central/rev/99bd98d4f69b
https://hg.mozilla.org/mozilla-central/rev/60b40c542ce3
Description
•