Closed Bug 1741205 Opened 3 years ago Closed 2 years ago

Replace the Breakpad-based minidump_stackwalk toolchain binary with rust-minidump's one

Tracking

()

Status:

RESOLVED FIXED

Milestone:

96 Branch

Tracking Flags:

Tracking

Status

firefox96

---

fixed

People

(Reporter: gsvelto, Assigned: Gankra)

References

(Blocks 2 open bugs)

Details

Attachments

(3 files, 1 obsolete file)

Bug 1741205 - Use rust-minidump's minidump-stackwalk for mozcrash.py 3 years ago Aria Beingessner [:Gankra] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1741205 - remove tools/crashreporter/minidump_stackwalk 3 years ago Aria Beingessner [:Gankra] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1741205 - stub out a 64-bit windows build for minidump-stackwalk. r?gsvelto 3 years ago Aria Beingessner [:Gankra] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1741205 - add more robust version checking to mozcrash.py. r?kriswright 2 years ago Aria Beingessner [:Gankra] 48 bytes, text/x-phabricator-request		Details \| Review

Gabriele Svelto [:gsvelto] (PTO)

Reporter

Description

•

3 years ago

As per title, it's a matter of replacing the toolchain task that builds it and adjusting mozcrash.py to pass the --human parameter so that we get human-readable output instead of JSON.

Aria Beingessner [:Gankra]

Assignee

Updated

•

3 years ago

Assignee: nobody → a.beingessner

Gabriele Svelto [:gsvelto] (PTO)

Reporter

Updated

•

3 years ago

Blocks: 1487410

Gabriele Svelto [:gsvelto] (PTO)

Reporter

Updated

•

3 years ago

Blocks: 1410840

Aria Beingessner [:Gankra]

Assignee

Comment 1

•

3 years ago

Some notes:

need to pass --human parameter
output must conform to this regex (it does)
must properly append --symbols-url to the url arg (the minidump_stackwalk binary used here implicitly accepts them in the position of symbols_path args, which I do not believe we do, and I do not want to support)
- Note that although the symbols_path argument to mozcrash.py does take other URLs, those are pre-processed away by _get_symbols, so we only need to handle this one case.
These scripts expect a binary named minidump_stackwalk, which rust-minidump also produces, but I have the crate published on crates.io as minidump-stackwalk, so that's what cargo install will name it. May need to do some renaming depending on how we build it.

Will Kahn-Greene [:willkg] ET needinfo? me

Comment 2

•

3 years ago

I'm working on fixing bug #1594515 which touches minidump_stackwalk and mozcrash. If you want me to add some future-proofing to ease some pain, let me know.

Aria Beingessner [:Gankra]

Assignee

Comment 3

•

3 years ago

Oh that's interesting context. I'm confused by the discussion of a required "temp" directory argument since I don't see that in mozcrash.py. (offer appreciated, don't see anything that needs future-proofing though)

Aria Beingessner [:Gankra]

Assignee

Comment 4

•

3 years ago

Attached file Bug 1741205 - Use rust-minidump's minidump-stackwalk for mozcrash.py — Details

This is the first step in replacing a huge pile of our breakpad-based infra
with our new implementation (rust-minidump). This stackwalker is only used
for reporting crashes in local builds and CI, so it's a good first deploy.

Although most of the work on rust-minidump has been focused on the JSON output,
this uses the --human output, because it's primarily intended for humans to
directly read. There is however some minor parsing done on this format. This
is not strictly supported by --human (it has no schema) but it's not something
we plan to break. (This parsing is pre-existing, just recording the facts.)

The new build configs/scripts are hybridized from fix-stacks and dump_syms,
as this basically is a hybrid of the two. In particular it needs the openssl
vendoring tricks that dump_syms uses, but is a target binary that prefers
win32 over win64 (like fix-stacks).

Technically a regression but probably just culling legacy cruft at this point:
this patchset removes support for building a local copy of minidump-stackwalk
from source. You must now download a copy built on task-cluster using mozboot.
mozboot already did this, which is why this feature appears to be legacy cruft
-- there was little reason to build a local copy.

However rust-minidump's minidump-stackwalk has a far better portability story,
so you can build+install your own local copy by just running:

cargo install minidump-stackwalk

Aria Beingessner [:Gankra]

Assignee

Comment 5

•

3 years ago

Attached file Bug 1741205 - remove tools/crashreporter/minidump_stackwalk — Details

This version of minidump-stackwalk is now replaced with rust-minidump's
minidump-stackwalk, which we build from a FETCH. Not touching the other
stuff in this directory because I have no idea what it is.

Depends on D131315

Aria Beingessner [:Gankra]

Assignee

Comment 6

•

3 years ago

try build https://treeherder.mozilla.org/jobs?repo=try&revision=774702bdba8438a7041c5a08a834abafb162aa99

Aria Beingessner [:Gankra]

Assignee

Comment 7

•

3 years ago

that build seems to be caching the minidump-stackwalk builds, so here's the last try build that touched them: https://treeherder.mozilla.org/jobs?repo=try&revision=2cbb95ed2252de120f685b4829af6244cf422b2d

Aria Beingessner [:Gankra]

Assignee

Comment 8

•

3 years ago

you can see it working in production in this even-earlier-try build's logs: https://treeherder.mozilla.org/logviewer?job_id=358264754&repo=try&lineNumber=21679-21703

Aria Beingessner [:Gankra]

Assignee

Comment 9

•

3 years ago

Attached file Bug 1741205 - stub out a 64-bit windows build for minidump-stackwalk. r?gsvelto (obsolete) — Details

Depends on D131316

Phabricator Automation

Updated

•

3 years ago

Attachment #9251211 - Attachment is obsolete: true

Aria Beingessner [:Gankra]

Assignee

Comment 10

•

3 years ago

Squashed the 64-bit stuff into the first commit after address review and doing some refactoring. I had broken breakpad-injector which is now a bit of an unrelated oddball.

New try push: https://treeherder.mozilla.org/jobs?repo=try&revision=86f8286d0da5a3cf2add65120ff6868990431389

Aria Beingessner [:Gankra]

Assignee

Comment 11

•

3 years ago

Notice to Sheriffs and Tooling Folks

I am now pushing this change to try's crash handler (and the one used for local builds). Ideally it won't change behaviour (meaningfully). So if it is behaving radically, different, please let me know! (Even if the change is positive, that's great to know!)

Specifically, this completely rewrites the tool that is used to produce a backtrace from a minidump when a test unexpectedly crashes (invoking mozcrash.py). This tool is used to generate this kind of line in the try frontend:

PROCESS-CRASH | Last test finished | application crashed [@ static mozilla::image::SurfaceCache::ReleaseImageOnMainThread(already_AddRefed<mozilla::image::Image>, bool)]

And this information in the logs:

INFO - mozcrash Copy/paste: Z:/task_163718277621850/fetches\minidump_stackwalk\minidump_stackwalk.exe --human C:\Users\task_163718277621850\AppData\Local\Temp\tmpra9trz2u.mozrunner\minidumps\b2b1c4f0-3e0a-46b8-b469-988591c3c015.dmp Z:\task_163718277621850\build\symbols --symbols-url=https://symbols.mozilla.org/
INFO - mozcrash Saved minidump as Z:\task_163718277621850\build\blobber_upload_dir\b2b1c4f0-3e0a-46b8-b469-988591c3c015.dmp
INFO - PROCESS-CRASH | Last test finished | application crashed [@ static mozilla::image::SurfaceCache::ReleaseImageOnMainThread(already_AddRefed<mozilla::image::Image>, bool)]
INFO - Crash dump filename: C:\Users\task_163718277621850\AppData\Local\Temp\tmpra9trz2u.mozrunner\minidumps\b2b1c4f0-3e0a-46b8-b469-988591c3c015.dmp
INFO - Operating system: Windows NT
INFO -                   10.0.19041 
INFO - CPU: amd64
INFO -      family 6 model 85 stepping 7
INFO -      8 CPUs
INFO - 
INFO - Crash reason:  EXCEPTION_BREAKPOINT
INFO - Crash address: 0x7ff8ae0df019
INFO - Process uptime: 2 seconds
INFO - 
INFO - Thread 3 TaskController #2 (crashed)
INFO -  0  xul.dll!static mozilla::image::SurfaceCache::ReleaseImageOnMainThread(already_AddRefed<mozilla::image::Image>, bool) [SurfaceCache.cpp:694ce55b85c51b3381eaf432020924e4f0ca4717 : 1831 + 0x40]
INFO -     rax = 0x00007ff8b541d9f9   rdx = 0x0000000000000000
INFO -     rcx = 0x00007ff8e277c978   rbx = 0x0000000000000001
INFO -     rsi = 0x00000041a977f350   rdi = 0x00000177b0f6f740
INFO -     rbp = 0x00000177aa172130   rsp = 0x00000041a977f2b0
INFO -      r8 = 0x00000041a977f820    r9 = 0x00007ff8ea530000
INFO -     r10 = 0x00007ff8ea582651   r11 = 0x00000041a977ec70
INFO -     r12 = 0x00007ff8e26e9630   r13 = 0x00000177b3a15120
INFO -     r14 = 0x00000177b0f970d8   r15 = 0x00000177b28a2a00
INFO -     rip = 0x00007ff8ae0df019
INFO -     Found by: given as instruction pointer in context
INFO -  1  xul.dll!mozilla::image::DecodedSurfaceProvider::FinishDecoding() [DecodedSurfaceProvider.cpp:694ce55b85c51b3381eaf432020924e4f0ca4717 : 200 + 0x37]
INFO -     rbx = 0x0000000000000001   rbp = 0x00000177aa172130
INFO -     rsp = 0x00000041a977f310   r12 = 0x00007ff8e26e9630
INFO -     r13 = 0x00000177b3a15120   r14 = 0x00000177b0f970d8
INFO -     r15 = 0x00000177b28a2a00   rip = 0x00007ff8ae0b0dbf
INFO -     Found by: call frame info
INFO -  2  xul.dll!mozilla::image::DecodedSurfaceProvider::Run() [DecodedSurfaceProvider.cpp:694ce55b85c51b3381eaf432020924e4f0ca4717 : 129 + 0x7]
INFO -     rbx = 0x0000000000000001   rbp = 0x00000177aa172130
INFO -     rsp = 0x00000041a977f390   r12 = 0x00007ff8e26e9630
INFO -     r13 = 0x00000177b3a15120   r14 = 0x00000177b0f970d8
INFO -     r15 = 0x00000177b28a2a00   rip = 0x00007ff8ae0b0929
INFO -     Found by: call frame info
...

Only the new implementation will include the --human flag in the Copy/paste: Z:/task_163718277621850/fetches\minidump_stackwalk\minidump_stackwalk.exe line, if you aren't sure which implementation is being used. However I believe this line is omitted if fix-stacks.py is involved. In that case, you can use the fact that the backtraces will end with "unimplemented streams", a feature unique to the new tool:

INFO -  Unimplemented streams encountered:
INFO -  Stream 0x00000000 UnusedStream (Official) @ 0x00000000
INFO -  Stream 0x00000015 SystemMemoryInfoStream (Official) @ 0x00002d98
INFO -  Stream 0x00000016 ProcessVmCountersStream (Official) @ 0x00002f84

Nitty Gritty Details

I am specifically replacing the implementation of minidump_stackwalk used by mozcrash.py with our new rust-based implementation.

This implementation has been running on the staging servers for crash-stats since friday, and is slated to become the new crash-stats backend once we've done enough testing to be confident in that deployment. Deploying it to try is another step in building our confidence in the new implementation.

On paper, this should be a significant upgrade to try's minidump-stackwalk, in that the one that is currently being used is a weird unmaintained fork of what's used on crash-stats. The new one should be faster and more reliable, and will be easy to update by just changing the commit in its toolchain fetch:

rust-minidump:
    description: rust-minidump source code (for minidump-stackwalk)
    fetch:
        type: git
        repo: https://github.com/luser/rust-minidump/
        revision: 0c90e02544797317503d1c4cff8daab0cabdea86

However this will introduce some changes to how minidump-stackwalk is built:

minidump-stackwalk no longer builds from checked-in source, so it will only need to be built if the toolchain fetch is updated (less builds and churn, yay!)
this introduces a currently orphaned win64-minidump-stackwalk build, for future use in solving Bug 1410840. That can be removed if having an orphan tool sets off some annoying warnings/errors for the sheriffs.
there was technically some wiring in mach to support building minidump-stackwalk locally. this no longer works, as there is no local source to use. mozboot was already downloading minidump_stackwalk for you, so it's unlikely this will affect anyone's workflow. If for whatever reason you want to build your own copy of minidump-stackwalk, you can cargo install minidump-stackwalk, or checkout the rust-minidump tree and build it with cargo build --release (that's all our build servers do!) (NB the binary is officially minidump-stackwalk but we rename it to minidump_stackwalk in CI to avoid needless churn)

Crash Analysis Power User Features

The new minidump-stackwalk should build and run locally on all mainstream platforms without any issue (I develop and test it in Windows Powershell!). I have done my best to write user documentation and the process for analyzing a firefox minidump is streamlined:

> minidump-stackwalk --symbols-url=https://symbols.mozilla.org/ /path/to/minidump.dmp

Because it is designed to be the backend for crash-stats (aka socorro), it can do ~all the analysis you expect there (more details are surfaced in the default JSON output than the human one). To help me test this, I have also created a tool for downloading minidump/annotations from crash-stats and compairing local minidump-stackwalk output to the values on crash-stats: socc-pair. While the comparison machinery is probably not necessarily useful to you, as a side-effect of its purpose it automates fetching all the details of a crash and running local analysis, and saves the results to files you can search through for details:

Output Files: 
  * Minidump: C:\Users\gankra\AppData\Local\Temp\socc-pair\dumps\b4f58e9f-49be-4ba5-a203-8ef160211027.dmp
  * Socorro Processed Crash: C:\Users\gankra\AppData\Local\Temp\socc-pair\dumps\b4f58e9f-49be-4ba5-a203-8ef160211027.json
  * Raw JSON: C:\Users\gankra\AppData\Local\Temp\socc-pair\dumps\b4f58e9f-49be-4ba5-a203-8ef160211027.raw.json
  * Local minidump-stackwalk Output: C:\Users\gankra\AppData\Local\Temp\socc-pair\dumps\b4f58e9f-49be-4ba5-a203-8ef160211027.local.json
  * Local minidump-stackwalk Logs: C:\Users\gankra\AppData\Local\Temp\socc-pair\dumps\b4f58e9f-49be-4ba5-a203-8ef160211027.log.txt

NOTE: these files can contain protected user information. Although they are written to your system's default "temp", I recommend deleting the temp socc-pair directory regularly to ensure compliance with our protected data policies.

Notably this includes all of the logging rust-minidump did (in the .log.txt), including tracing for the backtracer's analysis, which can help you debug strange backtraces:

[TRACE] unwind: starting stack unwind
[TRACE] unwind: unwinding NtGetContextThread
[TRACE] unwind: trying cfi
[TRACE] unwind: found symbols for address, searching for cfi entries
[TRACE] unwind: trying STACK CFI exprs
[TRACE] unwind:   .cfa: $rsp 8 + .ra: .cfa 8 - ^
[TRACE] unwind:   .cfa: $rsp 8 +
[TRACE] unwind: STACK CFI parse successful
[TRACE] unwind: STACK CFI seems reasonable, evaluating
[TRACE] unwind: successfully evaluated .cfa (frame address)
[TRACE] unwind: successfully evaluated .ra (return address)
[TRACE] unwind: cfi evaluation was successful -- caller_ip: 0x000000ec00000000, caller_sp: 0x000000ec7fbfd790
[TRACE] unwind: cfi result seems valid
[TRACE] unwind: unwinding 1013612281855
[TRACE] unwind: trying cfi
[TRACE] unwind: trying frame pointer
[TRACE] unwind: trying scan
[TRACE] unwind: scan seems valid -- caller_ip: 0x7ffd172c2a24, caller_sp: 0xec7fbfd7f8
[TRACE] unwind: unwinding <unknown in ntdll.dll>
[TRACE] unwind: trying cfi
[TRACE] unwind: found symbols for address, searching for cfi entries
[TRACE] unwind: trying frame pointer
[TRACE] unwind: trying scan
[TRACE] unwind: scan seems valid -- caller_ip: 0x7ffd162b7034, caller_sp: 0xec7fbfd828
[TRACE] unwind: unwinding BaseThreadInitThunk
[TRACE] unwind: trying cfi
[TRACE] unwind: found symbols for address, searching for cfi entries
[TRACE] unwind: trying STACK CFI exprs
[TRACE] unwind:   .cfa: $rsp 8 + .ra: .cfa 8 - ^
[TRACE] unwind:   .cfa: $rsp 48 +
[TRACE] unwind: STACK CFI parse successful
[TRACE] unwind: STACK CFI seems reasonable, evaluating
[TRACE] unwind: successfully evaluated .cfa (frame address)
[TRACE] unwind: successfully evaluated .ra (return address)
[TRACE] unwind: cfi evaluation was successful -- caller_ip: 0x0000000000000000, caller_sp: 0x000000ec7fbfd858
[TRACE] unwind: cfi result seems valid
[TRACE] unwind: instruction pointer was nullish, assuming unwind complete
[TRACE] unwind: finished stack unwind

If you are using minidump-stackwalk directly, you can get these same logs with --verbose=trace. --output-file=x/y/z and --log-file=x/y/z arguments are included to make it easier to pipe these streams to files.

Pulsebot

Comment 12

•

3 years ago

Pushed by abeingessner@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b0107c4d5274
Use rust-minidump's minidump-stackwalk for mozcrash.py r=gsvelto,KrisWright
https://hg.mozilla.org/integration/autoland/rev/7b95b224608b
remove tools/crashreporter/minidump_stackwalk r=gsvelto

Sandor Molnar[:smolnar]

Comment 13

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/b0107c4d5274
https://hg.mozilla.org/mozilla-central/rev/7b95b224608b

Status: NEW → RESOLVED

Closed: 2 years ago

status-firefox96: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 96 Branch

Pulsebot

Comment 14

•

2 years ago

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/fc125e910ec9
Port bug 1741205 - Add missing fetch job linux64-rust-windows-1.56. rs=bustage-fix

Gabriele Svelto [:gsvelto] (PTO)

Reporter

Comment 15

•

2 years ago

Please send the info in comment 11 to the dev-platform and stability mailing lists for extra visibility.

Flags: needinfo?(a.beingessner)

Aria Beingessner [:Gankra]

Assignee

Comment 16

•

2 years ago

done for dev-platform, stability seems to not like me (someone with perms, feel free to forward)

Flags: needinfo?(a.beingessner)

Aria Beingessner [:Gankra]

Assignee

Comment 17

•

2 years ago

Attached file Bug 1741205 - add more robust version checking to mozcrash.py. r?kriswright — Details

I forgot that mozboot only pulls in updates when you run ./mach bootstrap,
so some people got the new mozcrash.py locally without actually having the
new rust-minidump-based version. So now we first run the stackwalk binary
with -V to check what version it is.

The rest of the details can be found in the comments I added.

Also updates rust-minidump to 0.9.6 get some CLI parsing fixes
and better --help documentation (socorro staging is already updated to
this version).

Pulsebot

Comment 18

•

2 years ago

Pushed by abeingessner@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/550be0d3b535
add more robust version checking to mozcrash.py. r=KrisWright

Norisz Fay [:noriszfay]

Comment 19

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/550be0d3b535

Mike Hommey [:glandium]

Updated

•

2 years ago

Blocks: 1752252

You need to log in before you can comment on or make changes to this bug.