Closed Bug 1296274 Opened 8 years ago Closed 5 years ago

Evalute symbolicating stack traces when doing client-side stack walking

Categories

(Toolkit :: Crash Reporting, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: gsvelto, Unassigned)

References

Details

(Whiteboard: [fce-active-legacy])

+++ This bug was initially created as a clone of Bug #1280469 +++

In the current design when we extract the stack traces of a crash dump on a client machine we do not symbolicate the trace, relying instead on server-side infrastructure to accomplish that task.

Symbolicating on the client side would add some additional complexity to the process but bring a few benefits with it: we could extract proper symbols from Linux system libraries for example which we currently don't do. Additionally we might be able to generate a signature directly and use it to provide additional information to the user (e.g. point him to the relevant bug or a known solution or workaround for his problem).
Does this also include CFI-based stackwalking? To get this to release, we either need frame pointers or we need CFI-based stackwalking, and once we have that the symbolication is "easy".
Flags: needinfo?(gsvelto)
I thought the concern with this was that including frame pointers on Beta/Release would negatively impact the size of the program. (I don't know enough about CFI yet to have an opinion.) Either way, the signature aspect would be wise to do in conjunction with the Socorro team.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #1)
> Does this also include CFI-based stackwalking? To get this to release, we
> either need frame pointers or we need CFI-based stackwalking, and once we
> have that the symbolication is "easy".

Breakpad supports CFI-based stackwalking so it should work. I haven't had time to test it though (i.e. build a production Firefox for 32-bit Windows and see if the stacks that come out make sense).
Flags: needinfo?(gsvelto)
I've spent most of today looking at how production win32 build behave with my patch set and things are not working as I expected them to. While breakpad does support CFI-based stack walking it doesn't seem to use it when processing a minidump for a non-debug build. Instead it falls back to stack scanning (which it also does when walking over JIT frames). The results I get from stack scanning are decent but not perfect.

It might be that I'm not setting up breakpad correctly when parsing the minidump file (so it can't find the unwinding tables) but that requires more investigation.
CFI stackwalking requires that the CFI be available (from the debuginfo: it's not in the binaries or anything we ship to the client by default). So doing a CFI stackwalk means, I think:

* altering the symbol server to publish the CFI data for an offset, not just the symbol/source/line
* altering breakpad to fetch and use symbol/CFI information dynamically from the symbol server

That's why I figured symbolication would be tied up with CFI stackwalking: you can't postprocess the stackwalk to add symbol data: it's a dynamic input into the stackwalking process.
(In reply to David Durst [:ddurst] from comment #2)
> I thought the concern with this was that including frame pointers on
> Beta/Release would negatively impact the size of the program. (I don't know
> enough about CFI yet to have an opinion.) Either way, the signature aspect
> would be wise to do in conjunction with the Socorro team.

It's not the size of the binary, it's performance. Losing that register in x86 is enough to make a noticeable shift in some benchmarks.

"CFI" isn't really a thing that exists on Windows, what you get there is (on x86) program strings in the PDB file (the STACK WIN lines in symbol files) and (on x86-64) unwind tables in the PE headers (these get dumped as STACK CFI nowadays, I wrote code that translates them, but we could read them directly if we wanted). On non-Windows Breakpad can use the CFI from .debug_frame or .eh_frame. The latter ships in the binaries on x86-64, so you could presumably wire up some Breakpad code to read that directly instead of dumping it to .sym first (or just dump it on-the-fly for a proof of concept). On ARM there are exception tables (.ARM.exidx, .ARM.extbl) which are similar, and we have a local Breakpad patch in-tree to read them which I've never managed to upstream (bug 863475).
Now that we are shipping Win64 builds the perf impact of shipping with frame pointers might be workable since we can always say that we advise using our Win64 builds for the best performance...
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #6)
> 
> It's not the size of the binary, it's performance. Losing that register in
> x86 is enough to make a noticeable shift in some benchmarks.

In Microsoft's WER paper they talk about how they include frame pointers
(http://www.sigops.org/sosp/sosp09/papers/glerum-sosp09.pdf, section 7.1):

"The decision to disable frame-pointer omission (FPO), in Windows XP SP 2, was
originally quite controversial within Microsoft... Programmers believed FPO was
crucial to performance. However, extensive internal testing across a wide range
of both desktop and server benchmarks showed that FPO provided no statistically
provable benefit but significantly impaired post-mortem debugging of unexpected
call stacks. Ultimately it was decided that the benefit of improving
post-mortem debugging outweighed the cost of disabling FPO."

Whether that's still true and whether it would be true for us, I don't know.
But it's an interesting anecdote.

> Now that we are shipping Win64 builds the perf impact of shipping with frame
> pointers might be workable since we can always say that we advise using our
> Win64 builds for the best performance...

We're not yet shipping them in any meaningful sense -- you have to seek them
out. https://wiki.mozilla.org/Firefox/win64 has the rollout plans.

Even once we do ship Win64 widely it still will be unusable by
anyone running a 32-bit Windows. And for people running 64-bit Windows, the
plan is to eventually only offer 64-bit Firefox to users with more than 3,800
MiB of available physical memory, because 64-bit Firefox uses more memory
due to pointers being larger.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #5)
> CFI stackwalking requires that the CFI be available (from the debuginfo:
> it's not in the binaries or anything we ship to the client by default). So
> doing a CFI stackwalk means, I think:
> 
> * altering the symbol server to publish the CFI data for an offset, not just
> the symbol/source/line
> * altering breakpad to fetch and use symbol/CFI information dynamically from
> the symbol server

IIRC you can build unwinding tables without building the symbol tables (and vice-versa) so there shouldn't be a strict dependency between the two but I don't know the details on Windows. That being said and also in the light of Ted and Nicholas' comments we might want to hold this off until we've got all the basic stuff working on nightly and aurora where we have frame pointers everywhere and separate symbolication is easy.

BTW I've tested on Win64 too and I get pretty much the same stacks that I get on Win32 with stack scanning. Again decent but not perfect (or "better than nothing" if you prefer ;) ).
(In reply to Nicholas Nethercote [:njn] from comment #8)
> In Microsoft's WER paper they talk about how they include frame pointers
> (http://www.sigops.org/sosp/sosp09/papers/glerum-sosp09.pdf, section 7.1):
<...>
> Whether that's still true and whether it would be true for us, I don't know.
> But it's an interesting anecdote.

I love that paper. :) I think browsers are scrutinized more on benchmarks in the press than operating systems, though. I'm pretty confident that someone did the measurement in the past and it's noticeable. I think you can actually verify this pretty easily, just compare the same changeset in Talos on aurora and beta before and after an uplift, where we switch off --enable-profiling.

> We're not yet shipping them in any meaningful sense -- you have to seek them
> out. https://wiki.mozilla.org/Firefox/win64 has the rollout plans.

Sure, but for the purposes of "a journalist running benchmarks" this seems fine. We're already promoting Win64 builds for journalists because they offer better asm.js performance, for example.

> Even once we do ship Win64 widely it still will be unusable by
> anyone running a 32-bit Windows. And for people running 64-bit Windows, the
> plan is to eventually only offer 64-bit Firefox to users with more than 3,800
> MiB of available physical memory, because 64-bit Firefox uses more memory
> due to pointers being larger.

This was discussed to death, but bug 1274659 comment 9 mentions that OOMs due to address space fragmentation are basically our #1 topcrash on Windows, and Win64 builds eliminate that, despite using more memory. (RE: 32-bit only users, I'm pretty sure someone got the data on that from telemetry but I can't find it right now.)
(In reply to Gabriele Svelto [:gsvelto] from comment #9)
> BTW I've tested on Win64 too and I get pretty much the same stacks that I
> get on Win32 with stack scanning. Again decent but not perfect (or "better
> than nothing" if you prefer ;) ).

Right, so, the upstream Breakpad code has BasicSourceLineResolver, which reads .sym files output from dump_syms (and also FastSourceLineResolver which reads a binary symbol format which is faster because it doesn't have to be parsed), but none of that knows how to read symbols/unwind info of any format out of local binaries for unwinding. We'd have to write code for that. I have some code floating around that sort of works for Linux:http://hg.mozilla.org/users/tmielczarek_mozilla.com/minidump-stackwalk/file/31175cf72d19/local_debug_info_symbolizer.cc

We used a modified version of this for the Gecko profiler at one point, although that code has since been removed from the tree:
https://hg.mozilla.org/mozilla-central/file/c93664e82808/tools/profiler/gecko/local_debug_info_symbolizer.cc
Whiteboard: [fce-active]
Whiteboard: [fce-active] → [fce-active-legacy]

We've started doing this on the server side AFAIK so I think we can close this.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.