Closed Bug 607831 Opened 10 years ago Closed 3 years ago

switch to binary format symbols with a minidump_stackwalk replacement

Categories

(Socorro :: Symbols, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ted, Unassigned)


The current stackwalk_server + source_daemon combination is proving inadequate for our needs in Socorro (see bug 592467). At the same time, Google has implemented and landed in the Breakpad repository an alternate symbol implementation that uses binary symbol files instead of the current text format files. I am currently investigating using the new binary symbol files, and I think they'll solve some of our problems.

There's still the question of how we fit them into our process, since we'd like to scale up to a large number of processor nodes, and Aravind is not happy with the idea of having the symbol NFS mount mounted on a large number of systems.

My current testing involved writing a little commandline app to convert text-format symbols into binary-format symbols using the conversion classes available in Breakpad, and also writing a symbol supplier implementation that mimics SimpleSymbolSupplier, but which simply mmaps the binary symbol files. I then built a modified version of minidump_stackwalk which used that symbol supplier to do processing with the binary symbols. Initial results are very promising. I downloaded the symbol files for the Firefox 4.0b6 builds: Linux x86, Linux x86-64, Windows x86, and Mac ppc/x86 Universal. In terms of file size, the binary symbols are about 1.6x larger on disk:

In terms of processing speed, the binary symbols + mmap supplier is much faster:
real	0m1.751s (stock minidump_stackwalk + text symbols)
real	0m0.235s (modified mdsw + mmapped binary symbols)

Both times are with a fully primed fs cache, so there is no actual IO overhead. Reading from a cold fs, the difference is still huge:
real	0m3.654s (text)
real	0m0.943s (binary)
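
For reference, the speedups implied by the two pairs of timings above work out as follows. This is plain arithmetic on the quoted `real` times, not a new measurement:

```python
# Wall-clock times quoted above, in seconds.
timings = {
    "warm cache": {"text": 1.751, "binary": 0.235},
    "cold cache": {"text": 3.654, "binary": 0.943},
}


def speedup(text_seconds, binary_seconds):
    """How many times faster the binary run is than the text run."""
    return text_seconds / binary_seconds


for label, t in timings.items():
    print(f"{label}: {speedup(t['text'], t['binary']):.1f}x faster")
# warm cache: 7.5x faster
# cold cache: 3.9x faster
```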

I also wanted to see the difference in number of bytes read from disk, since with the text-format symbols, the entire file needs to be read in order to do anything with it, whereas with the binary-format symbols, only part of the file needs to be read (although I haven't read the entire implementation, so I'm not sure if it's as smart as I think it should be). I wrote a little utility to exec a program and cat /proc/<pid>/io after it finished, since I was having no luck with systemtap and other utilities:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/procio/
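
The idea behind such a utility can be sketched roughly as below. This is not the code from the repository above, just a minimal Python illustration of the same approach. The function names are made up for the example, and it assumes a Linux system where /proc/<pid>/io is still readable after the child has exited but before it has been reaped.

```python
import os
import subprocess


def parse_proc_io(text):
    """Parse the 'name: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def run_and_report(argv):
    """Run argv, then return the child's I/O counters (Linux only)."""
    child = subprocess.Popen(argv)
    # Wait for the child to exit but leave it unreaped (WNOWAIT), so that
    # its /proc entry still exists when we read it.
    os.waitid(os.P_PID, child.pid, os.WEXITED | os.WNOWAIT)
    with open(f"/proc/{child.pid}/io") as f:
        counters = parse_proc_io(f.read())
    child.wait()  # now actually reap the child
    return counters
```

A call such as `run_and_report(["minidump_stackwalk", "crash.dmp", "symbols/"])` (file names here are placeholders) would then return a dict with keys like `rchar` and `read_bytes`.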

Using this utility I could examine how much data was actually read by both implementations while processing the same minidump:
------------
text symbols:

rchar: 37916068
wchar: 103144
syscr: 4666
syscw: 3076
read_bytes: 34646016
write_bytes: 0
cancelled_write_bytes: 0
------------
binary symbols:

rchar: 3319886
wchar: 103072
syscr: 438
syscw: 3076
read_bytes: 13722624
write_bytes: 0
cancelled_write_bytes: 0
------------

I believe the important number here is read_bytes, which is the number of bytes of actual disk I/O that the program caused. (This is only non-zero when the fs cache is empty; I've been testing with the symbols on a USB drive that I can unmount and remount to flush the cache.) With text symbols for this minidump, we wind up reading about 34MB of data from disk. With binary symbols, we only read about 13MB of data, which is a pretty nice improvement!
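
To put the two dumps side by side, the ratios can be computed directly from the counters quoted above. Again this is only arithmetic on those numbers, not a new measurement:

```python
# Counters copied from the two /proc/<pid>/io dumps above.
text_io = {"rchar": 37916068, "syscr": 4666, "read_bytes": 34646016}
binary_io = {"rchar": 3319886, "syscr": 438, "read_bytes": 13722624}


def ratio(counter):
    """How many times larger the text-symbol counter is than the binary one."""
    return text_io[counter] / binary_io[counter]


for counter in ("rchar", "syscr", "read_bytes"):
    print(f"{counter}: {ratio(counter):.1f}x less with binary symbols")
# rchar: 11.4x less with binary symbols
# syscr: 10.7x less with binary symbols
# read_bytes: 2.5x less with binary symbols

saved = 1 - binary_io["read_bytes"] / text_io["read_bytes"]
print(f"read_bytes reduced by {saved:.0%}")
# read_bytes reduced by 60%
```

In the binary run read_bytes is larger than rchar. That is presumably a consequence of mmap: rchar only counts bytes requested through read-style syscalls, while pages faulted in from a mapping still count as disk reads.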

All that being said, storing the symbol files on disk like we do now is just one option. We could also investigate storing the binary files wholesale in HBase or HDFS.

The tools I used for testing are here:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/symboltests/
(except for procio, in the previously mentioned repository)
Assignee: nobody → ted.mielczarek
If we wind up using the binary-format symbols, regardless of where we store them we'll have to:
a) Convert our existing set of text-format symbols on the symbol store.
b) Figure out a process for converting new symbols as they're uploaded from build machines.
Depends on: 607951
bug 607951 should cover b) from my previous comment. a) will be relatively straightforward, but we'll wind up using a lot more storage during the transition period. We'll have to convert all our symbols and keep both formats until we roll out the new Socorro that uses the new format, at which point we can delete the old text-format symbols (and make the post-upload script delete the text-format files after conversion).
(In reply to comment #0)
> x86, Linux x86-64, Windows x86, and Mac ppc/x86 Universal. In terms of file
> size, the binary symbols are about 1.6x larger on disk:

Oops, I forgot to paste the relevant bit in my comment 0, which was:
297M	text
492M	bin

As I said in comment 0, this is all the symbols from the 4.0b6 release(*).
(* Ok, I left out the mac64 build, but that's because for 4.0b6 we shipped a mac universal + mac64 build, and for 4.0b7 and later we're only shipping a mac64 universal build, so I think this is a reasonable estimate for future symbol storage.)
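
As a rough guide to what that means for storage, here is the arithmetic on the two sizes above. Treating this one release as representative of the whole symbol store is an assumption, not something measured:

```python
text_mb = 297    # text-format symbols for the 4.0b6 builds
binary_mb = 492  # the same symbols converted to binary format

growth = binary_mb / text_mb
both_mb = text_mb + binary_mb  # transition period: both formats are kept

print(f"binary vs text: {growth:.2f}x")
print(f"both formats: {both_mb}M, {both_mb / text_mb:.2f}x the text-only size")
# binary vs text: 1.66x
# both formats: 789M, 2.66x the text-only size
```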
Depends on: 607961
No longer depends on: 607951
Depends on: 607969
For the record, my current plan is to modify stackwalk_server to use the FastSymbolSupplier from here:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/symboltests/file/1fb3d622b089/fast_symbol_supplier.cc
along with the FastSourceLineResolver that was implemented in upstream Breakpad.

Before deploying this we'll need to convert all of our text-format symbols to binary-format, as described in the bugs blocking this one.

This should greatly decrease our processing time, and reduce the amount of data we have to read from the symbol NFS mount. If we find that reading the symbols from NFS becomes a sticking point in the future, it should be fairly straightforward to store the binary symbols in hbase or whatever storage we choose at a later date.
I pushed a change (on a non-default branch) to make stackwalk_server use FastSymbolSupplier and FastSourceLineResolver:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/minidump-stackwalk/rev/2dbe20a95b02

This means it can be pointed at a directory tree full of binary symbols and produce symbolized stack traces.
Why are we re-reading symbol files for every crash? It seems to me that Socorro should have enough crashes from the same version of Firefox that reading symbol files should not dominate throughput.

I guess this will require changes to http://mxr.mozilla.org/mozilla-central/source/tools/rb/fix_stack_using_bpsyms.py.
It's not symbol IO that's the problem, it's the actual parsing of the symbol files that dominates, even if the symbol files are completely cached in memory. The text-based symbol format is expensive to parse, especially when you have 50+MB symbol files like we generate from libxul. It only works out to ~1.5s per dump on my machine, but it's slower if the files aren't in memory cache, and 1.5s per dump is still pretty slow when you want to process a few million of them.
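
To illustrate why a binary format can avoid the parsing cost: with fixed-size, sorted records, a lookup can binary-search an mmapped file and touch only a few pages, whereas a text file has to be parsed line by line first. The sketch below uses a made-up record layout (two little-endian 32-bit integers per record). It is not Breakpad's actual binary symbol format, and the function names are hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record: (start_address, line_number), both uint32 little-endian.
RECORD = struct.Struct("<II")


def write_table(path, records):
    """Write (address, line) pairs sorted by address as fixed-size records."""
    with open(path, "wb") as f:
        for address, line in sorted(records):
            f.write(RECORD.pack(address, line))


def lookup(mapped, address):
    """Binary-search the mapped table for the last record at or below address."""
    lo, hi = 0, len(mapped) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        start, _ = RECORD.unpack_from(mapped, mid * RECORD.size)
        if start <= address:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return None  # address is below the first record
    return RECORD.unpack_from(mapped, (lo - 1) * RECORD.size)


def demo():
    """Build a tiny table in a temp dir and look up two addresses."""
    path = os.path.join(tempfile.mkdtemp(), "table.bin")
    write_table(path, [(0x1000, 10), (0x1040, 12), (0x2000, 30)])
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return lookup(mapped, 0x1044), lookup(mapped, 0x0FFF)
```

Here `demo()` returns `((0x1040, 12), None)`: the first address falls inside the record starting at 0x1040, and the second is below the first record.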

I'm not intending to change the output of dump_syms or the symbol files contained in the symbol zip files. I'm going to do all the conversion server-side, so the changes should be contained to Socorro.
Blocks: 609593
Blocks: 609596
Okay, all the dependent bugs are either waiting on reviews or IT action. This should be essentially ready to go pretty soon. We'll need to figure out a timeframe for pushing this to production so we can fix the immediate deps of this bug, which will convert our existing symbols to the new format. Once we do that, we can test the stackwalk_server changes in staging and roll them to production.

We'll probably want to do a comparison, processing the same crashes in production (with the old text-format symbol implementation) and staging (with the binary symbols) so we can verify that the results are the same.
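
One way to do that comparison is a plain line-by-line diff of the stackwalker output for the same minidump from each environment. The sketch below assumes the two outputs have already been captured as text. The function name and labels are illustrative only.

```python
import difflib


def diff_stackwalk_output(production, staging):
    """Return unified-diff lines between two outputs (empty list if equal)."""
    return list(
        difflib.unified_diff(
            production.splitlines(),
            staging.splitlines(),
            fromfile="production (text symbols)",
            tofile="staging (binary symbols)",
            lineterm="",
        )
    )
```

An empty result means the two runs symbolized the dump identically; any non-empty result shows exactly which frames differ.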
Target Milestone: --- → 1.7.8
Depends on: 573100
Summary: Design and implement new stackwalk_server → switch to binary format symbols with a minidump_stackwalk replacement
I don't think we're going to have time for this for 1.7.8.
Target Milestone: 1.7.8 → 1.9
We can punt this to the future if you'd like, it should speed up processing times but that's about it.
Whiteboard: Q3
Target Milestone: 1.9 → 2.3
Target Milestone: 2.3 → 2.3.1
Target Milestone: 2.3.1 → 2.3.2
Target Milestone: 2.3.2 → 2.4
Target Milestone: 2.4 → ---
Component: Socorro → General
Product: Webtools → Socorro
We're not going to do this. We're looking into using Postgres as a symbol store in bug 789493.
Assignee: ted → nobody
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
About-face! We're going to try and do this, after symbol upload is done with https://wiki.mozilla.org/CrashKill/Symbol_Submission
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Whiteboard: Q3
bug 1071724 will make this much easier because we can easily afford to transcode all the symbols and store two copies until we are ready to switch over.
Depends on: 1071724
still desired?
Component: General → Symbols
Flags: needinfo?(ted)
I don't think this is worth doing currently. It would speed up processing but I don't think we have a huge issue there.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 3 years ago
Flags: needinfo?(ted)
Resolution: --- → WONTFIX
Can you characterize the possible speed up?

Performance bottlenecks are one of the barriers to client-side symbolication. Perf improvements like this are a goal of Tecken.
Processing time is currently dominated by parsing symbol files. Changing the format of the files would remove the overhead of parsing. However, it also means we have to ensure that every consumer of symbol files can handle the new format.
I concur. The performance elephant is network I/O (or not), and our LRU caches (one in the Socorro processor and one in Tecken) take away 95+% of the opportunity to optimize the network I/O.
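
As a toy illustration of why a cache removes most of the fetches when many crashes share the same modules, consider the sketch below. It is not how the Socorro processor or Tecken caches are implemented; the fetch function, module name, and debug id are all stand-ins.

```python
from functools import lru_cache

fetch_count = 0


def fetch_symbol_file(debug_file, debug_id):
    """Stand-in for a network download; a real fetch would go here."""
    global fetch_count
    fetch_count += 1
    return f"symbols for {debug_file}/{debug_id}"


@lru_cache(maxsize=128)
def get_symbols(debug_file, debug_id):
    """Return the symbol file contents, fetching only on a cache miss."""
    return fetch_symbol_file(debug_file, debug_id)


# Many crashes from the same build ask for the same module again and again.
for _ in range(100):
    get_symbols("xul.pdb", "EXAMPLEDEBUGID0")
print(fetch_count)  # 1
```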

Also, having plaintext .sym files is nice because they're, well, plain text, so you can open them in the browser or in curl and quickly understand what you're looking at.