Closed Bug 1621638 Opened 4 years ago Closed 4 years ago

[research] switch to symbolic crate for parsing sym files

Categories

(Eliot :: General, enhancement, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

From https://bugzilla.mozilla.org/show_bug.cgi?id=1614928#c2 :

As a side note, we might want to switch to a Rust parser for sym files. calixte says we're using this one here:

https://docs.rs/symbolic-debuginfo/6.1.4/symbolic_debuginfo/breakpad/struct.BreakpadObject.html

That's maintained by the Sentry folks. It has Python bindings, too:

https://github.com/getsentry/symbolic#python-usage

Can we ditch our home-grown sym parser and switch to that?

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P2

I can install symbolic without a problem. I was able to throw together a proof-of-concept script that opens a SYM file, parses it, and returns a LineInfo object.

I then spent some quality time looking at the Tecken code. The SYM file parsing is entangled with symbol file downloading, caching of parsed output, and also the symbolication view. I'm now extracting the SYM file parsing code into a separate module with well-defined boundaries and tests. That makes it easier to build a second SYM file parser using symbolic, write some scaffolding to parse a file with both and compare the results, and then iterate on that.
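The compare-both-parsers scaffolding could look roughly like the sketch below. It assumes each parser can be wrapped in a callable that maps a module offset to a function name (or None); the `Lookup` type and `compare_lookups` name are made up for illustration and are not Tecken code.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical interface: a lookup takes a module offset and returns the
# function name covering it, or None if no symbol matches.
Lookup = Callable[[int], Optional[str]]


def compare_lookups(
    old_lookup: Lookup, new_lookup: Lookup, offsets: Iterable[int]
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Return (offset, old_result, new_result) wherever the parsers disagree."""
    mismatches = []
    for offset in offsets:
        old_result = old_lookup(offset)
        new_result = new_lookup(offset)
        if old_result != new_result:
            mismatches.append((offset, old_result, new_result))
    return mismatches
```

Feeding it the offsets from real symbolication requests would give a list of disagreements to iterate on.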

So far, the project still seems doable. I haven't hit any hard blockers yet.

Some questions that still need answering:

  1. Is the SYM file parsing code too entangled to be extracted without breaking other things?
  2. Does symbolic parse the SYM file and return the information the symbolicate view needs?
  3. Is the symbolic parsing code meaningfully slower than our current parser?

I think I've gotten about as far as I can go with researching this.

The current symbolicate API v4/v5 view code does the following:

  1. go through the stack and known modules to figure out which modules it needs symbols for
  2. find and download the SYM files from AWS S3--it doesn't save these to disk, but rather passes generators to the parser
  3. parse the SYM file stream from the downloader returning symbol maps; it caches these symbol maps in Redis for future optimization
  4. look up the symbols on the stack using the symbol maps and return the result
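To make steps 3 and 4 concrete, here is a minimal sketch of parsing a stream of SYM lines into a symbol map and looking up an offset. It is not the Tecken implementation: the function names are made up, it only reads FUNC and PUBLIC records, and it assumes an offset belongs to the nearest symbol at or below it (function sizes are ignored).

```python
import bisect
from typing import Dict, Iterable, List, Optional, Tuple


def parse_sym_lines(lines: Iterable[str]) -> Dict[int, str]:
    """Build {address: name} from FUNC and PUBLIC records in a SYM stream."""
    symbol_map: Dict[int, str] = {}
    for line in lines:
        fields = line.rstrip("\r\n").split(" ")
        if fields[0] not in ("FUNC", "PUBLIC"):
            continue  # MODULE, FILE, STACK, and line records are skipped here
        if len(fields) > 1 and fields[1] == "m":
            fields.pop(1)  # optional "m" flag before the address
        if fields[0] == "FUNC" and len(fields) >= 5:
            # FUNC <address> <size> <param_size> <name>
            symbol_map[int(fields[1], 16)] = " ".join(fields[4:])
        elif fields[0] == "PUBLIC" and len(fields) >= 4:
            # PUBLIC <address> <param_size> <name>; a FUNC at the same
            # address takes precedence regardless of record order
            symbol_map.setdefault(int(fields[1], 16), " ".join(fields[3:]))
    return symbol_map


def build_index(symbol_map: Dict[int, str]) -> Tuple[List[int], List[str]]:
    """Sort the map once so lookups can binary-search the addresses."""
    addresses = sorted(symbol_map)
    return addresses, [symbol_map[address] for address in addresses]


def lookup(index: Tuple[List[int], List[str]], offset: int) -> Optional[str]:
    """Return the name of the nearest symbol at or below offset, if any."""
    addresses, names = index
    position = bisect.bisect_right(addresses, offset) - 1
    return names[position] if position >= 0 else None
```

A plain dict like this is what makes the Redis caching in step 3 possible: the parsed result is ordinary data that can be serialized.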

symbolic is a Rust crate with a Python wrapper. Its API differs from our home-grown parser's, so we can't drop it into our existing code:

  1. symbolic can only parse SYM files on disk--it doesn't work with bytes or byte arrays
  2. the result of parsing a SYM file is a SymCache which only lets you do lookups--you can't extract the symbol information so we can't continue using our Redis-based cache
  3. we can save SYMC files to disk; a SYMC file takes about half the space of the SYM file and parsing is suuuuper fast

Given that, I think we need to write a new symbolicate API view that does this:

  1. go through the stack and known modules to figure out which modules it needs symbols for
  2. find and download the SYM files from AWS S3
  3. for each SYM file, save it to disk, parse it with symbolic, then save the SYMC file to disk for future optimization
  4. look up symbols on the stack using symcache and return the result
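Step 3 of that flow could be sketched as below. The `download` and `convert` callables are placeholders for the S3 downloader and the symbolic SYM-to-SYMC conversion; their signatures, the file naming scheme, and the `get_symc_path` name are all assumptions made for illustration. It also assumes the debug filename contains no path separators.

```python
import os
from typing import Callable


def get_symc_path(
    cache_dir: str,
    debug_filename: str,
    debug_id: str,
    download: Callable[[str, str], bytes],
    convert: Callable[[str, str], None],
) -> str:
    """Return the path of a cached SYMC file, building it on a cache miss."""
    base = os.path.join(cache_dir, f"{debug_filename}_{debug_id}")
    symc_path = base + ".symc"
    if os.path.exists(symc_path):
        return symc_path  # cache hit: no download, no parsing

    os.makedirs(cache_dir, exist_ok=True)
    sym_path = base + ".sym"
    with open(sym_path, "wb") as fp:
        fp.write(download(debug_filename, debug_id))
    try:
        # Convert into a temp file, then rename, so a concurrent reader
        # never sees a half-written SYMC file.
        tmp_path = symc_path + ".tmp"
        convert(sym_path, tmp_path)
        os.replace(tmp_path, symc_path)
    finally:
        os.remove(sym_path)  # the SYM file is only needed for conversion
    return symc_path
```

Passing the downloader and converter in as arguments keeps the caching logic testable without S3 or symbolic installed.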

The symbolic version uses disk rather than Redis for caching. It'll accumulate SYMC files, so we'll need an LRU SYMC manager. I don't think Tecken has anything like that which we can reuse, so this is something new.
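A first cut at the LRU SYMC manager might be as small as the sketch below. It assumes the cache is a flat directory of files and uses each file's mtime as its "last used" timestamp, bumped on every lookup (atime isn't reliable on filesystems mounted with noatime or relatime). The `mark_used` and `evict_lru` names are hypothetical.

```python
import os
from typing import List


def mark_used(path: str) -> None:
    """Bump a cached file's mtime to now; call this on every cache hit."""
    os.utime(path, None)


def evict_lru(cache_dir: str, max_bytes: int) -> List[str]:
    """Delete least-recently-used files until the cache fits in max_bytes."""
    entries = []
    total = 0
    for entry in os.scandir(cache_dir):
        if entry.is_file():
            stat = entry.stat()
            entries.append((stat.st_mtime, entry.path, stat.st_size))
            total += stat.st_size

    removed = []
    for _mtime, path, size in sorted(entries):  # oldest mtime first
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

Running `evict_lru` after each new SYMC file is written (or on a timer) would keep disk usage bounded; locking between concurrent workers is left out of this sketch.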

I have no idea how switching to symbolic will affect the performance of the symbolicate API.

I don't know if switching to symbolic will generate the same symbolicate API results. I bet if it doesn't, the symbolic version will be more correct.

I think if we want to switch to symbolic for the symbolicate API, then we should write a v6 API endpoint that we can compare with the v5 API endpoint for correctness and load testing, iterate on the v6 API endpoint, then migrate users and deprecate the v4 and v5 API endpoints.

That feels doable, but it's a big project--it's a rewrite of the symbolicate API.

Turning this into a research bug. The next step is figuring out whether we want to do this and then generating a project. I'll keep this open until we figure out whether to go forward or not.

Summary: switch to symbolic crate for parsing sym files → [research] switch to symbolic crate for parsing sym files

Gabriele, Calixte, and Markus pointed out that once Tecken is using symbolic, it can parse symbols out of binary artifacts rather than looking at the SYM file, which is missing all kinds of information like inlined functions.

We should look at that as a follow-up step.

I still have a couple of things to figure out, like how to manage the disk cache.

However, I think this is the way forward. I wrote up bug #1636210 for that work.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Component: Symbolication → General
Product: Tecken → Eliot