Closed Bug 1318758 Opened 8 years ago Closed 5 years ago

Create a top-crashers list from crash pings, separate from the existing one

Categories

(Toolkit :: Telemetry, defect, P3)

49 Branch

RESOLVED INCOMPLETE

People

(Reporter: chutten, Assigned: ddurst)

Attachments

(1 file)

Once bug 1317968 is in, we should start getting stacks sent through the pipeline.

See whether we can reproduce a telemetry-based "top crashers list[1]" for Nightly when they do start flowing. It'd be especially nice to see what differences there are in crash counts per stack and whatever else we can glean.

[1]: https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=53.0a1&days=7
Component: General → Telemetry
Product: Firefox → Toolkit
Priority: -- → P3
They're starting to come in: https://gist.github.com/chutten/65e146a68034ebdd8140a8b23a26facd

Not sure how to start crunching these into useful reports. They are... verbose structures. gsvelto, any hints?
Flags: needinfo?(gsvelto)
There's a short description of the format in our docs:

https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/crash-ping.html

I'll leave the NI for a more detailed explanation tomorrow.
To elaborate on this, the StackTraces entry has three interesting parts: the "crash_info" field, which describes what caused the crash; the "modules" array, which lists the executables/libraries loaded in memory along with the information needed to retrieve the associated symbol files; and the "threads" array, which contains the stack traces proper. Each entry in the "threads" array holds the stack frames for that thread. Each frame has the index of the module it belongs to (which we'll use for symbolication), the instruction pointer ("ip") of that frame and a trust value. In Socorro the common practice is to treat all frames that have trust "scan" or "none" as unreliable.
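As an illustration of the layout described above, here is a minimal sketch that pulls out the crashing thread and tags each frame as reliable or not. The helper name `crashing_thread_frames` and the sample payload are hypothetical; the "frames" key under each thread follows the working code posted later in this bug, and the "context" and "cfi" trust values are assumed examples of trusted frames.

```python
# Trust values that Socorro practice treats as unreliable, per the comment above.
UNRELIABLE_TRUST = {"scan", "none"}


def crashing_thread_frames(stack_traces):
    """Return the crashing thread's frames, each copied with a 'reliable' flag added."""
    index = stack_traces["crash_info"]["crashing_thread"]
    frames = stack_traces["threads"][index]["frames"]
    return [dict(frame, reliable=frame["trust"] not in UNRELIABLE_TRUST)
            for frame in frames]


# Hypothetical, hand-written payload (not taken from a real ping).
example = {
    "crash_info": {"crashing_thread": 1},
    "modules": [{"debug_file": "xul.pdb", "base_addr": "0x1000", "end_addr": "0x9000"}],
    "threads": [
        {"frames": [{"module_index": 0, "ip": "0x1010", "trust": "context"}]},
        {"frames": [
            {"module_index": 0, "ip": "0x2000", "trust": "context"},
            {"module_index": 0, "ip": "0x2400", "trust": "cfi"},
            {"module_index": 0, "ip": "0x3000", "trust": "scan"},
        ]},
    ],
}

tagged = crashing_thread_frames(example)
```

Tagging rather than dropping the unreliable frames keeps the stack intact, so a later step can decide whether to truncate or merely down-weight them.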

What we'd like to do is try to generate a signature out of this information. This involves a few steps:

- Symbolication. This will require going over the list of frames and asking for the associated symbol by sending Snappy information about each frame, i.e. the module (I think Snappy takes either the debug_id or code_id) and the offset, which can be obtained by subtracting the module's "base_addr" field from the frame's "ip" field. Some sanity checks should also be done, like ensuring that the IP falls between the module's "base_addr" and "end_addr" fields, etc...

- Generation of the signature. Admittedly I still don't know how to do it because I didn't have time to study how Socorro generates its signatures. Ted filed a bug specifically to document this (bug 1306643) so we might want to add to that if we're going to duplicate what Socorro does.
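The offset computation and sanity check from the symbolication step can be sketched as follows. This assumes "ip", "base_addr" and "end_addr" are hexadecimal strings (the working code later in this bug parses them with `int(..., 16)`) and that "end_addr" is exclusive; the function name `module_offset` is hypothetical.

```python
def module_offset(frame, modules):
    """Return (module_index, offset) for a frame, or None if a sanity check fails."""
    index = frame.get("module_index")
    if index is None or not 0 <= index < len(modules):
        return None  # frame does not point at a known module
    module = modules[index]
    ip = int(frame["ip"], 16)
    base = int(module["base_addr"], 16)
    end = int(module["end_addr"], 16)  # assumed to be exclusive
    if not base <= ip < end:
        return None  # IP falls outside the module it claims to belong to
    return index, ip - base


# Hypothetical values chosen for illustration.
modules = [{"base_addr": "0x10000", "end_addr": "0x20000"}]
inside = module_offset({"module_index": 0, "ip": "0x110d0"}, modules)
outside = module_offset({"module_index": 0, "ip": "0x30000"}, modules)
```

Frames that fail the checks return `None` so the caller can decide whether to skip them or send a placeholder.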

We could restrict ourselves to the crashing thread for now, which is the one specified by "crash_info.crashing_thread" (it's an index into the "threads" array). Once we have the signatures we should first of all check whether they match the ones from the corresponding reports in Socorro; the ping will contain the crash UUID under the "crashId" field, so if the user sent the report to Socorro it'll be easy to compare. Once we're sure that those match correctly we can start having fun with statistics, of course (which are the most common crashes? Are certain crashes underreported on Socorro? etc...).
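Once signatures exist, the statistics mentioned above amount to counting. A minimal sketch, assuming signatures have already been generated as plain strings by some means (this does not reproduce Socorro's signature algorithm); the function names and sample signatures are hypothetical.

```python
from collections import Counter


def top_crashers(signatures, limit=10):
    """Return the most frequent signatures as (signature, count) pairs."""
    return Counter(signatures).most_common(limit)


def compare_counts(ping_signatures, socorro_signatures):
    """Map each signature to (count in crash pings, count in Socorro)."""
    pings = Counter(ping_signatures)
    socorro = Counter(socorro_signatures)
    return {sig: (pings[sig], socorro[sig]) for sig in pings.keys() | socorro.keys()}


# Made-up signatures for illustration only.
from_pings = ["sig_a", "sig_a", "sig_a", "sig_b", "sig_b", "sig_c"]
from_socorro = ["sig_a", "sig_b", "sig_b"]

ranking = top_crashers(from_pings, limit=2)
comparison = compare_counts(from_pings, from_socorro)
```

A signature whose ping count is much higher than its Socorro count would be a candidate for "underreported on Socorro".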
Flags: needinfo?(gsvelto)
(In reply to Gabriele Svelto [:gsvelto] from comment #3)
> - Symbolication, this will require going over the list of frames, and asking
> for the associated symbol by sending to Snappy information about the frame,
> i.e. the module (I think Snappy takes either the debug_id or code_id) and
> the offset which can be obtained by subtracting the module's "base_addr"
> field from the frame's "ip" field. Some sanity checks should also be done
> like ensuring that the IP falls between the module's "base_addr" and
> "end_addr" fields, etc...

If you look at the docs for snappy:
https://github.com/mozilla/Snappy-Symbolication-Server#symserver

You can see that it accepts a JSON request which really only has two fields (ignoring "version"): "stacks" and "memoryMap". The former is an array of arrays of [module_index, module_offset], which is basically an array of threads, where each thread is an array of stack frames consisting of module and offset. The latter is an array of [debug_filename, debug_id] for each module, where they're in order so that the module_index in the stack frames matches up. Since I took the time to write this down I figured I'd put that info somewhere useful as well: https://github.com/mozilla/Snappy-Symbolication-Server/pull/66

Given that, you could symbolicate crash pings with something like (untested, copy-pasting from your gist):
```
import json
import requests

symbolicated_stacks = stacks.map(lambda s: requests.post('http://symbolapi.mozilla.org/', data=json.dumps({
    'stacks': [[[f['module_index'], f['ip'] - s['modules'][f['module_index']]] for f in s['threads'][s['crash_info']['crashing_thread']]]],
    'memoryMap': [[m['debug_file'], m['debug_id']] for m in s['modules']],
    'version': 4})).json()['symbolicatedStacks'])
```

That would have been a lot easier to write if I had just asked you how to get a hold of that data in Telemetry. Obviously that will do a single HTTP POST to the Snappy server per crash ping. The resulting stack frames will be of the form 'XREMain::XRE_mainRun() (in xul.pdb)' because that's what Snappy returns, but that should be close enough for testing. (We might want to make some changes to Snappy if we plan on using this in production.)
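Since Snappy returns frames as strings of the form 'function (in module)', a later grouping step would probably want the two parts separately. A minimal sketch, assuming every symbolicated frame follows exactly that shape; the helper name `split_symbol` is hypothetical, and strings that do not match are passed through untouched with a module of `None`.

```python
import re

# Matches "<function> (in <module>)", where the module name has no parentheses.
SYMBOL_RE = re.compile(r"^(?P<function>.*) \(in (?P<module>[^()]*)\)$")


def split_symbol(symbol):
    """Split a Snappy frame string into (function, module)."""
    match = SYMBOL_RE.match(symbol)
    if match is None:
        return symbol, None  # unexpected shape: keep the raw string
    return match.group("function"), match.group("module")


parsed = split_symbol("XREMain::XRE_mainRun() (in xul.pdb)")
unparsed = split_symbol("0x7ff6deadbeef")
```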

> - Generation of the signature. Admittedly I still don't know how to do it
> because I didn't have time to study how Socorro generates its signatures.
> Ted filed a bug specifically to document this (bug 1306643) so we might want
> to add to that if we're going to duplicate what Socorro does.

I think this is going to be a PITA as it currently stands unless we add an API to Socorro to generate signatures for us. (bug 828452)
That Snappy PR got merged quickly, so the Snappy API is now documented here:
https://github.com/mozilla/Snappy-Symbolication-Server#symserver
Attached file snappy request
I needed to manipulate the magic incantation a little:

import json
import requests

def symbolicate(s):
    # Only the crashing thread's frames are sent.
    frames = s['threads'][s['crash_info']['crashing_thread']]['frames']
    data = json.dumps({
        # Each frame becomes [module_index, offset]; ip and base_addr are hex strings.
        'stacks': [[[f['module_index'], int(f['ip'], 16) - int(s['modules'][f['module_index']]['base_addr'], 16)] for f in frames]],
        # Modules stay in ping order so that module_index lines up.
        'memoryMap': [[m['debug_file'], m['debug_id']] for m in s['modules']], 'version': 4})
    result = requests.post('http://symbolapi.mozilla.org/', data=data)
    return result.json()['symbolicatedStacks']

Which gets us close, but snappy returns a 400 with an empty response. Clearly something's up. The request itself is the attachment. Maybe "XUL" and "firefox" are not the names of the debug files that snappy is expecting? Maybe it's because it's a Mac crash?
Maybe it was because I was using an old crash, or maybe it was because it was a Mac crash, but when I tried it on a Windows crash from 20170106, I received (after some time) a 200 OK and

[[u"mozilla::`anonymous namespace'::RunWatchdog (in xul.pdb)",
  u'PR_NativeRunThread (in nss3.pdb)',
  u'pr_root (in nss3.pdb)',
  u'o__realloc_base (in ucrtbase.pdb)',
  u'BaseThreadInitThunk (in kernel32.pdb)',
  u'RtlUserThreadStart (in ntdll.pdb)',
  u'GetLegacyComposition (in kernelbase.pdb)']]

Seems legit to me. Here's my code: https://gist.github.com/chutten/65e146a68034ebdd8140a8b23a26facd
Ah-ha. :ted noticed that the problem appears to be snappy 400'ing on any request that contains a module name that has a space in it. ni?Kirk for determining if this is a "feature" of the new service as well, and if there happens to be a staging version of the new service we can test our magic incantation against.
Flags: needinfo?(ksteuber)
I'm afraid that the "new service" still hasn't been deployed. The server running on symbolapi.mozilla.org is still the old version. I am still working on getting the new version deployed. I'm afraid there is not much I can do to help with problems with the old version.

The server code lives here [1], so you can feel free to try it out locally. I don't think I have tested module names with spaces in them. If you give me an example, I would be happy to find out (and fix it if it doesn't work).

[1] https://github.com/mozilla/Snappy-Symbolication-Server
Flags: needinfo?(ksteuber)
Here's the one :ted found:

curl -v -d '{"stacks":[[[0,4304]]],"memoryMap":[["firefox","7DE7CD0C3A533406BC090D445534DCB80"], ["foo (deleted)", ""]],"version":4}' http://symbolapi.mozilla.org/
Flags: needinfo?(ksteuber)
My local server gives this response:

> {"symbolicatedStacks": [["main (in firefox)"]], "knownModules": [true, false]}

So I think that symbolapi.mozilla.org will work properly once it is updated to the new version.
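The "knownModules" array in that response can be used to tell which modules had no symbols. A minimal sketch, assuming "knownModules" is parallel to the request's "memoryMap" (the response above has one boolean per module sent, which is consistent with that); the function name is hypothetical and the data is copied from the request and response quoted in this bug.

```python
def modules_without_symbols(memory_map, response):
    """Return the names of modules the server reported as unknown."""
    known = response["knownModules"]
    return [name for (name, _debug_id), is_known in zip(memory_map, known)
            if not is_known]


memory_map = [["firefox", "7DE7CD0C3A533406BC090D445534DCB80"], ["foo (deleted)", ""]]
response = {"symbolicatedStacks": [["main (in firefox)"]], "knownModules": [True, False]}

missing = modules_without_symbols(memory_map, response)
```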
Flags: needinfo?(ksteuber)
(In reply to Kirk Steuber [:bytesized] from comment #11)
> So I think that symbolapi.mozilla.org will work properly once it is updated
> to the new version.

+1

Is there anything preventing us from doing a permanent test deployment of your new code? We don't need to replace http://symbolapi.mozilla.org to test it.
Is there a bug tracking the deployment of the rewritten snappy server?
It looks like the current Snappy instance doesn't like spaces or parentheses. In the interim to get data out you could just remove them, since these modules aren't likely to have symbols anyway. Try:
'memoryMap': [[m['debug_file'].translate(None, ' ()'), m['debug_id']] for m in s['modules']]
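The two-argument `translate(None, ' ()')` form above only works on Python 2 byte strings. A minimal sketch of the same workaround for Python 3, where the deletion table comes from `str.maketrans`; the helper names are hypothetical.

```python
# Translation table that deletes spaces and parentheses.
_STRIP = str.maketrans("", "", " ()")


def sanitize_module_name(name):
    """Remove the characters the current Snappy instance rejects."""
    return name.translate(_STRIP)


def build_memory_map(modules):
    """Build the 'memoryMap' list with sanitized debug file names."""
    return [[sanitize_module_name(m["debug_file"]), m["debug_id"]] for m in modules]


cleaned = build_memory_map([
    {"debug_file": "firefox", "debug_id": "7DE7CD0C3A533406BC090D445534DCB80"},
    {"debug_file": "foo (deleted)", "debug_id": ""},
])
```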
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #13)
> Is there a bug tracking the deployment of the rewritten snappy server?

There isn't a good one. We'll file when we rewrite the production one (late Feb).
There is a new prototype hosted on Heroku. 
The URL is https://snappy2-zero.herokuapp.com/ and it's a rewrite based on Snappy-Symbolication-Server. 

(As a prototype it's using a limited Redis store as a cache of downloaded symbol files. Note the word limited. This will get vastly better when we dedicate some proper resources and host it properly on AWS)

Mike Conley sent me a monster of a JSON dump of unsymbolicated crashes from crash pings. I threw them against this server and there were no errors. It did take an age though. It's hard for me to know exactly whether the output is what a C++ debugger actually expects.
(In reply to Peter Bengtsson [:peterbe] from comment #16)
> It did take an age though. It's hard for me to know exactly whether the
> output is what a C++ debugger actually expects.

Keep in mind that our crash ping stack traces often use stack scanning when frame pointers are not available (e.g. on Win64 or Linux), so there's a chance that after a few frames the results are wrong, not because the symbolication isn't doing its job but because the IPs are wrong.
The win64 issue is bug 1333126 right? That probably needs to be picked off the backlog soon considering that we're planning on rolling win64 out by default starting in 55.
Flags: needinfo?(gsvelto)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #18)
> The win64 issue is bug 1333126 right? That probably needs to be picked off
> the backlog soon considering that we're planning on rolling win64 out by
> default starting in 55.

Yes, and I hope to be able to tackle it as soon as I'm done with all the crash pings-related stuff.
Flags: needinfo?(gsvelto)
Back to the original intent of this bug, adding agashlin because #c0 covers one of our next steps.
Not...sure what the status or future of this is, actually. To :ddurst who might decide to just close it in favour of work already completed.
Assignee: chutten → ddurst
Just one quick not-so-small change...
Summary: See what stacks we get over Telemetry crash pings → Create a top-crashers list from crash pings, separate from the existing one

If we're not interested in this kind of data, then there's no good reason to track this.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INCOMPLETE

I was told that someone, somewhere was working on this in the last few months. I can't remember who though :(
