Gather metrics about crashes from the extension process

RESOLVED FIXED in Firefox 55

Status

enhancement
P1
normal
RESOLVED FIXED
2 years ago
10 months ago

People

(Reporter: bsilverberg, Assigned: gsvelto)

Tracking

(Blocks 1 bug)

51 Branch
mozilla55
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox55 fixed)

Details

(Whiteboard: [metrics] triaged)

Attachments

(2 attachments)

(Reporter)

Description

2 years ago
This bug is about collecting info when an extension causes a crash. From my notes: 

- We need to annotate the crash report with process information: web content, file urls, large allocations, extensions.
- Identify which extension crashed the process.
(Reporter)

Comment 1

2 years ago
Andy, this is the bug about crash stats for the extension process.
Assignee: bob.silverberg → nobody
Flags: needinfo?(amckay)

Updated

2 years ago
Assignee: nobody → aswan
Flags: needinfo?(amckay)
I think the description here is quite misleading, the intention is that the extension process remains a content process.  Since we use content processes for a bunch of different things, content processes have a "remoteType" attribute that distinguishes the extension process from other types.  This bug is just to add an annotation to content process crash reports with that attribute.

And the item that says "Identify which extension crashed the process" isn't at all practical, that's the job of whoever or whatever processes crash reports...

So, for adding the annotation I've attached a simple patch that I think does this.  But I'm hampered by the fact that we don't generate crash reports when the extension process crashes.  I filed bug 1362457 for that but I don't have the background (or the time to acquire the necessary background) to dive deeply into that at the moment.
Depends on: 1362457
Unassigning myself while that dependency remains unresolved.
Assignee: aswan → nobody
When the environment is set up just right (see bug 1362457) then the patch in this bug WFM - I've submitted https://crash-stats.mozilla.com/report/index/701b6933-e290-4301-94ef-b8e5f0170516

I can see "RemoteType": "extension" in the annotation collected by crash-stats. Also the URL is set to a moz-extension://{UUID}/.../ which is handy.

I think there are three dependent issues which need to be resolved:

1) crash-stats needs to differentiate extension and web content crashes in the UI and search
2) the Telemetry crash ping needs some kind of annotation too (I can't tell the different looking at about:telemetry between an extension and web content crash)
3) there should be a test along the lines of http://searchfox.org/mozilla-central/source/dom/ipc/tests/test_CrashService_crash.html to ensure that crashes for the extension process work and the proper annotations are attached

There should be a test for the Telemetry crash ping too. Not sure if that should be part of #3 or in the Telemetry tests, needs investigation. Probably whatever we do for web content crashes.

However - these could all be followup bugs, I think landing this as-is would be a good thing.
Depends on: 1365368
Depends on: 1365380
(Reporter)

Comment 6

2 years ago
Thanks for doing the research into crash reporting, updating the bug and filing the Socorro bugs, Rob. From comment 5 it sounds like there are still a couple of things that need to be done (numbers 2 and 3). Shall I file follow-up bugs for those?

Andrew, it sounds like the patch attached to this bug should land, but it doesn't look like review has been requested. Can you proceed with trying to get it landed?
Flags: needinfo?(rhelmer)
Flags: needinfo?(aswan)
(In reply to Bob Silverberg [:bsilverberg] from comment #6)
> Andrew, it sounds like the patch attached to this bug should land, but it
> doesn't look like review has been requested. Can you proceed with trying to
> get it landed?

It should have a test before it lands.  Ideally we would also just do point 2 from comment 5 in this bug.  I have a bunch of other things on my plate right now so I won't be able to do this for some time.  Bob, feel free to take it over if you like, otherwise I'll get to it eventually
Flags: needinfo?(aswan)
(In reply to Bob Silverberg [:bsilverberg] from comment #6)
> Thanks for doing the research into crash reporting, updating the bug and
> filing the Socorro bugs, Rob. From comment 5 it sounds like there are still
> a couple of things that need to be done (numbers 2 and 3). Shall I file
> follow-up bugs for those?
> 
> Andrew, it sounds like the patch attached to this bug should land, but it
> doesn't look like review has been requested. Can you proceed with trying to
> get it landed?

I think point 2 (making it clear in the Telemetry crash ping that this is extension not web content) might be handled by bug 1361661 but I'd suggest checking in with Georg to make sure there's nothing they need from us.
Flags: needinfo?(rhelmer)

Comment 9

2 years ago
It's unlikely that the work in bug 1361661 will affect the crash ping.

But we really need to talk about not just the data collection but how/who it's going to be used. Is the addons team going to monitor the crash rate, for example?

There are downstream datasets that are going to need to be explicitly aware of this, and assuming addon crashes are supposed to be monitored as part of release monitoring that is an explicit work item as well.
(Reporter)

Comment 10

2 years ago
(In reply to Robert Helmer [:rhelmer] from comment #8)
> (In reply to Bob Silverberg [:bsilverberg] from comment #6)
> > Thanks for doing the research into crash reporting, updating the bug and
> > filing the Socorro bugs, Rob. From comment 5 it sounds like there are still
> > a couple of things that need to be done (numbers 2 and 3). Shall I file
> > follow-up bugs for those?
> > 
> > Andrew, it sounds like the patch attached to this bug should land, but it
> > doesn't look like review has been requested. Can you proceed with trying to
> > get it landed?
> 
> I think point 2 (making it clear in the Telemetry crash ping that this is
> extension not web content) might be handled by bug 1361661 but I'd suggest
> checking in with Georg to make sure there's nothing they need from us.

Comment 9 suggests there isn't anything in bug 1361661 that will address item #2. Rob, do you know what changes are required to make this happen? I can try to make the changes if you can give me a clue about what needs to be done.
Flags: needinfo?(rhelmer)
(In reply to Bob Silverberg [:bsilverberg] from comment #10)
> (In reply to Robert Helmer [:rhelmer] from comment #8)
> > (In reply to Bob Silverberg [:bsilverberg] from comment #6)
> > > Thanks for doing the research into crash reporting, updating the bug and
> > > filing the Socorro bugs, Rob. From comment 5 it sounds like there are still
> > > a couple of things that need to be done (numbers 2 and 3). Shall I file
> > > follow-up bugs for those?
> > > 
> > > Andrew, it sounds like the patch attached to this bug should land, but it
> > > doesn't look like review has been requested. Can you proceed with trying to
> > > get it landed?
> > 
> > I think point 2 (making it clear in the Telemetry crash ping that this is
> > extension not web content) might be handled by bug 1361661 but I'd suggest
> > checking in with Georg to make sure there's nothing they need from us.
> 
> Comment 9 suggests there isn't anything in bug 1361661 that will address
> item #2. Rob, do you know what changes are required to make this happen? I
> can try to make the changes if you can give me a clue about what needs to be
> done.

I know that there is a "crash ping" that Telemetry sends, I believe it is documented here:

http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/crash-ping.rst

I think we just want to make sure that the RemoteType is also considered, so we can tell web content crashes apart from extension content crashes.

Other than that I'd defer to bsmedberg and Georg if there is anything else we need to do - addressing comment 9 with regard to how this will be monitored would be good. We could set up a simple dashboard that uses Telemetry perhaps, just need someone to watch it, ideally with some kind of alert that goes to a mailing list the addons team watches.

I could see for instance watching Telemetry for an elevated crash rate, and then using crash-stats for the extra details to determine the root cause.
Flags: needinfo?(rhelmer) → needinfo?(gfritzsche)
(In reply to Robert Helmer [:rhelmer] from comment #11)
> (In reply to Bob Silverberg [:bsilverberg] from comment #10)
> > (In reply to Robert Helmer [:rhelmer] from comment #8)
> > > (In reply to Bob Silverberg [:bsilverberg] from comment #6)
> > > > Thanks for doing the research into crash reporting, updating the bug and
> > > > filing the Socorro bugs, Rob. From comment 5 it sounds like there are still
> > > > a couple of things that need to be done (numbers 2 and 3). Shall I file
> > > > follow-up bugs for those?
> > > > 
> > > > Andrew, it sounds like the patch attached to this bug should land, but it
> > > > doesn't look like review has been requested. Can you proceed with trying to
> > > > get it landed?
> > > 
> > > I think point 2 (making it clear in the Telemetry crash ping that this is
> > > extension not web content) might be handled by bug 1361661 but I'd suggest
> > > checking in with Georg to make sure there's nothing they need from us.
> > 
> > Comment 9 suggests there isn't anything in bug 1361661 that will address
> > item #2. Rob, do you know what changes are required to make this happen? I
> > can try to make the changes if you can give me a clue about what needs to be
> > done.
> 
> I know that there is a "crash ping" that Telemetry sends, I believe it is
> documented here:
> 
> http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/
> docs/data/crash-ping.rst
> 
> I think we just want to make sure that the RemoteType is also considered, so
> we can tell web content crashes apart from extension content crashes.

Correct, bug 1361661 doesn't help with any crash metrics here.
@ddurst, can you answer or redirect on the crash ping?

We also need to make sure that other consumers of this data are aware of changes.
You can flag mdoglio for the crash-aggregates work.

What about the other crash metrics?
http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/concepts/crashes.html#child-process-crashes
Do we need to break the extension process out here?
Flags: needinfo?(gfritzsche)
We're going more towards Telemetry for monitoring rates/overall health and reports for deeper dives, as you said. If only because of that, we should be able to separate extension content crashes from web content crashes. It should behave as if it were its own processType -- even if it's identical in function and form to content process crashes. Then everything downstream using this data (be it Telemetry or crash reports) would be altered consistently to reflect processType.

We've talked about this briefly via email, but I'll NI gsvelto so that we can look into that more. It may merit its own bug, depending on the decision of how to handle it.
Flags: needinfo?(gsvelto)
(Assignee)

Comment 14

2 years ago
Sorry for the late reply. I've dug our sources a little bit more to understand what the process' remote type actually meant and now I'm unsure whether we should give the web extension process it's own process type or not. We currently have three remote types for content processes: "web" and "file" (to distinguish between remote and locally stored web pages), and "extension".

I'm not sure what was the rationale behind the extension process re-using a content process with a different remote type; but in my eyes it signals the explicit intent of using a content process. To clarify what I mean, different processes (GPU, GMP plugin, etc...) explicitly declare a separate process type and they do so by also re-implementing the parent/child IPC machinery.

The web extension process doesn't seem to need this level of distinction, so the plan described by Robert in comment 5 seems like a reasonable approach. If there's consensus around this I'll be happy to implement the necessary changes (tests included).
Flags: needinfo?(gsvelto)
(Reporter)

Comment 15

2 years ago
Rob, can you please reply to comment #14, above?
Flags: needinfo?(rhelmer)
(In reply to Bob Silverberg [:bsilverberg] from comment #15)
> Rob, can you please reply to comment #14, above?

I think aswan could better answer why we don't have a unique process type - I don't have strong opinions on it.
Flags: needinfo?(rhelmer) → needinfo?(aswan)
(In reply to Robert Helmer [:rhelmer] from comment #16)
> I think aswan could better answer why we don't have a unique process type -
> I don't have strong opinions on it.

Let me flip the question around -- why should the extension process have its own process type?  That would be a bunch of work, for what purpose?  That's a rhetorical question though, no need for an answer.  Lets just take :gsvelto up on the offer from the end of comment 14.
Flags: needinfo?(aswan) → needinfo?(gsvelto)
(Assignee)

Comment 18

2 years ago
Alright, taking this. I'll be adding the necessary annotations both to the crash dumps and the crash pings. As a bonus this is going to allow us to distinguish between content process crashes from remote (web) and local (file) pages too.
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Flags: needinfo?(gsvelto)
Comment hidden (mozreview-request)
(Assignee)

Comment 20

2 years ago
Here's the patch. Benjamin, consider this also a data review since I'm adding a field to the crash ping (metadata.RemoteType). I've included documentation and tests.

Comment 21

2 years ago
mozreview-review
Comment on attachment 8873387 [details]
Bug 1353168 - Add a crash annotation to distinguish between web, file and extension content processes;

https://reviewboard.mozilla.org/r/144820/#review149220

r+dr=me

I suspect we need a followup bug to get this data added to summary and crash aggregates derived data. Could you file that?
Attachment #8873387 - Flags: review?(benjamin) → review+

Comment 22

2 years ago
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9d26051a28ea
Add a crash annotation to distinguish between web, file and extension content processes; r=bsmedberg
(Assignee)

Comment 23

2 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #21)
> I suspect we need a followup bug to get this data added to summary and crash
> aggregates derived data. Could you file that?

Yes, I'll do it right away.
(Assignee)

Updated

2 years ago
Blocks: 1370178
(Assignee)

Comment 24

2 years ago
Filed bug 1370178 for showing this data in the crash aggregates; I hope it's in the right component.

Comment 25

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9d26051a28ea
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla55

Updated

10 months ago
Product: Toolkit → WebExtensions
You need to log in before you can comment on or make changes to this bug.