Closed Bug 1353168 Opened 3 years ago Closed 3 years ago

Gather metrics about crashes from the extension process

Categories

(WebExtensions :: General, enhancement, P1)

51 Branch
enhancement

Tracking

(firefox55 fixed)

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: bsilverberg, Assigned: gsvelto)

References

(Blocks 1 open bug)

Details

(Whiteboard: [metrics] triaged)

Attachments

(2 files)

This bug is about collecting info when an extension causes a crash. From my notes: 

- We need to annotate the crash report with process information: web content, file urls, large allocations, extensions.
- Identify which extension crashed the process.
Andy, this is the bug about crash stats for the extension process.
Assignee: bob.silverberg → nobody
Flags: needinfo?(amckay)
Assignee: nobody → aswan
Flags: needinfo?(amckay)
I think the description here is quite misleading, the intention is that the extension process remains a content process.  Since we use content processes for a bunch of different things, content processes have a "remoteType" attribute that distinguishes the extension process from other types.  This bug is just to add an annotation to content process crash reports with that attribute.

And the item that says "Identify which extension crashed the process" isn't at all practical, that's the job of whoever or whatever processes crash reports...

So, for adding the annotation I've attached a simple patch that I think does this.  But I'm hampered by the fact that we don't generate crash reports when the extension process crashes.  I filed bug 1362457 for that but I don't have the background (or the time to acquire the necessary background) to dive deeply into that at the moment.
Depends on: 1362457
Unassigning myself while that dependency remains unresolved.
Assignee: aswan → nobody
When the environment is set up just right (see bug 1362457) then the patch in this bug WFM - I've submitted https://crash-stats.mozilla.com/report/index/701b6933-e290-4301-94ef-b8e5f0170516

I can see "RemoteType": "extension" in the annotation collected by crash-stats. Also the URL is set to a moz-extension://{UUID}/.../ which is handy.

I think there are three dependent issues which need to be resolved:

1) crash-stats needs to differentiate extension and web content crashes in the UI and search
2) the Telemetry crash ping needs some kind of annotation too (I can't tell the different looking at about:telemetry between an extension and web content crash)
3) there should be a test along the lines of http://searchfox.org/mozilla-central/source/dom/ipc/tests/test_CrashService_crash.html to ensure that crashes for the extension process work and the proper annotations are attached

There should be a test for the Telemetry crash ping too. Not sure if that should be part of #3 or in the Telemetry tests, needs investigation. Probably whatever we do for web content crashes.

However - these could all be followup bugs, I think landing this as-is would be a good thing.
Depends on: 1365368
Depends on: 1365380
Thanks for doing the research into crash reporting, updating the bug and filing the Socorro bugs, Rob. From comment 5 it sounds like there are still a couple of things that need to be done (numbers 2 and 3). Shall I file follow-up bugs for those?

Andrew, it sounds like the patch attached to this bug should land, but it doesn't look like review has been requested. Can you proceed with trying to get it landed?
Flags: needinfo?(rhelmer)
Flags: needinfo?(aswan)
(In reply to Bob Silverberg [:bsilverberg] from comment #6)
> Andrew, it sounds like the patch attached to this bug should land, but it
> doesn't look like review has been requested. Can you proceed with trying to
> get it landed?

It should have a test before it lands.  Ideally we would also just do point 2 from comment 5 in this bug.  I have a bunch of other things on my plate right now so I won't be able to do this for some time.  Bob, feel free to take it over if you like, otherwise I'll get to it eventually
Flags: needinfo?(aswan)
(In reply to Bob Silverberg [:bsilverberg] from comment #6)
> Thanks for doing the research into crash reporting, updating the bug and
> filing the Socorro bugs, Rob. From comment 5 it sounds like there are still
> a couple of things that need to be done (numbers 2 and 3). Shall I file
> follow-up bugs for those?
> 
> Andrew, it sounds like the patch attached to this bug should land, but it
> doesn't look like review has been requested. Can you proceed with trying to
> get it landed?

I think point 2 (making it clear in the Telemetry crash ping that this is extension not web content) might be handled by bug 1361661 but I'd suggest checking in with Georg to make sure there's nothing they need from us.
Flags: needinfo?(rhelmer)
It's unlikely that the work in bug 1361661 will affect the crash ping.

But we really need to talk about not just the data collection but how/who it's going to be used. Is the addons team going to monitor the crash rate, for example?

There are downstream datasets that are going to need to be explicitly aware of this, and assuming addon crashes are supposed to be monitored as part of release monitoring that is an explicit work item as well.
(In reply to Robert Helmer [:rhelmer] from comment #8)
> (In reply to Bob Silverberg [:bsilverberg] from comment #6)
> > Thanks for doing the research into crash reporting, updating the bug and
> > filing the Socorro bugs, Rob. From comment 5 it sounds like there are still
> > a couple of things that need to be done (numbers 2 and 3). Shall I file
> > follow-up bugs for those?
> > 
> > Andrew, it sounds like the patch attached to this bug should land, but it
> > doesn't look like review has been requested. Can you proceed with trying to
> > get it landed?
> 
> I think point 2 (making it clear in the Telemetry crash ping that this is
> extension not web content) might be handled by bug 1361661 but I'd suggest
> checking in with Georg to make sure there's nothing they need from us.

Comment 9 suggests there isn't anything in bug 1361661 that will address item #2. Rob, do you know what changes are required to make this happen? I can try to make the changes if you can give me a clue about what needs to be done.
Flags: needinfo?(rhelmer)
(In reply to Bob Silverberg [:bsilverberg] from comment #10)
> (In reply to Robert Helmer [:rhelmer] from comment #8)
> > (In reply to Bob Silverberg [:bsilverberg] from comment #6)
> > > Thanks for doing the research into crash reporting, updating the bug and
> > > filing the Socorro bugs, Rob. From comment 5 it sounds like there are still
> > > a couple of things that need to be done (numbers 2 and 3). Shall I file
> > > follow-up bugs for those?
> > > 
> > > Andrew, it sounds like the patch attached to this bug should land, but it
> > > doesn't look like review has been requested. Can you proceed with trying to
> > > get it landed?
> > 
> > I think point 2 (making it clear in the Telemetry crash ping that this is
> > extension not web content) might be handled by bug 1361661 but I'd suggest
> > checking in with Georg to make sure there's nothing they need from us.
> 
> Comment 9 suggests there isn't anything in bug 1361661 that will address
> item #2. Rob, do you know what changes are required to make this happen? I
> can try to make the changes if you can give me a clue about what needs to be
> done.

I know that there is a "crash ping" that Telemetry sends, I believe it is documented here:

http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/crash-ping.rst

I think we just want to make sure that the RemoteType is also considered, so we can tell web content crashes apart from extension content crashes.

Other than that I'd defer to bsmedberg and Georg if there is anything else we need to do - addressing comment 9 with regard to how this will be monitored would be good. We could set up a simple dashboard that uses Telemetry perhaps, just need someone to watch it, ideally with some kind of alert that goes to a mailing list the addons team watches.

I could see for instance watching Telemetry for an elevated crash rate, and then using crash-stats for the extra details to determine the root cause.
Flags: needinfo?(rhelmer) → needinfo?(gfritzsche)
(In reply to Robert Helmer [:rhelmer] from comment #11)
> (In reply to Bob Silverberg [:bsilverberg] from comment #10)
> > (In reply to Robert Helmer [:rhelmer] from comment #8)
> > > (In reply to Bob Silverberg [:bsilverberg] from comment #6)
> > > > Thanks for doing the research into crash reporting, updating the bug and
> > > > filing the Socorro bugs, Rob. From comment 5 it sounds like there are still
> > > > a couple of things that need to be done (numbers 2 and 3). Shall I file
> > > > follow-up bugs for those?
> > > > 
> > > > Andrew, it sounds like the patch attached to this bug should land, but it
> > > > doesn't look like review has been requested. Can you proceed with trying to
> > > > get it landed?
> > > 
> > > I think point 2 (making it clear in the Telemetry crash ping that this is
> > > extension not web content) might be handled by bug 1361661 but I'd suggest
> > > checking in with Georg to make sure there's nothing they need from us.
> > 
> > Comment 9 suggests there isn't anything in bug 1361661 that will address
> > item #2. Rob, do you know what changes are required to make this happen? I
> > can try to make the changes if you can give me a clue about what needs to be
> > done.
> 
> I know that there is a "crash ping" that Telemetry sends, I believe it is
> documented here:
> 
> http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/
> docs/data/crash-ping.rst
> 
> I think we just want to make sure that the RemoteType is also considered, so
> we can tell web content crashes apart from extension content crashes.

Correct, bug 1361661 doesn't help with any crash metrics here.
@ddurst, can you answer or redirect on the crash ping?

We also need to make sure that other consumers of this data are aware of changes.
You can flag mdoglio for the crash-aggregates work.

What about the other crash metrics?
http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/concepts/crashes.html#child-process-crashes
Do we need to break the extension process out here?
Flags: needinfo?(gfritzsche)
We're going more towards Telemetry for monitoring rates/overall health and reports for deeper dives, as you said. If only because of that, we should be able to separate extension content crashes from web content crashes. It should behave as if it were its own processType -- even if it's identical in function and form to content process crashes. Then everything downstream using this data (be it Telemetry or crash reports) would be altered consistently to reflect processType.

We've talked about this briefly via email, but I'll NI gsvelto so that we can look into that more. It may merit its own bug, depending on the decision of how to handle it.
Flags: needinfo?(gsvelto)
Sorry for the late reply. I've dug our sources a little bit more to understand what the process' remote type actually meant and now I'm unsure whether we should give the web extension process it's own process type or not. We currently have three remote types for content processes: "web" and "file" (to distinguish between remote and locally stored web pages), and "extension".

I'm not sure what was the rationale behind the extension process re-using a content process with a different remote type; but in my eyes it signals the explicit intent of using a content process. To clarify what I mean, different processes (GPU, GMP plugin, etc...) explicitly declare a separate process type and they do so by also re-implementing the parent/child IPC machinery.

The web extension process doesn't seem to need this level of distinction, so the plan described by Robert in comment 5 seems like a reasonable approach. If there's consensus around this I'll be happy to implement the necessary changes (tests included).
Flags: needinfo?(gsvelto)
Rob, can you please reply to comment #14, above?
Flags: needinfo?(rhelmer)
(In reply to Bob Silverberg [:bsilverberg] from comment #15)
> Rob, can you please reply to comment #14, above?

I think aswan could better answer why we don't have a unique process type - I don't have strong opinions on it.
Flags: needinfo?(rhelmer) → needinfo?(aswan)
(In reply to Robert Helmer [:rhelmer] from comment #16)
> I think aswan could better answer why we don't have a unique process type -
> I don't have strong opinions on it.

Let me flip the question around -- why should the extension process have its own process type?  That would be a bunch of work, for what purpose?  That's a rhetorical question though, no need for an answer.  Lets just take :gsvelto up on the offer from the end of comment 14.
Flags: needinfo?(aswan) → needinfo?(gsvelto)
Alright, taking this. I'll be adding the necessary annotations both to the crash dumps and the crash pings. As a bonus this is going to allow us to distinguish between content process crashes from remote (web) and local (file) pages too.
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Flags: needinfo?(gsvelto)
Here's the patch. Benjamin, consider this also a data review since I'm adding a field to the crash ping (metadata.RemoteType). I've included documentation and tests.
Comment on attachment 8873387 [details]
Bug 1353168 - Add a crash annotation to distinguish between web, file and extension content processes;

https://reviewboard.mozilla.org/r/144820/#review149220

r+dr=me

I suspect we need a followup bug to get this data added to summary and crash aggregates derived data. Could you file that?
Attachment #8873387 - Flags: review?(benjamin) → review+
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9d26051a28ea
Add a crash annotation to distinguish between web, file and extension content processes; r=bsmedberg
(In reply to Benjamin Smedberg [:bsmedberg] from comment #21)
> I suspect we need a followup bug to get this data added to summary and crash
> aggregates derived data. Could you file that?

Yes, I'll do it right away.
Blocks: 1370178
Filed bug 1370178 for showing this data in the crash aggregates; I hope it's in the right component.
https://hg.mozilla.org/mozilla-central/rev/9d26051a28ea
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Product: Toolkit → WebExtensions
You need to log in before you can comment on or make changes to this bug.