Closed Bug 1352496 Opened 4 years ago Closed 4 years ago

CrashManager only permits (non-main) crash pings from content process types

Categories

(Toolkit :: Crash Reporting, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla55
Tracking Status
firefox54 --- verified
firefox55 --- verified

People

(Reporter: ddurst, Assigned: Dexter)

References

(Blocks 1 open bug)

Details

(Whiteboard: [measurement:client])

Attachments

(2 files, 1 obsolete file)

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1293656#c10, only PROCESS_TYPE_CONTENT crash pings are being sent. [https://dxr.mozilla.org/mozilla-central/source/toolkit/components/crashes/CrashManager.jsm#472]

Which means that anything else (PROCESS_TYPE_GPU, for example) is not.

Leaving it up to someone more responsible to decide whether we should discretely add process types to control inclusion, or to just accept all things that are not main or null/undefined.
NI Alessio, since I know he's been looking at (admittedly some different) parts of this machinery lately. Alessio -- I would say we should patch this and uplift it as far as we can. Is that something you can do easily in gsvelto's absence?
Flags: needinfo?(alessio.placitelli)
Assignee: nobody → alessio.placitelli
Flags: needinfo?(alessio.placitelli)
Benjamin mentioned that Gabriele added this constraint deliberately. Should we still change it? I have no background for this functionality, so I might be missing some important piece of the puzzle
Flags: needinfo?(ddurst)
Attachment #8854004 - Flags: review?(ted)
It was added deliberately when we first introduced the content process (https://bugzilla.mozilla.org/show_bug.cgi?id=1293656#c10). We absolutely should change it, the question is just about how.

Should we:

a) introduce each process type's inclusion to sending its crash ping there, so that they're gated in one place until we're satisfied with them and then refactor it to be all process types that are not main?

b) just constrain it to not main and that way anything that is a valid declared type could be included?

c) add some more process type control so that we could easily toggle those off if necessary (that's scope creep, but seemed worth suggesting)?
Flags: needinfo?(ddurst)
(Oh, and I should add that Milan needs GPU process crashes yesterday, which is the driver for this.)
Realistically I think we probably should send all process types. I imagine the following concerns:

* bandwidth/storage: make sure that this won't blow telemetry budgets (P1)
* analysis: make sure that our various aggregates and analyses are expecting this new data (P1)
* quality control:
** make sure that this actually works for GPU/plugin/GMP (P1)
** and the submitted data is reasonably correct (P1)
** and that it doesn't break stuff in the browser (P1)
* analysis part 2: make sure that our aggregates are actively presenting this data (P2)

In terms of solving the immediate problem, can we do this for GPU right away and then come back to plugins? There are a lot of plugin crashes, and I'm not sure we care about them much. Making Flash CtP will dramatically reduce that crash rate where it won't affect budget as much.
Flags: needinfo?(ddurst)
Flags: needinfo?(alessio.placitelli)
Comment on attachment 8854004 [details]
Bug 1352496 - Enable sending other child processes crash pings from the CrashManager.

https://reviewboard.mozilla.org/r/126002/#review128656

Do we know if `addCrash` gets called for crashes from these other process types? Is this a thing that happens by default for new process types or does someone need to wire it up? Given the discussion of new process types we should probably get these sorts of things documented so people don't miss steps and not get crash data when they were expecting to.

::: toolkit/components/crashes/tests/xpcshell/test_crash_manager.js:529
(Diff revision 2)
> +      [["payload", "crashId"], id],
> +    ]);
> +    Assert.ok(!found, "No telemetry ping must be submitted for invalid process types");
> +  }
> +});
> +/*

Did you intend to leave this test commented out?
Attachment #8854004 - Flags: review?(ted) → review+
(In reply to Benjamin Smedberg [:bsmedberg] from comment #7)
> Realistically I think we probably should send all process types. I imagine
> the following concerns:
> 
> * bandwidth/storage: make sure that this won't blow telemetry budgets (P1)
> * analysis: make sure that our various aggregates and analyses are expecting
> this new data (P1)
> * quality control:
> ** make sure that this actually works for GPU/plugin/GMP (P1)
> ** and the submitted data is reasonably correct (P1)
> ** and that it doesn't break stuff in the browser (P1)
> * analysis part 2: make sure that our aggregates are actively presenting
> this data (P2)
> 
> In terms of solving the immediate problem, can we do this for GPU right away
> and then come back to plugins? There are a lot of plugin crashes, and I'm
> not sure we care about them much. Making Flash CtP will dramatically reduce
> that crash rate where it won't affect budget as much.

Clearing my ni?, as ddurst is probably the best person to answer this :)
Flags: needinfo?(alessio.placitelli)
Whiteboard: [measurement:client]
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #7)
> > Realistically I think we probably should send all process types. I imagine
> > the following concerns:
> > 
> > * bandwidth/storage: make sure that this won't blow telemetry budgets (P1)
> > * analysis: make sure that our various aggregates and analyses are expecting
> > this new data (P1)
> > * quality control:
> > ** make sure that this actually works for GPU/plugin/GMP (P1)
> > ** and the submitted data is reasonably correct (P1)
> > ** and that it doesn't break stuff in the browser (P1)
> > * analysis part 2: make sure that our aggregates are actively presenting
> > this data (P2)
> > 
> > In terms of solving the immediate problem, can we do this for GPU right away
> > and then come back to plugins? There are a lot of plugin crashes, and I'm
> > not sure we care about them much. Making Flash CtP will dramatically reduce
> > that crash rate where it won't affect budget as much.

I agree -- we should enable GPU explicitly right now and complete the analysis (both phases) listed above.

Then we should do a thorough look through the bandwidth and repeat for other types once verified OK.
Flags: needinfo?(ddurst)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #8)
> Do we know if `addCrash` gets called for crashes from these other process
> types? Is this a thing that happens by default for new process types or does
> someone need to wire it up? Given the discussion of new process types we
> should probably get these sorts of things documented so people don't miss
> steps and not get crash data when they were expecting to.

I assumed that the specific case for PROCESS_TYPE_CONTENT proved that it did, but that's probably too trusting.

Milan -- is this something your team would have wired, or are you using default behavior (in which case, I will track it down)?
Flags: needinfo?(milan)
Comment on attachment 8854004 [details]
Bug 1352496 - Enable sending other child processes crash pings from the CrashManager.

https://reviewboard.mozilla.org/r/126002/#review128656

Given the conversation on the bug, I've changed CrashManager.jsm to only allow content and GPU crashes.

> Did you intend to leave this test commented out?

Whoops, sorry, no. Leftover :-)
David, question in comment 11.
Flags: needinfo?(milan) → needinfo?(dvander)
(In reply to David Durst [:ddurst] from comment #11)
> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #8)
> > Do we know if `addCrash` gets called for crashes from these other process
> > types? Is this a thing that happens by default for new process types or does
> > someone need to wire it up? Given the discussion of new process types we
> > should probably get these sorts of things documented so people don't miss
> > steps and not get crash data when they were expecting to.
> 
> I assumed that the specific case for PROCESS_TYPE_CONTENT proved that it
> did, but that's probably too trusting.
> 
> Milan -- is this something your team would have wired, or are you using
> default behavior (in which case, I will track it down)?

If I understand what you're asking correctly, AddCrash is called unconditionally when crashes are reported in Gecko:

http://searchfox.org/mozilla-central/rev/381a7b8f8a78b481336cfa384cb0a5536e217e4a/ipc/glue/CrashReporterHost.cpp#160
Flags: needinfo?(dvander)
This should answer ted's question from #c8. Alessio, it's all you.
(In reply to David Durst [:ddurst] from comment #16)
> This should answer ted's question from #c8. Alessio, it's all you.

Thanks David. By looking at what :dvander said, I noticed that a few lines above [1] Gabriele added this:

>    if (mProcessType == nsICrashService::PROCESS_TYPE_CONTENT) {
>      CrashReporter::RunMinidumpAnalyzer(mChildDumpID);
>    }

Which, by looking at [2], seems to "extract the stack traces from the dump and store them in JSON format as an annotation in the extra file associated with the crash". Is this something that we need to do for the GPU process in this bug?

[1] - http://searchfox.org/mozilla-central/rev/381a7b8f8a78b481336cfa384cb0a5536e217e4a/ipc/glue/CrashReporterHost.cpp#144
[2] - http://searchfox.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#3096
Flags: needinfo?(ddurst)
I assume so, based on the content process crash handling being the prototype for new/additional process crash handling.

NI ted to confirm that [2] would be necessary for GPU (any, really) process crashes as well. I believe the original intent holds for all non-main process crashes (https://bugzilla.mozilla.org/show_bug.cgi?id=1293656#c11).
Flags: needinfo?(ddurst) → needinfo?(ted)
Yes, I had forgotten about that, but we'll need to do that. The minidump-analyzer is what produces the JSON containing the threads and modules etc that we submit in the crash ping.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #19)
> Yes, I had forgotten about that, but we'll need to do that. The
> minidump-analyzer is what produces the JSON containing the threads and
> modules etc that we submit in the crash ping.

No problem, I've updated the patch to only allow crash pings on content + GPU crashes and made it run the minidum-analyzer for gpu crashes as well! Does that need a new review?

@Benjamin, I'm flagging you because I'm assuming this needs a data review. I've updated the documentation bits to mention that the crash ping is sent for the gpu process as well.
Flags: needinfo?(ted)
Comment on attachment 8854004 [details]
Bug 1352496 - Enable sending other child processes crash pings from the CrashManager.

https://reviewboard.mozilla.org/r/126002/#review131616

data-r=me
Attachment #8854004 - Flags: review?(benjamin) → review+
(In reply to Alessio Placitelli [:Dexter] from comment #21)
> No problem, I've updated the patch to only allow crash pings on content +
> GPU crashes and made it run the minidum-analyzer for gpu crashes as well!
> Does that need a new review?

If those are the only changes then you can carry forward my r+.
Flags: needinfo?(ted)
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/0788b396df3c
Enable sending other child processes crash pings from the CrashManager. r=bsmedberg,ted
https://hg.mozilla.org/mozilla-central/rev/0788b396df3c
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Hi :milan!

Before flagging for QA validation on this bug, could you suggest any way to trigger a "GPU crash" in Firefox? Is this even possible?
Flags: needinfo?(milan)
Nightly has a "Terminate GPU Process" button in about:support.  The other option is to look up the GPU process ID (also in about:support), find it in the task manager, and terminate the process there.
Flags: needinfo?(milan)
Terminating the GPU process from the task manager doesn't cause it to crash and won't trigger a crash report. I don't know how "terminate GPU process" works so I can't say whether that triggers a crash or not.

But Alessio you should be able to use crashfirefox.exe to trigger a crash; you'll need to specify the PID of the GPU process.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #28)
> Terminating the GPU process from the task manager doesn't cause it to crash
> and won't trigger a crash report. I don't know how "terminate GPU process"
> works so I can't say whether that triggers a crash or not.
> 
> But Alessio you should be able to use crashfirefox.exe to trigger a crash;
> you'll need to specify the PID of the GPU process.

Grand! Thanks Benjamin, I was getting crazy, "crashfirefox.exe" finally did the trick and I can see crash pings being sent.

@Abe, could you please validate this bug? Let's make sure that:

- Crash pings are generated when a GPU crash happens. They should be sent when Firefox restarts.
- They validate correctly against our schema

To crash the GPU process you need to build "crashfirefox.exe" from [1]. After its built, you want to execute "crashfirefox.exe <gpu pid>".
To get the PID of the GPU process, go to about:support and look for the "GPUProcessPid" entry: the number is the PID.

Please note that GPU acceleration might not be available if you're running Firefox in a virtual machine (didn't work for me in Firefox on Ubuntu in a VM).

[1] - https://github.com/bsmedberg/crashfirefox-intentionally
[2] - https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/telemetry/crash.schema.json
Flags: needinfo?(amasresha)
Prebuilt crashfirefox binary is available at https://ftp.mozilla.org/pub/utilities/crashfirefox-intentionally/
(In reply to Benjamin Smedberg [:bsmedberg] from comment #30)
> Prebuilt crashfirefox binary is available at
> https://ftp.mozilla.org/pub/utilities/crashfirefox-intentionally/

Benjamin, what tools do you suggest to crash the GPU process in Mac and Linux?
Flags: needinfo?(amasresha) → needinfo?(benjamin)
n.b.: killing the GPU process with SIGABRT on OS X won't currently generate a crash report, but bug 1350917 landed on autoland yesterday so it should be in Nightly soon.
I have tested this on 64 bits of Windows-10 with 32 bits of Nightly and it works as expected. Here are the test cases: https://public.etherpad-mozilla.org/p/Bug_1352496#lineNumber=151

However, I have two concerns:
First, it seems that we do not support GPU processes in Mac and Linux. So the testing will not cover those operating systems.

Second, "crashfirefox.exe" only crashes 32 bits of Firefox, not 64 bits of Firefox. So what other tool do you recommend to crash the GPU process in 64 bits of Firefox? 
Thanks
Flags: needinfo?(alessio.placitelli)
(In reply to Abe - QA (:Abe_LV) from comment #34)
> I have tested this on 64 bits of Windows-10 with 32 bits of Nightly and it
> works as expected. Here are the test cases:
> https://public.etherpad-mozilla.org/p/Bug_1352496#lineNumber=151
> 
> However, I have two concerns:
> First, it seems that we do not support GPU processes in Mac and Linux. So
> the testing will not cover those operating systems.

Ok! Good!

> Second, "crashfirefox.exe" only crashes 32 bits of Firefox, not 64 bits of
> Firefox. So what other tool do you recommend to crash the GPU process in 64
> bits of Firefox? 
> Thanks

I've compiled the "crashfirefox.exe" executable on my own (x64), and it correctly crashes my x64 Firefox.
I've uploaded it for you here: https://drive.google.com/a/mozilla.com/file/d/0B1RACVTBwZ94VENQbFRDbTlkMHM/view?usp=sharing
Flags: needinfo?(alessio.placitelli)
Verified-fixed.

We have completed testing on windows 10 with 32 bits and 64 bits of latest nightly, and all is green.
Test cases and runs are here:https://public.etherpad-mozilla.org/p/Bug_1352496
Please needinfo or email me if you have questions about the testing. Thanks
Status: RESOLVED → VERIFIED
Comment on attachment 8854004 [details]
Bug 1352496 - Enable sending other child processes crash pings from the CrashManager.

Approval Request Comment
[User impact if declined]: The GFX team won't have access to crash ping Telemetry triggered by GPU crashes
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]: 
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Low risk
[Why is the change risky/not risky?]: The changes to enable GPU code are trivial and could only compromise sending GPU crash pings, which were not available prior to this patch. The biggest chunk of the patch is to improve automated test coverage.
[String changes made/needed]: None
Attachment #8854004 - Flags: approval-mozilla-aurora?
Comment on attachment 8854004 [details]
Bug 1352496 - Enable sending other child processes crash pings from the CrashManager.

54 was merged to Beta today.
Attachment #8854004 - Flags: approval-mozilla-aurora? → approval-mozilla-beta?
(In reply to [Out of Office Until 24-April] Ryan VanderMeulen [:RyanVM] from comment #38)
> Comment on attachment 8854004 [details]
> Bug 1352496 - Enable sending other child processes crash pings from the
> CrashManager.
> 
> 54 was merged to Beta today.

Whoops, thanks Ryan.
Comment on attachment 8854004 [details]
Bug 1352496 - Enable sending other child processes crash pings from the CrashManager.

Fix a crash ping telemetry issue and was verified. Beta54+.
Attachment #8854004 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
needs rebasing for beta because of problems like :

warning: conflicts while merging toolkit/components/crashes/tests/xpcshell/test_crash_manager.js! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
Flags: needinfo?(alessio.placitelli)
Flags: needinfo?(alessio.placitelli)
Attachment #8859479 - Attachment is patch: true
See Also: → 1324801
Blocks: 1310716
Not sure if tied to the eslint failures or not, but the test was failing too.
https://treeherder.mozilla.org/logviewer.html#?job_id=92863062&repo=mozilla-beta
Gah, sorry, I made a mistake during the rebase, I've tested this locally and it works.

Yes, the eslint failures were related. Sorry for the mess.

Tomcat, would you mind relanding this patch on beta?
Attachment #8859479 - Attachment is obsolete: true
Flags: needinfo?(alessio.placitelli) → needinfo?(cbook)
Alessio, should Release QA manually verify this on Beta 54 as well?
Flags: needinfo?(alessio.placitelli)
(In reply to Andrei Vaida [:avaida], Desktop Release QA – please ni? me from comment #48)
> Alessio, should Release QA manually verify this on Beta 54 as well?

David, do you think that's useful?
Flags: needinfo?(alessio.placitelli) → needinfo?(ddurst)
Yes! I'm always in favor of verification.
Flags: needinfo?(ddurst)
(In reply to David Durst [:ddurst] from comment #50)
> Yes! I'm always in favor of verification.

Roger that. We can use Comment 36 for context.
Flags: qe-verify+
Verified as fixed using Firefox 54 beta 7 32- and 64-bit under Win 10 64-bit.
You need to log in before you can comment on or make changes to this bug.