Open Bug 1447161 Opened 6 years ago Updated 2 years ago

Low/no gmplugin crashes being reported on Windows

Categories

(Toolkit :: Crash Reporting, enhancement)

People

(Reporter: wlach, Unassigned)

Details

According to error aggregates (https://docs.telemetry.mozilla.org/datasets/streaming/error_aggregates/reference.html), we have seen almost no gmplugin crashes on Windows lately:

https://sql.telemetry.mozilla.org/queries/52125/source

The code to gather these crashes is quite straightforward:

https://github.com/mozilla/telemetry-streaming/blob/545d1685c4d675f150344d7109c3a8dc0f0f79db/src/main/scala/com/mozilla/telemetry/streaming/ErrorAggregator.scala#L276

We do, however, still see crashes on other platforms, for example Mac:

https://sql.telemetry.mozilla.org/queries/52127/source

While I would like to believe that our Windows code is now so robust that it doesn't crash anymore, I also worry that something might be wrong here.
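
One way to make the cross-platform comparison above concrete is to normalize crash counts by usage hours and flag any platform whose rate is far below the rest. The following is only a minimal sketch: the field names (`os_name`, `usage_hours`, `gmplugin_crashes`) are assumed to resemble the error aggregates columns, the rows are made-up illustrative values rather than real telemetry, and the 10% threshold is arbitrary.

```python
from collections import defaultdict

# Hypothetical rows shaped like an error-aggregates export. The field names
# and all values are illustrative assumptions, not real telemetry.
ROWS = [
    {"os_name": "Windows_NT", "usage_hours": 9000.0, "gmplugin_crashes": 0},
    {"os_name": "Windows_NT", "usage_hours": 1000.0, "gmplugin_crashes": 1},
    {"os_name": "Darwin", "usage_hours": 500.0, "gmplugin_crashes": 20},
    {"os_name": "Linux", "usage_hours": 250.0, "gmplugin_crashes": 5},
]


def crash_rates(rows):
    """Return gmplugin crashes per 1000 usage hours, keyed by OS name."""
    hours = defaultdict(float)
    crashes = defaultdict(int)
    for row in rows:
        hours[row["os_name"]] += row["usage_hours"]
        crashes[row["os_name"]] += row["gmplugin_crashes"]
    return {
        os_name: 1000.0 * crashes[os_name] / hours[os_name]
        for os_name in hours
        if hours[os_name] > 0
    }


def suspicious_platforms(rates, ratio=0.1):
    """Flag platforms whose rate is below `ratio` times the (upper) median rate."""
    ordered = sorted(rates.values())
    median = ordered[len(ordered) // 2]
    return sorted(name for name, rate in rates.items() if rate < ratio * median)


if __name__ == "__main__":
    rates = crash_rates(ROWS)
    for os_name, rate in sorted(rates.items()):
        print(f"{os_name}: {rate:.2f} gmplugin crashes per 1000 usage hours")
    print("suspiciously low:", suspicious_platforms(rates))
```

A platform being flagged here does not prove that reporting is broken; it only says the rate is low enough relative to the other platforms to deserve a second look.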

I talked a bit in #media about this and (at least as of this writing, discussion ongoing) no one seemed quite sure why this would be the case.

https://mozilla.logbot.info/media/20180319#c14481974

Gabriel, you touched the reporting code recently (https://searchfox.org/mozilla-central/rev/877c99c523a054419ec964d4dfb3f0cadac9d497/ipc/glue/CrashReporterHost.cpp#267). Might you have any ideas what's happening?
Flags: needinfo?(gsvelto)
Some further discussion on #media, potentially useful:

10:15:46 <wlach> we seem to be lacking people with an intersection of understanding between the telemetry/crash reporting side and the gmplugin side
10:16:12 <drno> wlach: indeed that is probably a resource close to zero
10:16:41 <drno> wlach: two questions come to my mind
10:17:10 <drno> 1) is the sandbox on windows different than on the other OSes?
10:17:58 <drno> 2) do we start a separate process for the GMP plugin on Windows? I only tested on Mac, and since I’m traveling I don’t have a Win machine at hand
10:18:01 <jld> Sandboxing is pretty much completely separate on different OSes.
10:19:29 <jld> And GMPs are always their own processes as I understand it.
10:20:31 <jld> Also, the internals of crash reporting are probably pretty different across OSes, especially Windows vs. Unix/Mac.
And yet more:

10:22:56 <drno> wlach: have a look at this https://sql.telemetry.mozilla.org/queries/52126/source
10:23:39 <drno> looks like GMP crash reporting on Win was working up till 53, then stopped working and now is back since 58.0.2
10:24:06 <wlach> drno: the 58.0.2 column is probably just one ping
10:24:21 <wlach> drno: and based on my experience, it is likely bogus
10:24:42 <wlach> but I would agree with the working up to 53 part :)
10:24:44 <drno> wlach: fair enough. So it’s broken since 54 then :)
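
The reasoning in the exchange above (reports present up to 53, absent from 54 onward, with the 58.0.2 column discounted as a stray ping) can be sketched as a small helper that sums counts per major version and returns the last healthy and first silent major version. This is only an illustration: the counts below are invented, and the `min_count` cutoff used to ignore stray pings is an arbitrary assumption.

```python
# Hypothetical per-version counts of Windows gmplugin crashes.
# All values are invented for illustration; they are not query results.
COUNTS = {
    "51.0": 120,
    "52.0": 98,
    "53.0": 101,
    "54.0": 0,
    "55.0": 0,
    "57.0": 0,
    "58.0.2": 1,  # a single, possibly bogus, ping
}


def major_version(version):
    """Extract the major version number from a string such as '58.0.2'."""
    return int(version.split(".")[0])


def first_silent_major(counts, min_count=5):
    """Return (last_healthy_major, first_silent_major), or None if no drop is found.

    A major version counts as healthy when its summed count is at least
    `min_count`; anything below that is treated as silent, so that stray
    pings do not look like working crash reporting.
    """
    totals = {}
    for version, count in counts.items():
        major = major_version(version)
        totals[major] = totals.get(major, 0) + count

    last_healthy = None
    for major in sorted(totals):
        if totals[major] >= min_count:
            last_healthy = major
        elif last_healthy is not None:
            return last_healthy, major
    return None


if __name__ == "__main__":
    print(first_silent_major(COUNTS))
```

With the invented counts above, the helper reports 53 as the last healthy major version and 54 as the first silent one, matching the conclusion reached in the chat.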
I've made a bunch of changes to our crash reporting machinery in the last year, so I might have broken something myself. I'm currently on parental leave but I'll try to look at this ASAP. Leaving the NI? for now.
Quick update: I'm on parental leave for another week, but I should have some time to look into this today or tomorrow. In the meantime, I was wondering whether it would be useful to enable crash pings for GMP crashes. It's just a matter of adding the relevant type in [1] for crash pings to be sent. That could be useful not only for measuring crash rates but also as a redundant data source in case crash submission isn't working properly, as seems to be the case here. Note that crash pings now carry a fairly rich amount of information, including raw stack traces and the list of loaded modules. If there's interest, I can enable them and ensure that they're processed correctly on our back-ends.

[1] https://searchfox.org/mozilla-central/rev/8220783953b0311e1d64c2366f732a159f05ed7e/toolkit/components/crashes/CrashManager.jsm#473
I just tried manually crashing the plugin, and the crash submission toolbar showed up correctly. I clicked the submit button, and here is the resulting report:

https://crash-stats.mozilla.com/report/index/55259020-a528-4b73-9f46-a6b760180325

So crash submission seems to be working fine.

However, I just realized that comment 0 specifically referred to gmplugin crashes recorded in the telemetry data gathered from the main ping. I haven't touched that code, so I'll have to dig a little deeper. Enabling gmplugin crash pings as per comment 4 would still be a good idea, though. One other thing comes to mind: Chris found a significant discrepancy between the content process crashes recorded in the main ping and the actual content crash pings. Chris, do you think we might be seeing a similar problem here, with gmplugin crashes being under-counted?
Flags: needinfo?(gsvelto) → needinfo?(chutten)
With no conclusion (yet) to bug 1413172, I don't have a theory about the mechanism that makes these counts disagree. However, whatever is causing that doesn't care which OS the system is running, so I don't think it's likely to be the same mechanism that prevents gmplugin crashes from being counted only on Windows.

But I'll keep an eye out for the possibility while I keep at it.
Flags: needinfo?(chutten)
Severity: normal → S3